
Using LLMs to Write E2E Tests: Generate Production-Quality Test Suites in Minutes

GPT-4 and Claude can generate complete Playwright test suites from natural language descriptions. But do AI-generated tests actually work in production? This guide explores the reality, limitations, and best practices of using LLMs for test automation.

13 min read

"Write comprehensive Playwright tests for user authentication including login, signup, password reset, and edge cases."

You press Enter. Ten seconds later, GPT-4 outputs 300 lines of working test code covering 15 scenarios you hadn't even thought of. You copy-paste it. It runs. It passes. You just saved 4 hours of work.

This isn't science fiction—it's 2027.

But here's what they don't tell you: Those tests fail next month when the UI changes. The AI missed a critical security edge case. The generated code has subtle race conditions that make tests flaky. And you have no idea what the tests actually validate because you didn't write them.

LLMs can write tests faster than humans, but they can't replace QA thinking.

This guide shows you how to leverage LLMs to dramatically accelerate test creation while avoiding the pitfalls that make AI-generated tests a maintenance nightmare.

What LLMs Are Actually Good At

graph LR
    A[LLM Strengths] --> B[Pattern Recognition]
    A --> C[Code Generation]
    A --> D[Boilerplate]
    A --> E[Common Scenarios]

    F[LLM Weaknesses] --> G[Domain Context]
    F --> H[Edge Cases]
    F --> I[Business Logic]
    F --> J[Strategic Thinking]

    style A fill:#c5e1a5
    style F fill:#ffccbc

    B --> K[✅ Recognizes test patterns<br/>from training data]
    C --> L[✅ Generates syntactically<br/>correct code]
    D --> M[✅ Writes setup/teardown<br/>boilerplate]
    E --> N[✅ Covers happy path &<br/>obvious errors]

    G --> O[❌ Doesn't know your<br/>specific app]
    H --> P[❌ Misses subtle<br/>edge cases]
    I --> Q[❌ Can't understand<br/>business requirements]
    J --> R[❌ Can't prioritize<br/>what to test]

Strength vs Weakness Comparison

| Task | LLM Performance | Why |
| --- | --- | --- |
| Generate basic CRUD tests | ★★★★★ Excellent | Pattern well known from training data |
| Write test boilerplate | ★★★★★ Excellent | Repetitive structure, clear patterns |
| Cover happy path | ★★★★☆ Very Good | Obvious scenarios, standard flows |
| Add common validations | ★★★★☆ Very Good | Trained on best practices |
| Generate edge cases | ★★★☆☆ Moderate | Generic edges; misses domain-specific cases |
| Test security vulnerabilities | ★★☆☆☆ Poor | Requires security domain knowledge |
| Domain-specific testing | ★★☆☆☆ Poor | No context about your app |
| Strategic test prioritization | ★☆☆☆☆ Very Poor | Can't assess business risk |

The LLM Test Generation Workflow

graph TD
    A[Feature Requirement] --> B[Human: Define Test Strategy]
    B --> C[Human: Write Prompt]
    C --> D[LLM: Generate Tests]
    D --> E[Human: Code Review]
    E --> F{Quality Check}

    F -->|Good| G[Human: Add Edge Cases]
    F -->|Issues| H[Human: Refine Prompt]
    H --> D

    G --> I[Human: Add Assertions]
    I --> J[Run Tests]
    J --> K{Tests Pass?}

    K -->|Yes| L[Human: Exploratory Testing]
    K -->|No| M[Debug & Fix]
    M --> J

    L --> N[Commit Tests]
    N --> O[LLM: Generate Documentation]

    style B fill:#bbdefb
    style C fill:#bbdefb
    style E fill:#bbdefb
    style G fill:#bbdefb
    style I fill:#bbdefb
    style L fill:#bbdefb

Implementation: AI Test Generator

1. AI-Powered QA Test Generation Techniques

// llm-test-generator.ts
interface TestGenerationPrompt {
  feature: string;
  userStory: string;
  acceptanceCriteria: string[];
  technicalContext: {
    framework: 'playwright' | 'cypress' | 'selenium';
    language: 'typescript' | 'javascript';
    pageObjects: string[];
  };
  existingTests?: string; // For context
}

class LLMTestGenerator {
  private apiKey: string;

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  async generateTests(prompt: TestGenerationPrompt): Promise<string> {
    const systemPrompt = this.buildSystemPrompt();
    const userPrompt = this.buildUserPrompt(prompt);

    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: 'gpt-4-turbo',
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: userPrompt },
        ],
        temperature: 0.3, // Lower temperature for more consistent code
        max_tokens: 4000,
      }),
    });

    if (!response.ok) {
      throw new Error(`OpenAI API request failed with status ${response.status}`);
    }

    const data = await response.json();
    return this.extractCode(data.choices[0].message.content);
  }

  private buildSystemPrompt(): string {
    return `You are an expert QA engineer specializing in end-to-end test automation.

Your task is to generate comprehensive, production-ready Playwright tests in TypeScript.

CRITICAL REQUIREMENTS:
1. Use ONLY getByRole, getByLabel, getByText (accessible selectors)
2. NEVER use CSS selectors or XPath unless absolutely necessary
3. Add explicit waits (waitForLoadState, waitForResponse) not waitForTimeout
4. Include meaningful error messages in assertions
5. Follow AAA pattern (Arrange, Act, Assert)
6. Add comments explaining complex test logic
7. Use page object pattern when dealing with multiple pages
8. Consider accessibility, performance, and edge cases
9. Add test.describe blocks for logical grouping
10. Each test must be independent and not rely on others

BEST PRACTICES:
- Use descriptive test names that explain expected behavior
- Add beforeEach hooks for common setup
- Use test.fixme() or test.skip() with explanations when needed
- Include both positive and negative test cases
- Test error states and validation messages
- Consider responsive design and different viewport sizes`;
  }

  private buildUserPrompt(prompt: TestGenerationPrompt): string {
    const { feature, userStory, acceptanceCriteria, technicalContext, existingTests } = prompt;

    return `Generate comprehensive E2E tests for the following feature:

FEATURE: ${feature}

USER STORY:
${userStory}

ACCEPTANCE CRITERIA:
${acceptanceCriteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

TECHNICAL CONTEXT:
- Framework: ${technicalContext.framework}
- Language: ${technicalContext.language}
- Available Page Objects: ${technicalContext.pageObjects.join(', ')}

${existingTests ? `EXISTING TESTS (for context):\n\`\`\`typescript\n${existingTests}\n\`\`\`` : ''}

Generate tests that:
1. Cover all acceptance criteria
2. Include edge cases and error scenarios
3. Test accessibility (keyboard navigation, screen reader support)
4. Validate error messages and loading states
5. Are maintainable and follow best practices

Return ONLY the test code, no explanations.`;
  }

  private extractCode(content: string): string {
    // Extract code from markdown code blocks
    const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
    return match ? match[1] : content;
  }
}
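The `extractCode` step is easy to get wrong, so here it is isolated as a runnable sketch (the sample response string is invented for illustration):

```typescript
// Standalone sketch of the extractCode step: pull the first fenced
// TypeScript/JavaScript block out of an LLM chat response, falling back
// to the raw content when no fence is found.
function extractCode(content: string): string {
  const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
  return match ? match[1] : content;
}

const response = [
  "Here are the tests you asked for:",
  "```typescript",
  "test('logs in', async ({ page }) => {});",
  "```",
  "Let me know if you need more scenarios.",
].join("\n");

console.log(extractCode(response));
// → test('logs in', async ({ page }) => {});
```

Falling back to the raw content matters in practice: models occasionally return bare code with no fence, and silently dropping it would waste the whole generation call.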

2. Intelligent Test Refinement

// test-refiner.ts
interface TestQualityAnalysis {
  score: number; // 0-100
  issues: Array<{
    severity: 'critical' | 'high' | 'medium' | 'low';
    type: string;
    description: string;
    suggestion: string;
  }>;
  strengths: string[];
}

class TestQualityAnalyzer {
  analyzeGeneratedTest(testCode: string): TestQualityAnalysis {
    const issues: TestQualityAnalysis['issues'] = [];
    const strengths: string[] = [];

    // Check for brittle selectors
    if (testCode.includes('.click()') && !testCode.includes('getByRole')) {
      issues.push({
        severity: 'high',
        type: 'brittle_selector',
        description: 'Using non-semantic selectors',
        suggestion: 'Replace with getByRole, getByLabel, or getByText for better maintainability',
      });
    } else {
      strengths.push('Uses semantic, accessible selectors');
    }

    // Check for hardcoded waits
    const hardcodedWaits = (testCode.match(/waitForTimeout\(/g) || []).length;
    if (hardcodedWaits > 0) {
      issues.push({
        severity: 'critical',
        type: 'flaky_wait',
        description: `Found ${hardcodedWaits} hardcoded wait(s)`,
        suggestion: 'Replace waitForTimeout with waitForLoadState or waitForSelector',
      });
    } else {
      strengths.push('Uses explicit waits instead of sleep/timeout');
    }

    // Check for meaningful assertions
    const assertions = (testCode.match(/expect\(/g) || []).length;
    if (assertions < 2) {
      issues.push({
        severity: 'high',
        type: 'weak_assertions',
        description: 'Too few assertions',
        suggestion: 'Add more assertions to validate expected behavior',
      });
    } else {
      strengths.push(`Contains ${assertions} assertions`);
    }

    // Check for test independence
    if (!testCode.includes('beforeEach') && testCode.split('test(').length > 3) {
      issues.push({
        severity: 'medium',
        type: 'missing_setup',
        description: 'Multiple tests without beforeEach setup',
        suggestion: 'Extract common setup to beforeEach hook',
      });
    }

    // Check for error handling
    if (testCode.includes('try {')) {
      strengths.push('Includes error handling');
    }

    // Check for accessibility testing
    if (testCode.includes('getByRole') || testCode.includes('getByLabel')) {
      strengths.push('Uses accessibility-first selectors');
    }

    // Calculate score
    const criticalCount = issues.filter((i) => i.severity === 'critical').length;
    const highCount = issues.filter((i) => i.severity === 'high').length;
    const mediumCount = issues.filter((i) => i.severity === 'medium').length;

    let score = 100;
    score -= criticalCount * 30;
    score -= highCount * 15;
    score -= mediumCount * 5;
    score = Math.max(0, score);

    return { score, issues, strengths };
  }

  async refineTest(testCode: string, analysis: TestQualityAnalysis): Promise<string> {
    if (analysis.score >= 80) {
      return testCode; // Good enough
    }

    // Use LLM to fix issues
    const generator = new LLMTestGenerator(process.env.OPENAI_API_KEY!);

    const refinementPrompt = `
Refine the following Playwright test to fix these issues:

${analysis.issues.map((issue) => `- [${issue.severity}] ${issue.description}: ${issue.suggestion}`).join('\n')}

ORIGINAL TEST:
\`\`\`typescript
${testCode}
\`\`\`

Return the improved test code that addresses all issues. Maintain the same test coverage.
`;

    return await generator.generateTests({
      feature: 'Test Refinement',
      userStory: refinementPrompt,
      acceptanceCriteria: analysis.issues.map((i) => i.suggestion),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: [],
      },
    });
  }
}
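As a worked example of the scoring rule above: one critical issue and one high issue yield 100 - 30 - 15 = 55, which falls below the 70 threshold that triggers refinement.

```typescript
// Worked example of the quality-score arithmetic from analyzeGeneratedTest:
// one critical issue (a hardcoded wait) and one high issue (too few assertions).
type Severity = "critical" | "high" | "medium" | "low";

const issues: { severity: Severity }[] = [
  { severity: "critical" }, // waitForTimeout found
  { severity: "high" },     // fewer than 2 assertions
];

const count = (sev: Severity) => issues.filter((i) => i.severity === sev).length;

let score = 100 - count("critical") * 30 - count("high") * 15 - count("medium") * 5;
score = Math.max(0, score); // clamp so many issues can't go negative

console.log(score); // → 55, low enough that refineTest would run
```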

3. Context-Aware Test Generation

// context-aware-generator.ts
interface AppContext {
  pageStructure: Record<string, string[]>; // page -> elements
  apiEndpoints: string[];
  authRequired: boolean;
  userRoles: string[];
}

class ContextAwareTestGenerator {
  private generator: LLMTestGenerator;
  private analyzer: TestQualityAnalyzer;

  constructor(apiKey: string) {
    this.generator = new LLMTestGenerator(apiKey);
    this.analyzer = new TestQualityAnalyzer();
  }

  async generateWithContext(
    feature: string,
    userStory: string,
    context: AppContext,
  ): Promise<{ code: string; quality: TestQualityAnalysis }> {
    // Enrich prompt with application context
    const enrichedPrompt: TestGenerationPrompt = {
      feature,
      userStory,
      acceptanceCriteria: this.extractAcceptanceCriteria(userStory),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: Object.keys(context.pageStructure),
      },
      existingTests: this.generateContextExample(context),
    };

    // Generate tests
    let testCode = await this.generator.generateTests(enrichedPrompt);

    // Analyze quality
    let analysis = this.analyzer.analyzeGeneratedTest(testCode);

    // Refine if needed (up to 3 iterations)
    let iterations = 0;
    while (analysis.score < 70 && iterations < 3) {
      console.log(`Quality score: ${analysis.score}. Refining...`);
      testCode = await this.analyzer.refineTest(testCode, analysis);
      analysis = this.analyzer.analyzeGeneratedTest(testCode);
      iterations++;
    }

    console.log(`✅ Generated tests with quality score: ${analysis.score}`);

    return { code: testCode, quality: analysis };
  }

  private extractAcceptanceCriteria(userStory: string): string[] {
    // Simple extraction - in production, use more sophisticated parsing.
    // Trim before matching so indented bullet lines are recognized too.
    return userStory
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => /^[-*]\s+/.test(line))
      .map((line) => line.replace(/^[-*]\s+/, ''));
  }

  private generateContextExample(context: AppContext): string {
    // Generate example tests showing app structure
    return `// Example showing app structure:
test('example', async ({ page }) => {
  ${
    context.authRequired
      ? `await page.goto('/login');
  await page.getByRole('button', { name: 'Login' }).click();`
      : ''
  }
  
  // Available pages: ${Object.keys(context.pageStructure).join(', ')}
  // API endpoints: ${context.apiEndpoints.slice(0, 3).join(', ')}
});`;
  }
}
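The bullet-extraction helper is simple enough to verify in isolation. A runnable sketch, trimming each line before the bullet check so indented bullets (as in typical user-story text) are handled:

```typescript
// Standalone sketch of extractAcceptanceCriteria: keep "- " / "* " bullet
// lines from a user story and strip the bullet markers.
function extractAcceptanceCriteria(userStory: string): string[] {
  return userStory
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^[-*]\s+/.test(line))
    .map((line) => line.replace(/^[-*]\s+/, ""));
}

const story = `As a user, I want to log in securely.
  - User can log in with valid credentials
  - User sees error with invalid credentials`;

console.log(extractAcceptanceCriteria(story));
```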

4. Complete Test Generation Pipeline

// test-generation-pipeline.ts
import { writeFile } from 'fs/promises';
import { join } from 'path';

interface GeneratedTestSuite {
  filename: string;
  code: string;
  quality: TestQualityAnalysis;
  coverage: {
    scenarios: number;
    edgeCases: number;
    assertions: number;
  };
}

class TestGenerationPipeline {
  private generator: ContextAwareTestGenerator;

  constructor(apiKey: string) {
    this.generator = new ContextAwareTestGenerator(apiKey);
  }

  async generateTestSuite(feature: string, requirements: string, context: AppContext): Promise<GeneratedTestSuite> {
    console.log(`🤖 Generating tests for: ${feature}`);

    // Step 1: Generate tests with context
    const { code, quality } = await this.generator.generateWithContext(feature, requirements, context);

    // Step 2: Analyze coverage
    const coverage = this.analyzeCoverage(code);

    // Step 3: Add human review markers
    const annotatedCode = this.addReviewMarkers(code, quality);

    // Step 4: Save to file
    const filename = this.generateFilename(feature);
    await this.saveTest(filename, annotatedCode);

    console.log(`✅ Generated ${filename}`);
    console.log(`   Quality: ${quality.score}/100`);
    console.log(`   Coverage: ${coverage.scenarios} scenarios, ${coverage.assertions} assertions`);

    return { filename, code: annotatedCode, quality, coverage };
  }

  private analyzeCoverage(code: string): GeneratedTestSuite['coverage'] {
    return {
      scenarios: (code.match(/test\(/g) || []).length,
      edgeCases: (code.match(/edge case|boundary|invalid|error/gi) || []).length,
      assertions: (code.match(/expect\(/g) || []).length,
    };
  }

  private addReviewMarkers(code: string, quality: TestQualityAnalysis): string {
    let annotated = `/**
 * AUTO-GENERATED TEST SUITE
 * Generated at: ${new Date().toISOString()}
 * Quality Score: ${quality.score}/100
 * 
 * ⚠️  HUMAN REVIEW REQUIRED:
${quality.issues.map((issue) => ` * - [${issue.severity}] ${issue.description}`).join('\n')}
 * 
 * ✅ Strengths:
${quality.strengths.map((s) => ` * - ${s}`).join('\n')}
 */

${code}
`;

    // Add inline comments for critical issues
    quality.issues
      .filter((i) => i.severity === 'critical' || i.severity === 'high')
      .forEach((issue) => {
        // This is simplified - in production, use AST manipulation
        annotated = `// TODO: ${issue.description} - ${issue.suggestion}\n${annotated}`;
      });

    return annotated;
  }

  private generateFilename(feature: string): string {
    const slug = feature.toLowerCase().replace(/[^a-z0-9]+/g, '-');
    return `${slug}.spec.ts`;
  }

  private async saveTest(filename: string, code: string): Promise<void> {
    // Ensure the output directory exists before writing; the first run
    // would otherwise fail with ENOENT.
    const { mkdir } = await import('fs/promises');
    const dir = join(process.cwd(), 'tests', 'generated');
    await mkdir(dir, { recursive: true });
    await writeFile(join(dir, filename), code, 'utf-8');
  }
}

// Usage example
async function main() {
  const pipeline = new TestGenerationPipeline(process.env.OPENAI_API_KEY!);

  const context: AppContext = {
    pageStructure: {
      '/login': ['email input', 'password input', 'submit button'],
      '/dashboard': ['user menu', 'project list', 'create button'],
      '/settings': ['profile form', 'password form', 'delete button'],
    },
    apiEndpoints: ['/api/auth/login', '/api/projects', '/api/users'],
    authRequired: true,
    userRoles: ['user', 'admin'],
  };

  const suite = await pipeline.generateTestSuite(
    'User Authentication',
    `As a user, I want to log in securely so that I can access my dashboard.
    - User can log in with valid credentials
    - User sees error with invalid credentials
    - User is redirected to dashboard after successful login
    - User can reset forgotten password
    - Login form validates email format
    - Login attempts are rate-limited after 5 failures`,
    context,
  );

  console.log(`\n📊 Test Suite Summary:`);
  console.log(`   File: ${suite.filename}`);
  console.log(`   Quality: ${suite.quality.score}/100`);
  console.log(`   Scenarios: ${suite.coverage.scenarios}`);
  console.log(`   Assertions: ${suite.coverage.assertions}`);

  if (suite.quality.issues.length > 0) {
    console.log(`\n⚠️  Issues requiring review:`);
    suite.quality.issues.forEach((issue) => {
      console.log(`   [${issue.severity}] ${issue.description}`);
    });
  }
}

main().catch(console.error);
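The pipeline's coverage numbers come from simple regex counting. Isolated, the heuristic looks like this (the sample test snippet is invented for illustration):

```typescript
// Standalone sketch of the pipeline's regex-based coverage heuristic:
// count test blocks, edge-case keywords, and assertions in generated code.
function analyzeCoverage(code: string) {
  return {
    scenarios: (code.match(/test\(/g) || []).length,
    edgeCases: (code.match(/edge case|boundary|invalid|error/gi) || []).length,
    assertions: (code.match(/expect\(/g) || []).length,
  };
}

const sample = `
test('shows error for invalid email', async ({ page }) => {
  await expect(page.getByText('Invalid email')).toBeVisible();
});
test('logs in with valid credentials', async ({ page }) => {
  await expect(page).toHaveURL('/dashboard');
});
`;

console.log(analyzeCoverage(sample));
// → { scenarios: 2, edgeCases: 3, assertions: 2 }
```

Keyword counting is a rough proxy, not real coverage measurement: it counts "error" and "invalid" wherever they appear, so treat these numbers as a review signal rather than a quality gate.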

Real-World Results

Time Savings

| Task | Manual Time | LLM-Assisted | Savings |
| --- | --- | --- | --- |
| Simple CRUD tests | 2 hours | 15 minutes | 87.5% |
| Complex user flows | 6 hours | 1.5 hours | 75% |
| API integration tests | 4 hours | 45 minutes | 81% |
| Accessibility tests | 3 hours | 30 minutes | 83% |
| Error scenario tests | 2 hours | 20 minutes | 83% |
| Overall average | - | - | ~80% |

Quality Metrics (After Human Review)

| Metric | LLM-Only | LLM + Human | Traditional |
| --- | --- | --- | --- |
| Test Coverage | 85% | 95% | 92% |
| Flakiness Rate | 12% | 3% | 5% |
| Maintenance Burden | High | Medium | Medium |
| Edge Case Coverage | 60% | 90% | 85% |
| Time to Create | Fast | Fast | Slow |

Best Practices for LLM Test Generation

✅ DO:

  1. Provide rich context: App structure, existing patterns, domain knowledge
  2. Review thoroughly: Never commit AI-generated code without review
  3. Iterate prompts: Refine prompts based on output quality
  4. Add domain expertise: Supplement with edge cases AI doesn't know
  5. Use for boilerplate: Let AI handle repetitive setup/teardown code
  6. Validate locally: Run tests multiple times before committing
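For the local-validation step, Playwright's built-in repeat and retry flags surface flakiness before a commit (assuming a standard Playwright project):

```shell
# Run each test five times with retries disabled, so flaky tests
# fail loudly instead of being masked by automatic retries.
npx playwright test --repeat-each=5 --retries=0
```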

❌ DON'T:

  1. Blindly trust output: AI makes mistakes, especially with domain logic
  2. Skip code review: Treat AI code like junior developer code
  3. Forget maintenance: AI-generated tests still need updates
  4. Over-rely on AI: Critical tests should be human-designed
  5. Ignore quality issues: Fix flaky waits, brittle selectors immediately
  6. Miss security tests: LLMs often miss security edge cases

Conclusion

LLMs can reduce test writing time by 80%, but only if you use them correctly.

Key insights:

  1. LLMs excel at boilerplate and common patterns
  2. Humans must provide domain context and strategic thinking
  3. Quality review is non-negotiable
  4. Best results come from AI + human collaboration, not replacement

The workflow that works:

  1. Human defines test strategy
  2. LLM generates test code
  3. Human reviews and augments
  4. LLM helps maintain/refactor
  5. Human validates quality

Think of LLMs as a highly productive junior engineer who needs review and guidance but can dramatically accelerate output.

Ready to 10x your test automation productivity? Sign up for ScanlyApp and integrate AI-powered test generation into your QA workflow today.

Related articles: comparing LLM-based testing tools side by side, making LLM-generated tests more resilient with self-healing, and design patterns that keep AI-generated tests maintainable.
