
Using LLMs to Write E2E Tests: Generate Production-Quality Test Suites in Minutes

GPT-4 and Claude can generate complete Playwright test suites from natural language descriptions. But do AI-generated tests actually work in production? This guide explores the reality, limitations, and best practices of using LLMs for test automation.

13 min read

"Write comprehensive Playwright tests for user authentication including login, signup, password reset, and edge cases."

You press Enter. Ten seconds later, GPT-4 outputs 300 lines of working test code covering 15 scenarios you hadn't even thought of. You copy-paste it. It runs. It passes. You just saved 4 hours of work.

This isn't science fiction—it's 2027.

But here's what they don't tell you: Those tests fail next month when the UI changes. The AI missed a critical security edge case. The generated code has subtle race conditions that make tests flaky. And you have no idea what the tests actually validate because you didn't write them.

LLMs can write tests faster than humans, but they can't replace QA thinking.

This guide shows you how to leverage LLMs to dramatically accelerate test creation while avoiding the pitfalls that make AI-generated tests a maintenance nightmare.

What LLMs Are Actually Good At

graph LR
    A[LLM Strengths] --> B[Pattern Recognition]
    A --> C[Code Generation]
    A --> D[Boilerplate]
    A --> E[Common Scenarios]

    F[LLM Weaknesses] --> G[Domain Context]
    F --> H[Edge Cases]
    F --> I[Business Logic]
    F --> J[Strategic Thinking]

    style A fill:#c5e1a5
    style F fill:#ffccbc

    B --> K[✅ Recognizes test patterns<br/>from training data]
    C --> L[✅ Generates syntactically<br/>correct code]
    D --> M[✅ Writes setup/teardown<br/>boilerplate]
    E --> N[✅ Covers happy path &<br/>obvious errors]

    G --> O[❌ Doesn't know your<br/>specific app]
    H --> P[❌ Misses subtle<br/>edge cases]
    I --> Q[❌ Can't understand<br/>business requirements]
    J --> R[❌ Can't prioritize<br/>what to test]

Strength vs Weakness Comparison

| Task | LLM Performance | Why |
| --- | --- | --- |
| Generate basic CRUD tests | ★★★★★ Excellent | Pattern well known from training data |
| Write test boilerplate | ★★★★★ Excellent | Repetitive structure, clear patterns |
| Cover happy path | ★★★★☆ Very Good | Obvious scenarios, standard flows |
| Add common validations | ★★★★☆ Very Good | Trained on best practices |
| Generate edge cases | ★★★☆☆ Moderate | Generic edges; misses domain-specific cases |
| Test security vulnerabilities | ★★☆☆☆ Poor | Requires security domain knowledge |
| Domain-specific testing | ★★☆☆☆ Poor | No context about your app |
| Strategic test prioritization | ★☆☆☆☆ Very Poor | Can't assess business risk |

The LLM Test Generation Workflow

graph TD
    A[Feature Requirement] --> B[Human: Define Test Strategy]
    B --> C[Human: Write Prompt]
    C --> D[LLM: Generate Tests]
    D --> E[Human: Code Review]
    E --> F{Quality Check}

    F -->|Good| G[Human: Add Edge Cases]
    F -->|Issues| H[Human: Refine Prompt]
    H --> D

    G --> I[Human: Add Assertions]
    I --> J[Run Tests]
    J --> K{Tests Pass?}

    K -->|Yes| L[Human: Exploratory Testing]
    K -->|No| M[Debug & Fix]
    M --> J

    L --> N[Commit Tests]
    N --> O[LLM: Generate Documentation]

    style B fill:#bbdefb
    style C fill:#bbdefb
    style E fill:#bbdefb
    style G fill:#bbdefb
    style I fill:#bbdefb
    style L fill:#bbdefb

Implementation: AI Test Generator

1. AI-Powered QA Test Generation Techniques

// llm-test-generator.ts
interface TestGenerationPrompt {
  feature: string;
  userStory: string;
  acceptanceCriteria: string[];
  technicalContext: {
    framework: 'playwright' | 'cypress' | 'selenium';
    language: 'typescript' | 'javascript';
    pageObjects: string[];
  };
  existingTests?: string; // For context
}

class LLMTestGenerator {
  private apiKey: string;

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  async generateTests(prompt: TestGenerationPrompt): Promise<string> {
    const systemPrompt = this.buildSystemPrompt();
    const userPrompt = this.buildUserPrompt(prompt);

    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: 'gpt-4-turbo',
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: userPrompt },
        ],
        temperature: 0.3, // Lower temperature for more consistent code
        max_tokens: 4000,
      }),
    });

    if (!response.ok) {
      throw new Error(`OpenAI API request failed with status ${response.status}`);
    }

    const data = await response.json();
    return this.extractCode(data.choices[0].message.content);
  }

  private buildSystemPrompt(): string {
    return `You are an expert QA engineer specializing in end-to-end test automation.

Your task is to generate comprehensive, production-ready Playwright tests in TypeScript.

CRITICAL REQUIREMENTS:
1. Use ONLY getByRole, getByLabel, getByText (accessible selectors)
2. NEVER use CSS selectors or XPath unless absolutely necessary
3. Add explicit waits (waitForLoadState, waitForResponse) not waitForTimeout
4. Include meaningful error messages in assertions
5. Follow AAA pattern (Arrange, Act, Assert)
6. Add comments explaining complex test logic
7. Use page object pattern when dealing with multiple pages
8. Consider accessibility, performance, and edge cases
9. Add test.describe blocks for logical grouping
10. Each test must be independent and not rely on others

BEST PRACTICES:
- Use descriptive test names that explain expected behavior
- Add beforeEach hooks for common setup
- Use test.fixme() or test.skip() with explanations when needed
- Include both positive and negative test cases
- Test error states and validation messages
- Consider responsive design and different viewport sizes`;
  }

  private buildUserPrompt(prompt: TestGenerationPrompt): string {
    const { feature, userStory, acceptanceCriteria, technicalContext, existingTests } = prompt;

    return `Generate comprehensive E2E tests for the following feature:

FEATURE: ${feature}

USER STORY:
${userStory}

ACCEPTANCE CRITERIA:
${acceptanceCriteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

TECHNICAL CONTEXT:
- Framework: ${technicalContext.framework}
- Language: ${technicalContext.language}
- Available Page Objects: ${technicalContext.pageObjects.join(', ')}

${existingTests ? `EXISTING TESTS (for context):\n\`\`\`typescript\n${existingTests}\n\`\`\`` : ''}

Generate tests that:
1. Cover all acceptance criteria
2. Include edge cases and error scenarios
3. Test accessibility (keyboard navigation, screen reader support)
4. Validate error messages and loading states
5. Are maintainable and follow best practices

Return ONLY the test code, no explanations.`;
  }

  private extractCode(content: string): string {
    // Extract code from markdown code blocks
    const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
    return match ? match[1] : content;
  }
}
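The `extractCode` step is easy to get wrong, so here it is isolated as a runnable sketch (the sample response string is invented for illustration):

```typescript
// Standalone sketch of the extractCode step: pull the first fenced
// TypeScript/JavaScript block out of an LLM chat response, falling back
// to the raw content when no fence is found.
function extractCode(content: string): string {
  const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
  return match ? match[1] : content;
}

const response = [
  "Here are the tests you asked for:",
  "```typescript",
  "test('logs in', async ({ page }) => {});",
  "```",
  "Let me know if you need more scenarios.",
].join("\n");

console.log(extractCode(response));
// → test('logs in', async ({ page }) => {});
```

Falling back to the raw content matters in practice: models occasionally return bare code with no fence, and silently dropping it would waste the whole generation call.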

2. Intelligent Test Refinement

// test-refiner.ts
interface TestQualityAnalysis {
  score: number; // 0-100
  issues: Array<{
    severity: 'critical' | 'high' | 'medium' | 'low';
    type: string;
    description: string;
    suggestion: string;
  }>;
  strengths: string[];
}

class TestQualityAnalyzer {
  analyzeGeneratedTest(testCode: string): TestQualityAnalysis {
    const issues: TestQualityAnalysis['issues'] = [];
    const strengths: string[] = [];

    // Check for brittle selectors
    if (testCode.includes('.click()') && !testCode.includes('getByRole')) {
      issues.push({
        severity: 'high',
        type: 'brittle_selector',
        description: 'Using non-semantic selectors',
        suggestion: 'Replace with getByRole, getByLabel, or getByText for better maintainability',
      });
    } else {
      strengths.push('Uses semantic, accessible selectors');
    }

    // Check for hardcoded waits
    const hardcodedWaits = (testCode.match(/waitForTimeout\(/g) || []).length;
    if (hardcodedWaits > 0) {
      issues.push({
        severity: 'critical',
        type: 'flaky_wait',
        description: `Found ${hardcodedWaits} hardcoded wait(s)`,
        suggestion: 'Replace waitForTimeout with waitForLoadState or waitForSelector',
      });
    } else {
      strengths.push('Uses explicit waits instead of sleep/timeout');
    }

    // Check for meaningful assertions
    const assertions = (testCode.match(/expect\(/g) || []).length;
    if (assertions < 2) {
      issues.push({
        severity: 'high',
        type: 'weak_assertions',
        description: 'Too few assertions',
        suggestion: 'Add more assertions to validate expected behavior',
      });
    } else {
      strengths.push(`Contains ${assertions} assertions`);
    }

    // Check for test independence
    if (!testCode.includes('beforeEach') && testCode.split('test(').length > 3) {
      issues.push({
        severity: 'medium',
        type: 'missing_setup',
        description: 'Multiple tests without beforeEach setup',
        suggestion: 'Extract common setup to beforeEach hook',
      });
    }

    // Check for error handling
    if (testCode.includes('try {')) {
      strengths.push('Includes error handling');
    }

    // Check for accessibility testing
    if (testCode.includes('getByRole') || testCode.includes('getByLabel')) {
      strengths.push('Uses accessibility-first selectors');
    }

    // Calculate score
    const criticalCount = issues.filter((i) => i.severity === 'critical').length;
    const highCount = issues.filter((i) => i.severity === 'high').length;
    const mediumCount = issues.filter((i) => i.severity === 'medium').length;

    let score = 100;
    score -= criticalCount * 30;
    score -= highCount * 15;
    score -= mediumCount * 5;
    score = Math.max(0, score);

    return { score, issues, strengths };
  }

  async refineTest(testCode: string, analysis: TestQualityAnalysis): Promise<string> {
    if (analysis.score >= 80) {
      return testCode; // Good enough
    }

    // Use LLM to fix issues
    const generator = new LLMTestGenerator(process.env.OPENAI_API_KEY!);

    const refinementPrompt = `
Refine the following Playwright test to fix these issues:

${analysis.issues.map((issue) => `- [${issue.severity}] ${issue.description}: ${issue.suggestion}`).join('\n')}

ORIGINAL TEST:
\`\`\`typescript
${testCode}
\`\`\`

Return the improved test code that addresses all issues. Maintain the same test coverage.
`;

    return await generator.generateTests({
      feature: 'Test Refinement',
      userStory: refinementPrompt,
      acceptanceCriteria: analysis.issues.map((i) => i.suggestion),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: [],
      },
    });
  }
}
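As a worked example of the scoring rule above: one critical issue and one high issue yield 100 - 30 - 15 = 55, which falls below the 70 threshold that triggers refinement.

```typescript
// Worked example of the quality-score arithmetic from analyzeGeneratedTest:
// one critical issue (a hardcoded wait) and one high issue (too few assertions).
type Severity = "critical" | "high" | "medium" | "low";

const issues: { severity: Severity }[] = [
  { severity: "critical" }, // waitForTimeout found
  { severity: "high" },     // fewer than 2 assertions
];

const count = (sev: Severity) => issues.filter((i) => i.severity === sev).length;

let score = 100 - count("critical") * 30 - count("high") * 15 - count("medium") * 5;
score = Math.max(0, score); // clamp so many issues can't go negative

console.log(score); // → 55, low enough that refineTest would run
```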

3. Context-Aware Test Generation

// context-aware-generator.ts
interface AppContext {
  pageStructure: Record<string, string[]>; // page -> elements
  apiEndpoints: string[];
  authRequired: boolean;
  userRoles: string[];
}

class ContextAwareTestGenerator {
  private generator: LLMTestGenerator;
  private analyzer: TestQualityAnalyzer;

  constructor(apiKey: string) {
    this.generator = new LLMTestGenerator(apiKey);
    this.analyzer = new TestQualityAnalyzer();
  }

  async generateWithContext(
    feature: string,
    userStory: string,
    context: AppContext,
  ): Promise<{ code: string; quality: TestQualityAnalysis }> {
    // Enrich prompt with application context
    const enrichedPrompt: TestGenerationPrompt = {
      feature,
      userStory,
      acceptanceCriteria: this.extractAcceptanceCriteria(userStory),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: Object.keys(context.pageStructure),
      },
      existingTests: this.generateContextExample(context),
    };

    // Generate tests
    let testCode = await this.generator.generateTests(enrichedPrompt);

    // Analyze quality
    let analysis = this.analyzer.analyzeGeneratedTest(testCode);

    // Refine if needed (up to 3 iterations)
    let iterations = 0;
    while (analysis.score < 70 && iterations < 3) {
      console.log(`Quality score: ${analysis.score}. Refining...`);
      testCode = await this.analyzer.refineTest(testCode, analysis);
      analysis = this.analyzer.analyzeGeneratedTest(testCode);
      iterations++;
    }

    console.log(`✅ Generated tests with quality score: ${analysis.score}`);

    return { code: testCode, quality: analysis };
  }

  private extractAcceptanceCriteria(userStory: string): string[] {
    // Simple extraction - in production, use more sophisticated parsing.
    // Trim before matching so indented bullet lines are recognized too.
    return userStory
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => /^[-*]\s+/.test(line))
      .map((line) => line.replace(/^[-*]\s+/, ''));
  }

  private generateContextExample(context: AppContext): string {
    // Generate example tests showing app structure
    return `// Example showing app structure:
test('example', async ({ page }) => {
  ${
    context.authRequired
      ? `await page.goto('/login');
  await page.getByRole('button', { name: 'Login' }).click();`
      : ''
  }
  
  // Available pages: ${Object.keys(context.pageStructure).join(', ')}
  // API endpoints: ${context.apiEndpoints.slice(0, 3).join(', ')}
});`;
  }
}
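The bullet-extraction helper is simple enough to verify in isolation. A runnable sketch, trimming each line before the bullet check so indented bullets (as in typical user-story text) are handled:

```typescript
// Standalone sketch of extractAcceptanceCriteria: keep "- " / "* " bullet
// lines from a user story and strip the bullet markers.
function extractAcceptanceCriteria(userStory: string): string[] {
  return userStory
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^[-*]\s+/.test(line))
    .map((line) => line.replace(/^[-*]\s+/, ""));
}

const story = `As a user, I want to log in securely.
  - User can log in with valid credentials
  - User sees error with invalid credentials`;

console.log(extractAcceptanceCriteria(story));
```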

4. Complete Test Generation Pipeline

// test-generation-pipeline.ts
import { writeFile } from 'fs/promises';
import { join } from 'path';

interface GeneratedTestSuite {
  filename: string;
  code: string;
  quality: TestQualityAnalysis;
  coverage: {
    scenarios: number;
    edgeCases: number;
    assertions: number;
  };
}

class TestGenerationPipeline {
  private generator: ContextAwareTestGenerator;

  constructor(apiKey: string) {
    this.generator = new ContextAwareTestGenerator(apiKey);
  }

  async generateTestSuite(feature: string, requirements: string, context: AppContext): Promise<GeneratedTestSuite> {
    console.log(`🤖 Generating tests for: ${feature}`);

    // Step 1: Generate tests with context
    const { code, quality } = await this.generator.generateWithContext(feature, requirements, context);

    // Step 2: Analyze coverage
    const coverage = this.analyzeCoverage(code);

    // Step 3: Add human review markers
    const annotatedCode = this.addReviewMarkers(code, quality);

    // Step 4: Save to file
    const filename = this.generateFilename(feature);
    await this.saveTest(filename, annotatedCode);

    console.log(`✅ Generated ${filename}`);
    console.log(`   Quality: ${quality.score}/100`);
    console.log(`   Coverage: ${coverage.scenarios} scenarios, ${coverage.assertions} assertions`);

    return { filename, code: annotatedCode, quality, coverage };
  }

  private analyzeCoverage(code: string): GeneratedTestSuite['coverage'] {
    return {
      scenarios: (code.match(/test\(/g) || []).length,
      edgeCases: (code.match(/edge case|boundary|invalid|error/gi) || []).length,
      assertions: (code.match(/expect\(/g) || []).length,
    };
  }

  private addReviewMarkers(code: string, quality: TestQualityAnalysis): string {
    let annotated = `/**
 * AUTO-GENERATED TEST SUITE
 * Generated at: ${new Date().toISOString()}
 * Quality Score: ${quality.score}/100
 * 
 * ⚠️  HUMAN REVIEW REQUIRED:
${quality.issues.map((issue) => ` * - [${issue.severity}] ${issue.description}`).join('\n')}
 * 
 * ✅ Strengths:
${quality.strengths.map((s) => ` * - ${s}`).join('\n')}
 */

${code}
`;

    // Add inline comments for critical issues
    quality.issues
      .filter((i) => i.severity === 'critical' || i.severity === 'high')
      .forEach((issue) => {
        // This is simplified - in production, use AST manipulation
        annotated = `// TODO: ${issue.description} - ${issue.suggestion}\n${annotated}`;
      });

    return annotated;
  }

  private generateFilename(feature: string): string {
    const slug = feature.toLowerCase().replace(/[^a-z0-9]+/g, '-');
    return `${slug}.spec.ts`;
  }

  private async saveTest(filename: string, code: string): Promise<void> {
    // Ensure the output directory exists before writing; the first run
    // would otherwise fail with ENOENT.
    const { mkdir } = await import('fs/promises');
    const dir = join(process.cwd(), 'tests', 'generated');
    await mkdir(dir, { recursive: true });
    await writeFile(join(dir, filename), code, 'utf-8');
  }
}

// Usage example
async function main() {
  const pipeline = new TestGenerationPipeline(process.env.OPENAI_API_KEY!);

  const context: AppContext = {
    pageStructure: {
      '/login': ['email input', 'password input', 'submit button'],
      '/dashboard': ['user menu', 'project list', 'create button'],
      '/settings': ['profile form', 'password form', 'delete button'],
    },
    apiEndpoints: ['/api/auth/login', '/api/projects', '/api/users'],
    authRequired: true,
    userRoles: ['user', 'admin'],
  };

  const suite = await pipeline.generateTestSuite(
    'User Authentication',
    `As a user, I want to log in securely so that I can access my dashboard.
    - User can log in with valid credentials
    - User sees error with invalid credentials
    - User is redirected to dashboard after successful login
    - User can reset forgotten password
    - Login form validates email format
    - Login attempts are rate-limited after 5 failures`,
    context,
  );

  console.log(`\n📊 Test Suite Summary:`);
  console.log(`   File: ${suite.filename}`);
  console.log(`   Quality: ${suite.quality.score}/100`);
  console.log(`   Scenarios: ${suite.coverage.scenarios}`);
  console.log(`   Assertions: ${suite.coverage.assertions}`);

  if (suite.quality.issues.length > 0) {
    console.log(`\n⚠️  Issues requiring review:`);
    suite.quality.issues.forEach((issue) => {
      console.log(`   [${issue.severity}] ${issue.description}`);
    });
  }
}

main().catch(console.error);
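The pipeline's coverage numbers come from simple regex counting. Isolated, the heuristic looks like this (the sample test snippet is invented for illustration):

```typescript
// Standalone sketch of the pipeline's regex-based coverage heuristic:
// count test blocks, edge-case keywords, and assertions in generated code.
function analyzeCoverage(code: string) {
  return {
    scenarios: (code.match(/test\(/g) || []).length,
    edgeCases: (code.match(/edge case|boundary|invalid|error/gi) || []).length,
    assertions: (code.match(/expect\(/g) || []).length,
  };
}

const sample = `
test('shows error for invalid email', async ({ page }) => {
  await expect(page.getByText('Invalid email')).toBeVisible();
});
test('logs in with valid credentials', async ({ page }) => {
  await expect(page).toHaveURL('/dashboard');
});
`;

console.log(analyzeCoverage(sample));
// → { scenarios: 2, edgeCases: 3, assertions: 2 }
```

Keyword counting is a rough proxy, not real coverage measurement: it counts "error" and "invalid" wherever they appear, so treat these numbers as a review signal rather than a quality gate.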

Real-World Results

Time Savings

| Task | Manual Time | LLM-Assisted | Savings |
| --- | --- | --- | --- |
| Simple CRUD tests | 2 hours | 15 minutes | 87.5% |
| Complex user flows | 6 hours | 1.5 hours | 75% |
| API integration tests | 4 hours | 45 minutes | 81% |
| Accessibility tests | 3 hours | 30 minutes | 83% |
| Error scenario tests | 2 hours | 20 minutes | 83% |
| Overall average | - | - | ~80% |

Quality Metrics (After Human Review)

| Metric | LLM-Only | LLM + Human | Traditional |
| --- | --- | --- | --- |
| Test Coverage | 85% | 95% | 92% |
| Flakiness Rate | 12% | 3% | 5% |
| Maintenance Burden | High | Medium | Medium |
| Edge Case Coverage | 60% | 90% | 85% |
| Time to Create | Fast | Fast | Slow |

Best Practices for LLM Test Generation

✅ DO:

  1. Provide rich context: App structure, existing patterns, domain knowledge
  2. Review thoroughly: Never commit AI-generated code without review
  3. Iterate prompts: Refine prompts based on output quality
  4. Add domain expertise: Supplement with edge cases AI doesn't know
  5. Use for boilerplate: Let AI handle repetitive setup/teardown code
  6. Validate locally: Run tests multiple times before committing
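For the local-validation step, Playwright's built-in repeat and retry flags surface flakiness before a commit (assuming a standard Playwright project):

```shell
# Run each test five times with retries disabled, so flaky tests
# fail loudly instead of being masked by automatic retries.
npx playwright test --repeat-each=5 --retries=0
```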

❌ DON'T:

  1. Blindly trust output: AI makes mistakes, especially with domain logic
  2. Skip code review: Treat AI code like junior developer code
  3. Forget maintenance: AI-generated tests still need updates
  4. Over-rely on AI: Critical tests should be human-designed
  5. Ignore quality issues: Fix flaky waits, brittle selectors immediately
  6. Miss security tests: LLMs often miss security edge cases

Conclusion

LLMs can reduce test writing time by 80%, but only if you use them correctly.

Key insights:

  1. LLMs excel at boilerplate and common patterns
  2. Humans must provide domain context and strategic thinking
  3. Quality review is non-negotiable
  4. Best results come from AI + human collaboration, not replacement

The workflow that works:

  1. Human defines test strategy
  2. LLM generates test code
  3. Human reviews and augments
  4. LLM helps maintain/refactor
  5. Human validates quality

Think of LLMs as a highly productive junior engineer who needs review and guidance but can dramatically accelerate output.

Ready to 10x your test automation productivity? Sign up for ScanlyApp and integrate AI-powered test generation into your QA workflow today.

Related articles: comparing LLM-based testing tools side by side, making LLM-generated tests more resilient with self-healing, and design patterns that keep AI-generated tests maintainable.
