Testing Strategy
Untested code is an unverified artifact—bugs, regressions, and security defects ship silently until production proves you wrong. This chapter maps the test pyramid to pipeline stages: fast unit gates on every commit, integration tests with real dependencies, targeted E2E smoke paths, performance budgets in CI, unified reporting, and a disciplined approach to flaky tests that erode trust in green builds.
Test Pyramid
The test pyramid is a capacity-planning model for confidence: many fast, isolated tests at the base; fewer integration tests in the middle; a thin cap of end-to-end scenarios that prove critical user journeys. Invert the pyramid and CI becomes slow, brittle, and ignored—a direct path to shipping defects and bypassing security gates.
Threat model: untested and mis-tested code
Every merge without adequate tests is a gamble. Regressions slip through review; authorization bugs hide behind happy-path E2E; performance cliffs surface only under production load. Security teams cannot sign off on artifacts whose behavior was never verified. The pyramid exists to allocate finite CI minutes and developer attention where they catch the most defects per dollar.
| Layer | Scope | Speed target | What it catches |
|---|---|---|---|
| Unit | Single function/class; deps mocked | < 10 ms per test | Logic bugs, edge cases, input validation, crypto helpers |
| Integration | Module + real DB, queue, HTTP peer | Seconds per suite | SQL migrations, serialization, retry semantics, contract drift |
| E2E / acceptance | Full stack, browser or API client | Minutes per scenario | Routing, auth flows, deploy config, cross-service wiring |
The inverted pyramid anti-pattern
Teams under schedule pressure often pile UI-driven Selenium suites on top of a shallow unit base—the ice-cream cone or inverted pyramid. Every release waits on a two-hour E2E job; failures are ambiguous (was it the UI, the API, the DB seed, or a flaky network?); developers re-run until green or merge with [skip ci]. Security regressions in business logic never get E2E coverage because combinatorics explode.
| Pattern | Symptom | Outcome |
|---|---|---|
| Healthy pyramid | 80% unit, 15% integration, 5% E2E | Sub-5-minute PR feedback; failures pinpoint modules |
| Inverted pyramid | Handful of unit tests; hundreds of UI tests | 45+ minute pipelines; flaky red builds; merge anyway |
| Testing trophy (Kent C. Dodds) | Emphasis on integration for UI-heavy apps | Valid for SPAs—still keep unit base for pure logic |
| No pyramid (YOLO) | Manual QA only before release | Audit failure; incident-driven hotfixes |
Swiss cheese model of defense
No single test layer catches everything. In the Swiss cheese model each layer has holes, and stacking layers reduces the chance that a defect reaches production. Unit tests are blind wherever I/O is mocked; integration tests miss UI routing; E2E misses rare branch logic. Add SAST/SCA (Security Scanning) as another slice. Pipeline design maps layers to stages so a failure at any slice blocks promotion.
flowchart TB
    subgraph pyramid["Test pyramid → CI stages"]
        U["Unit tests\n(every commit, < 3 min)"]
        I["Integration\n(PR + main, Testcontainers)"]
        E["E2E smoke\n(post-deploy staging)"]
        P["Performance budget\n(nightly / release)"]
    end
    U --> I --> E --> P
    subgraph gates["Pipeline gates"]
        G1["Coverage ≥ 80%"]
        G2["Pact verify"]
        G3["Playwright smoke"]
        G4["k6 p95 < 300ms"]
    end
    U -.-> G1
    I -.-> G2
    E -.-> G3
    P -.-> G4
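The Swiss cheese argument can be made concrete with a little arithmetic. The sketch below assumes each layer misses defects independently and uses made-up miss rates; both are simplifying assumptions for illustration, not measurements.

```python
from math import prod


def escape_probability(miss_rates):
    """Chance a defect slips past every layer, assuming independent misses."""
    return prod(miss_rates)


# Hypothetical miss rates per layer (illustrative values only).
layers = {"unit": 0.4, "integration": 0.5, "e2e": 0.7, "sast_sca": 0.8}

unit_only = escape_probability([layers["unit"]])
stacked = escape_probability(layers.values())
print(f"unit only: {unit_only:.2f}, all layers: {stacked:.3f}")
```

Under these assumed numbers, even leaky layers compound: the share of defects that escapes drops from 40% with unit tests alone to roughly 11% with all four slices.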
Pipeline integration: layer → stage mapping
Align pyramid layers with CI pipeline stages. Unit tests run on every push with concurrency cancel; integration runs after build artifact exists; E2E runs against deployed staging with the same image digest that passed earlier gates—not a rebuild.
# GitHub Actions
name: Test Pyramid
on: [pull_request, push]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:unit -- --maxWorkers=4
  integration:
    needs: unit
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: test }
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      # Each job starts on a fresh runner, so dependencies are installed again.
      - run: npm ci && npm run test:integration
  e2e-smoke:
    needs: integration
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npx playwright install --with-deps chromium
      - run: npx playwright test --grep @smoke
# GitLab CI
stages: [unit, integration, e2e]
unit-test:
  stage: unit
  script: [npm ci, npm run test:unit]
  artifacts:
    reports:
      junit: junit-unit.xml
integration-test:
  stage: integration
  services: [postgres:16]
  variables:
    POSTGRES_PASSWORD: test   # the postgres image will not start without a password
  script: [npm ci, npm run test:integration]
  needs: [unit-test]
e2e-smoke:
  stage: e2e
  script: [npm ci, npx playwright test --grep @smoke]
  needs: [integration-test]
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
Counting only E2E scenarios as "test coverage" hides untested branches in services. Executives see green dashboards while payment edge cases have zero unit tests. Report pyramid distribution (test count and runtime per layer), not just pass rate.
Target 70/20/10 as a starting ratio (unit/integration/E2E by count). Tune by service: API backends skew more unit; BFF layers skew integration. Track median PR pipeline duration—if it exceeds 12 minutes, trim E2E before cutting unit tests.
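As a rough illustration of reporting pyramid distribution rather than pass rate, the sketch below computes each layer's share of the test count and flags layers that drift from the 70/20/10 starting ratio. The `LayerStats` type, the sample numbers, and the 10-point tolerance are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LayerStats:
    count: int       # number of tests in the layer
    seconds: float   # total runtime of the layer


def distribution(stats):
    """Each layer's share of the total test count, in percent."""
    total = sum(s.count for s in stats.values())
    return {layer: round(100 * s.count / total, 1) for layer, s in stats.items()}


def drift(actual, target, tolerance=10.0):
    """Layers whose share differs from the target by more than `tolerance` points."""
    return [layer for layer in target if abs(actual[layer] - target[layer]) > tolerance]


# Hypothetical inverted pyramid: most tests are E2E.
stats = {
    "unit": LayerStats(300, 40.0),
    "integration": LayerStats(100, 180.0),
    "e2e": LayerStats(600, 5400.0),
}
target = {"unit": 70.0, "integration": 20.0, "e2e": 10.0}

shares = distribution(stats)
off_target = drift(shares, target)
print(shares, off_target)
```

The same `LayerStats.seconds` field could feed a runtime-per-layer view; it is left unused here to keep the sketch short.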
When asked "how do you test microservices?", draw the pyramid, then add contract tests (Pact) as a horizontal slice between integration and E2E—they verify API compatibility without full environment cost.
More E2E increases confidence in wiring but slows feedback and amplifies flakiness. More unit speeds CI but over-mocking hides integration failures. The pyramid is the compromise—adjust the middle layer first when in doubt.
Unit Testing
Unit tests are the high-frequency radar of your codebase—thousands of assertions proving that pure logic, validators, and security helpers behave under edge cases. They must be fast, deterministic, and isolated from network, clock, and filesystem unless explicitly testing adapters.
Threat → solution
Without unit tests, every change is validated only by compilation and manual click-through. Input sanitization regressions, off-by-one authorization checks, and broken HMAC verification ship silently. The solution is a dense unit suite that runs in under three minutes on every commit, with coverage and mutation gates that prove tests actually assert behavior—not just execute lines.
| Stack | Framework | Mocking | Coverage tool |
|---|---|---|---|
| Java | JUnit 5 + AssertJ | Mockito / WireMock (unit only) | JaCoCo |
| JavaScript / TS | Vitest or Jest | vi.mock / jest.mock | Istanbul (c8/nyc) |
| Python | pytest + pytest-cov | unittest.mock / pytest-mock | coverage.py |
Fast tests: rules of thumb
- No real network, DB, or sleep—inject clocks and fake repos.
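- Parallelize: pytest -n auto (needs the pytest-xdist plugin), Jest --maxWorkers, Surefire parallel.
- Keep fixtures in memory; use factories (factory_boy, test data builders) not 10 MB JSON blobs.
- Surface slow tests: pytest --durations=20 --durations-min=0.5 in CI lists offenders slower than 500 ms. The flag only reports; add a timeout plugin such as pytest-timeout if slow tests should fail the build.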
Mocks: when they help and when they lie
Mocks stub collaborators so you test one class in isolation. Over-mocking creates tests that pass while production fails: the mock returns success but the real HTTP client times out. Prefer fakes (in-memory repo) over mocks for persistence; reserve mocks for external SDKs you do not own.
# tests/unit/test_token_service.py
from unittest.mock import Mock

from app.auth.token_service import TokenService


def test_verify_rejects_expired_token():
    # Freeze "now" so the expiry check does not depend on the wall clock.
    clock = Mock()
    clock.now.return_value = 1_700_000_000
    svc = TokenService(clock=clock, secret="test-secret")
    assert svc.verify("expired.jwt.here") is False
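To illustrate why a fake can be safer than a mock for persistence, here is a minimal sketch. The `InMemoryUserRepo` class and `register` function are hypothetical names invented for this example; the point is only that a bare `Mock` accepts any call, while a fake carries the real rule (here, unique emails).

```python
from unittest.mock import Mock


class InMemoryUserRepo:
    """Fake: a small working implementation of a hypothetical repo interface."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, email):
        if email in self._users.values():
            raise ValueError("duplicate email")
        self._users[user_id] = email

    def find(self, user_id):
        return self._users.get(user_id)


def register(repo, user_id, email):
    """Code under test: returns True when the user was stored."""
    try:
        repo.save(user_id, email)
    except ValueError:
        return False
    return True


# A bare Mock accepts every save(), so the duplicate goes unnoticed.
mock_repo = Mock()
mock_results = [register(mock_repo, 1, "a@example.test"),
                register(mock_repo, 2, "a@example.test")]

# The fake enforces uniqueness, so the second registration is rejected.
fake_repo = InMemoryUserRepo()
fake_results = [register(fake_repo, 1, "a@example.test"),
                register(fake_repo, 2, "a@example.test")]
```

A mock could be configured to raise on the second call, but then the test encodes the rule twice; the fake keeps it in one place.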
Coverage thresholds
Line coverage alone is a vanity metric—100% coverage with empty assertions proves nothing. Still, a minimum threshold (typically 80% line / 70% branch) prevents large untested dumps from merging. Enforce differential coverage on changed files for monorepos: only the diff must meet the bar.
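Differential coverage can be sketched in a few lines. This version assumes the changed line numbers (from the diff) and the covered line numbers (from a coverage report) are already available as sets, and it treats every changed line as executable; real tools also filter out comments and blank lines.

```python
def diff_coverage(changed, covered):
    """Coverage of changed lines only, per file and overall (percent).

    changed: {path: set of changed line numbers}
    covered: {path: set of line numbers executed by tests}
    """
    per_file = {}
    hit_total = line_total = 0
    for path, lines in changed.items():
        if not lines:
            continue
        hit = len(lines & covered.get(path, set()))
        per_file[path] = 100.0 * hit / len(lines)
        hit_total += hit
        line_total += len(lines)
    overall = 100.0 * hit_total / line_total if line_total else 100.0
    return per_file, overall


# Hypothetical diff: five lines changed in auth, two in admin.
changed = {"app/auth.py": {10, 11, 12, 13, 14}, "app/admin.py": {3, 4}}
covered = {"app/auth.py": {1, 2, 10, 11, 12, 13}, "app/admin.py": set()}

per_file, overall = diff_coverage(changed, covered)
passes_gate = overall >= 80.0
```

Here the auth change alone would meet an 80% bar, but the untested admin lines pull the diff below it, which is exactly the kind of untested dump the gate is meant to stop.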
Mutation testing: Pitest & Stryker
Mutation frameworks inject small bugs (flip > to >=, delete a guard) and check if tests fail. High line coverage with low mutation score means tests are hollow. Run mutation nightly—not every PR—it's CPU-heavy.
| Tool | Language | CI cost | Typical gate |
|---|---|---|---|
| Pitest | Java / Kotlin | High (minutes per module) | Mutation score ≥ 75% on core packages |
| Stryker | JS / TS | High | Break threshold 80% on src/auth/** |
| mutmut | Python | Medium | Nightly report; block on auth module regression |
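The idea behind a mutation score can be shown by hand, without any framework. In this sketch the "mutant" is written out manually (a flipped comparison) and two toy suites are run against it; Pitest, Stryker, and mutmut automate this over many generated mutants. All function names here are invented for the example.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # Mutation: ">=" flipped to ">" (a boundary mutant).
    return age > 18


def weak_suite(fn):
    """Executes every line but never probes the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary case that tells >= apart from >."""
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite):
    # Killed = the suite passes on the original and fails on the mutant.
    return suite(is_adult) and not suite(is_adult_mutant)


print("weak suite kills mutant:", mutant_killed(weak_suite))
print("strong suite kills mutant:", mutant_killed(strong_suite))
```

Both suites give 100% line coverage of `is_adult`; only the one with the boundary assertion notices the injected bug.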
Pipeline integration: coverage gate
# GitHub Actions
name: Unit + Coverage Gate
on: pull_request
jobs:
  unit-java:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: 21 }
      - run: mvn -B test jacoco:report
      - name: Enforce JaCoCo minimum
        # The 0.80 minimum is declared as a <rule> in the jacoco-maven-plugin
        # configuration in pom.xml; jacoco:check takes no command-line flag for it.
        run: mvn -B jacoco:check -Djacoco.haltOnFailure=true
      - uses: actions/upload-artifact@v4
        with:
          name: jacoco-xml
          path: target/site/jacoco/jacoco.xml
  unit-js:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:unit -- --coverage
      - name: Istanbul threshold
        # Needs the json-summary coverage reporter so coverage-summary.json exists.
        run: |
          node -e "
            const s=require('./coverage/coverage-summary.json').total;
            if(s.lines.pct<80) process.exit(1);
          "
# GitLab CI
unit-test:
  stage: test
  image: maven:3.9-eclipse-temurin-21   # the plain eclipse-temurin image ships no Maven
  script:
    - mvn -B test jacoco:report jacoco:check
    # Print the HTML report's total row so the coverage regex has a log line to match.
    - grep -o 'Total[^%]*%' target/site/jacoco/index.html
  coverage: '/Total.*?([0-9]{1,3})%/'
  artifacts:
    reports:
      junit: target/surefire-reports/TEST-*.xml
      coverage_report:
        coverage_format: jacoco
        path: target/site/jacoco/jacoco.xml
unit-js:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm run test:unit -- --coverage --coverageReporters=cobertura
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml
$ pytest tests/unit -q --cov=app --cov-fail-under=80
$ mvn test -Dtest="*UnitTest"
$ npm run test:unit -- --watchAll=false
$ mvn org.pitest:pitest-maven:mutationCoverage -DtargetClasses=com.acme.auth.*
Unit-test security primitives directly: password hashing rounds, JWT claim validation, RBAC matrix lookups, rate-limit counters. These are cheap to test and expensive to miss—SAST won't catch a logic bug that grants admin to every user when role is null.
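A minimal sketch of such a test, assuming a hypothetical role-to-permission matrix (`ROLE_PERMISSIONS` and `is_allowed` are invented for illustration). The important case is the missing role, which must be denied rather than silently treated as privileged.

```python
# Hypothetical RBAC matrix; role and action names are illustrative only.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "viewer": {"read"},
}


def is_allowed(role, action):
    """Deny by default: a missing or unknown role maps to an empty permission set."""
    return action in ROLE_PERMISSIONS.get(role, set())


def test_missing_role_is_denied():
    assert not is_allowed(None, "read")
    assert not is_allowed("", "delete")


def test_unknown_role_is_denied():
    assert not is_allowed("superuser", "read")


def test_viewer_cannot_delete():
    assert is_allowed("viewer", "read")
    assert not is_allowed("viewer", "delete")
```

These run under pytest in well under a millisecond each, which is what makes exhaustive coverage of the matrix practical at the unit layer.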
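JaCoCo instruments bytecode at runtime via a Java agent; Istanbul instruments JavaScript source (typically through a Babel plugin), while c8 reads V8's built-in coverage data and reuses Istanbul's report formats. Both produce XML/JSON consumed by SonarQube and Codecov; GitLab's coverage badge is parsed from the job log using the job's coverage regex.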
Payment teams often require 90%+ mutation score on ledger modules while allowing 60% line coverage on admin UI; risk-weighted thresholds beat one-size-fits-all gates.
Integration Testing
Integration tests prove your code works with real collaborators (PostgreSQL, Redis, Kafka, partner HTTP APIs) without paying the full cost of E2E environments. Testcontainers, Spring Boot test slices, WireMock, and Pact contracts form the middle layer that catches wiring bugs unit mocks hide.
Threat model
Unit tests with mocked repositories never validate SQL migrations, connection pool exhaustion, or JSON field renames across services. Deploying untested integration paths causes silent data corruption, 500s on first real traffic, and cross-team blame when "your service changed the schema." Consumer-driven contracts prevent the classic microservices trap: provider ships breaking API; consumer discovers it in production Friday night.
| Tool | Use case | Runs in CI | Prerequisite |
|---|---|---|---|
| Testcontainers | Ephemeral Postgres, Redis, LocalStack | Docker-in-Docker or privileged runner | Docker socket or cloud container service |
| Spring Boot Test | @SpringBootTest, @DataJpaTest | Standard JVM runner | Test profile YAML; Testcontainers for DB |
| WireMock | Stub external HTTP APIs | Embedded JAR; no network egress | Record/replay or hand-crafted stubs |
| Pact | Consumer-driven API contracts | Publish to Pact Broker on merge | Broker URL + token; provider verify job |
Testcontainers pattern
Testcontainers spins real container images per test class or suite, maps random ports, and tears down after JVM exit. Use singleton containers for speed (one Postgres for entire suite) with careful state reset between tests.
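import static org.assertj.core.api.Assertions.assertThat;
import java.math.BigDecimal;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

@Testcontainers
@SpringBootTest
class OrderRepositoryIT {
    // Static field: one container is shared by every test in this class.
    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16")
        .withDatabaseName("orders_test");

    // Point Spring's datasource at the container's randomly mapped port.
    @DynamicPropertySource
    static void props(DynamicPropertyRegistry r) {
        r.add("spring.datasource.url", postgres::getJdbcUrl);
        r.add("spring.datasource.username", postgres::getUsername);
        r.add("spring.datasource.password", postgres::getPassword);
    }

    @Autowired OrderRepository repo;

    @Test void persistsOrderWithIdempotencyKey() {
        var order = repo.save(new Order("idem-abc", BigDecimal.TEN));
        assertThat(repo.findByIdempotencyKey("idem-abc")).contains(order);
    }
}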
WireMock for outbound dependencies
When you cannot run the real payment gateway in CI, WireMock simulates latency, error codes, and payload shapes. Record production traffic once (sanitized), commit stubs to VCS, verify stubs match OpenAPI spec in lint stage.
Pact: consumer-driven contracts
The consumer writes a pact file describing the expected request and response. CI publishes it to the Pact Broker. The provider's CI then verifies the pact against a deployed or in-process API, so breaking changes fail before merge. Enable can-i-deploy to block a deploy when consumer and provider versions are incompatible.
sequenceDiagram
    participant C as Consumer CI
    participant B as Pact Broker
    participant P as Provider CI
    participant CD as Deploy gate
    C->>C: Run consumer tests → pact.json
    C->>B: Publish pact (v1.2.3+abc123)
    P->>B: Fetch pending pacts
    P->>P: Verify against provider API
    P->>B: Record verification result
    CD->>B: can-i-deploy? consumer + provider + env
    B-->>CD: OK / BLOCK
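The deploy-gate question in the flow above can be modelled as a lookup in a matrix of verification results. The sketch below is a deliberately simplified model, not the Pact Broker's actual algorithm; the function name, data shapes, and version strings are all assumptions for illustration.

```python
def can_i_deploy(provider_version, deployed_consumers, verified):
    """Simplified compatibility check for deploying a provider version.

    deployed_consumers: {consumer name: version currently in the target environment}
    verified: set of (consumer, consumer_version, provider_version) that passed verification
    Returns (ok, list of consumer versions lacking a successful verification).
    """
    unverified = sorted(
        (name, version)
        for name, version in deployed_consumers.items()
        if (name, version, provider_version) not in verified
    )
    return len(unverified) == 0, unverified


# Hypothetical verification matrix recorded by provider CI runs.
verified = {
    ("web-app", "1.2.3", "9.0.0"),
    ("mobile-app", "4.1.0", "9.0.0"),
    ("web-app", "1.2.3", "9.1.0"),
}
staging = {"web-app": "1.2.3", "mobile-app": "4.1.0"}

ok_old, _ = can_i_deploy("9.0.0", staging, verified)
ok_new, blockers = can_i_deploy("9.1.0", staging, verified)
```

In this toy data, provider 9.1.0 is blocked because the mobile consumer running in staging has no successful verification against it yet.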
Pipeline integration
# GitHub Actions
name: Integration + Pact
on: [pull_request, push]
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: 21 }
      - run: docker pull postgres:16
      - run: mvn -B verify -Pintegration
      - uses: actions/upload-artifact@v4
        with:
          name: junit-integration
          path: target/failsafe-reports/*.xml
  pact-consumer:
    needs: integration
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:pact
      - name: Publish to Pact Broker
        run: |
          npx pact-broker publish ./pacts \
            --consumer-app-version=${{ github.sha }} \
            --broker-base-url=${{ secrets.PACT_BROKER_URL }} \
            --broker-token=${{ secrets.PACT_BROKER_TOKEN }}
  pact-provider-verify:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Verification uses the pact-provider-verifier CLI; the broker client has no verify command.
      - run: |
          npx pact-provider-verifier \
            --provider=payments-api \
            --provider-base-url=https://staging.internal/api \
            --pact-broker-base-url=${{ secrets.PACT_BROKER_URL }} \
            --broker-token=${{ secrets.PACT_BROKER_TOKEN }} \
            --provider-app-version=${{ github.sha }} \
            --publish-verification-results
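# GitLab CI
integration-test:
  stage: test
  image: maven:3.9-eclipse-temurin-21   # the docker image has no JDK or Maven
  services:
    - name: docker:24-dind
      command: ["--tls=false"]
  variables:
    # Testcontainers reaches the dind service over plain TCP.
    DOCKER_HOST: "tcp://docker:2375"
    DOCKER_TLS_CERTDIR: ""
  script:
    - mvn -B verify -Pintegration
  artifacts:
    reports:
      junit: target/failsafe-reports/TEST-*.xml
pact-publish:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm run test:pact
    - npx pact-broker publish ./pacts
      --consumer-app-version=$CI_COMMIT_SHA
      --broker-base-url=$PACT_BROKER_URL
      --broker-token=$PACT_BROKER_TOKEN
  needs: [integration-test]
pact-verify:
  stage: test
  image: node:20
  script:
    - npm ci
    # Verification uses the pact-provider-verifier CLI; the broker client has no verify command.
    - npx pact-provider-verifier
      --provider=payments-api
      --provider-base-url=$STAGING_API_URL
      --pact-broker-base-url=$PACT_BROKER_URL
      --broker-token=$PACT_BROKER_TOKEN
      --provider-app-version=$CI_COMMIT_SHA
      --publish-verification-results
  rules:
    - if: $CI_COMMIT_BRANCH == "main"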
Sharing a long-lived staging database across integration tests causes order-dependent failures and data leaks between runs. Ephemeral Testcontainers per job cost seconds of pull time but eliminate cross-talk.
Pre-pull container images in a warm-up job or use Ryuk-disabled Testcontainers with image cache on self-hosted runners— cuts integration stage from 8 minutes to 3 on large monorepos.
Never point integration tests at production databases—even read-only replicas leak PII in logs. Use synthetic fixtures; scan test SQL dumps for secrets before commit. Pact Broker tokens are deploy-gate credentials—store in vault, rotate quarterly.
Full environment integration (docker-compose entire stack) catches more bugs but mirrors E2E flakiness. Focused integration (one real dependency per test) scales better—compose only what the module under test touches.
E2E Testing
End-to-end tests validate the system as users experience it—browser clicks, API chains, auth cookies, CDN headers. Keep the suite small and stable: smoke paths on every deploy, full regression nightly, synthetic probes in production.
Threat model
Integration tests prove services talk correctly; they do not prove the login button works after a CSS refactor or that TLS termination routes /api to the right backend. E2E gaps manifest as revenue-impacting outages: checkout broken, MFA loop, admin panel exposed without auth. Conversely, a bloated E2E suite blocks releases and trains teams to ignore red builds.
| Framework | Strength | Weakness | Best for |
|---|---|---|---|
| Playwright | Multi-browser, auto-wait, trace viewer | Learning curve for fixtures | Modern SPAs, API + UI combo |
| Cypress | Great DX, time-travel debugger | Single-tab; no WebDriver Safari | React/Vue teams, component E2E |
| Selenium Grid | Mature, language bindings | Flaky without explicit waits | Legacy enterprise suites |
Smoke tests post-deploy
After staging deploy, run a @smoke tagged subset: login, create resource, read resource, logout—under two minutes. Gate production promotion on smoke green. Use the same container digest tested in CI, deployed to staging via GitOps.
// e2e/smoke/auth.spec.ts
import { test, expect } from '@playwright/test';

test.describe('@smoke', () => {
  test('user can sign in and reach dashboard', async ({ page }) => {
    await page.goto('/login');
    await page.getByLabel('Email').fill(process.env.E2E_USER!);
    await page.getByLabel('Password').fill(process.env.E2E_PASS!);
    await page.getByRole('button', { name: 'Sign in' }).click();
    // Web-first assertions retry until the condition holds or the timeout expires.
    await expect(page).toHaveURL(/dashboard/);
    await expect(page.getByRole('heading', { name: 'Overview' })).toBeVisible();
  });
});
Synthetic monitoring
Production synthetics (Checkly, Datadog Synthetics, Grafana k6 cloud) hit critical paths every few minutes from multiple regions. They are E2E tests with SLAs—alert on-call when p95 exceeds budget, not when a developer's laptop sleep broke local WebDriver. Store credentials in vault; rotate synthetic user passwords like any service account.
Visual regression
Percy, Chromatic, or Playwright snapshot compare screenshots against baselines—catch unintended UI drift. Run on design-system PRs and marketing pages; avoid snapshotting dynamic charts with live data (false positives). Store baselines in Git LFS or vendor cloud; review diffs in PR checks.
# GitHub Actions
name: E2E Smoke (post-deploy)
on:
  workflow_dispatch:
    inputs:
      staging_url: { required: true }
  deployment_status:
jobs:
  smoke:
    # Run after a successful deployment, or when started manually with a URL.
    if: github.event_name == 'workflow_dispatch' || github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci && npx playwright install --with-deps chromium
      - name: Run smoke against staging
        env:
          BASE_URL: ${{ inputs.staging_url || github.event.deployment_status.environment_url }}
          E2E_USER: ${{ secrets.E2E_USER }}
          E2E_PASS: ${{ secrets.E2E_PASS }}
        run: npx playwright test --grep @smoke --retries=0
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-traces
          path: test-results/
# GitLab CI
e2e-smoke:
  stage: deploy-verify
  image: mcr.microsoft.com/playwright:v1.42.0-jammy
  variables:
    BASE_URL: https://staging.example.com
  script:
    - npm ci
    - npx playwright test --grep @smoke --retries=0
  artifacts:
    when: on_failure
    paths: [test-results/, playwright-report/]
  needs: [deploy-staging]
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
Distinguish smoke (fast, blocking, happy path) from regression (broad, nightly, may be quarantined). Interviewers want to hear you cap smoke at ~10 scenarios and invest in observability for the long tail.
SaaS teams often run Playwright smoke in CI against ephemeral preview URLs (review apps), full regression nightly on staging, and three synthetic checks in prod (login, search, checkout). Visual regression only on static marketing site—app UI too dynamic.
Hard-coded sleep(5000) in E2E tests is flaky when the system is slow and wasteful when it is fast. Use Playwright auto-waiting or Cypress retry-ability; never arbitrary sleeps.
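The difference between a fixed sleep and condition-based waiting can be sketched generically. The helper below is not how Playwright or Cypress implement their waits; it is a small stand-alone illustration of the polling idea, with `wait_until` and its parameters invented for the example.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it is truthy or `timeout` seconds have passed.

    Returns True as soon as the condition holds, so fast systems are not
    slowed down; returns False on timeout instead of hanging.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated page that becomes ready after roughly 100 ms.
start = time.monotonic()
ready_at = start + 0.1
became_ready = wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0)
elapsed = time.monotonic() - start
```

A fixed two-second sleep would always cost two seconds here; the polling version returns shortly after the condition becomes true and still tolerates slower runs up to the timeout.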
Performance in CI
Functional tests prove correctness; performance tests prove the system still meets SLAs after each change. k6, Gatling, and JMH belong in CI as budget gates, not annual load-test theatre, with thresholds that block merges when p95 latency or throughput regresses beyond tolerance.
Threat model
A green functional suite hides N+1 queries, lock contention, and GC pauses that only appear at scale. One ORM change can double database round-trips; one unbounded cache can OOM the pod under Black Friday traffic. Without automated performance gates, regressions reach production and become emergency tuning sprints—while security patches wait in queue.
| Tool | Layer | Output | Typical CI cadence |
|---|---|---|---|
| k6 | HTTP/API load (JS scripts) | JSON summary, Grafana cloud | PR smoke (low VUs); nightly full load |
| Gatling | HTTP, WebSocket, scenario DSL | HTML report, JUnit-compatible | Release branch; JVM teams |
| JMH | Micro-benchmark hot paths | ns/op, allocations | On crypto/serialization module changes |
Performance budgets
Define budgets per endpoint or user journey: p95 latency < 300 ms, error rate < 0.1%, throughput ≥ 500 RPS at 50 VUs. Store baselines in Git; fail CI when regression exceeds 10% without approved exception. Align budgets with SLOs—CI thresholds should be stricter than prod alerts to catch drift early.
// perf/smoke.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 20,
  duration: '60s',
  thresholds: {
    http_req_duration: ['p(95)<300'], // p95 latency budget in ms
    http_req_failed: ['rate<0.01'],   // fewer than 1% failed requests
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/health`);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.3);
}
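Alongside absolute thresholds like those in the script above, the baseline comparison described earlier (fail when a metric regresses more than 10%) might look like the following sketch. The metric names and sample values are assumptions, and baselines are assumed to be non-zero.

```python
def check_budget(baseline, current, tolerance=0.10, higher_is_better=()):
    """Return metrics that regressed by more than `tolerance` versus baseline.

    Values are relative changes in percent. For metrics listed in
    `higher_is_better` (for example throughput), a drop counts as a regression.
    """
    regressions = {}
    for metric, base in baseline.items():
        change = (current[metric] - base) / base
        if metric in higher_is_better:
            change = -change
        if change > tolerance:
            regressions[metric] = round(change * 100, 1)
    return regressions


# Hypothetical stored baseline and the numbers from the current run.
baseline = {"p95_ms": 240.0, "error_rate": 0.0008, "rps": 520.0}
current = {"p95_ms": 275.0, "error_rate": 0.0008, "rps": 500.0}

regressions = check_budget(baseline, current, higher_is_better=("rps",))
print(regressions)
```

Note that 275 ms still passes an absolute p95 < 300 ms threshold; the relative check is what catches the drift of about 15% before it reaches the hard limit.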
JMH for JVM hot paths
JMH benchmarks serialization, hashing, and regex engines in isolated JVM forks. Run it only when relevant paths change, since the full JMH suite can take 30+ minutes. Compare against stored baseline JSON; flag >5% regression.
Pipeline integration: threshold gates
# GitHub Actions
name: Performance Gate
on:
  pull_request:
    paths: ['src/**', 'perf/**']
  schedule:
    - cron: '0 3 * * *'
jobs:
  k6-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The action's version tag was obfuscated in the source; pin a released tag here.
      - uses: grafana/k6-action@<version>
        with:
          filename: perf/smoke.js
        env:
          BASE_URL: https://staging.example.com
          K6_CLOUD_TOKEN: ${{ secrets.K6_CLOUD_TOKEN }}
      - name: Assert thresholds (k6 exits non-zero on breach)
        run: echo "k6-action fails job if thresholds exceeded"
  gatling-nightly:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: 21 }
      - run: mvn -B gatling:test
      - uses: actions/upload-artifact@v4
        with:
          name: gatling-report
          path: target/gatling/**/index.html
# GitLab CI
stages: [test, perf]
k6-smoke:
  stage: perf
  image:
    name: grafana/k6:latest
    entrypoint: ['']   # the image's entrypoint is k6 itself; reset it so script lines run in a shell
  script:
    - k6 run perf/smoke.js
  variables:
    BASE_URL: https://staging.example.com
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_PIPELINE_SOURCE == "schedule"
gatling-load:
  stage: perf
  image: maven:3.9-eclipse-temurin-21   # the plain eclipse-temurin image ships no Maven
  script:
    - mvn -B gatling:test
  artifacts:
    paths: [target/gatling/]
    expire_in: 7 days
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
jmh-benchmark:
  stage: perf
  image: maven:3.9-eclipse-temurin-21
  script:
    - mvn -B -Pjmh package
    - java -jar target/benchmarks.jar -rf json -rff jmh-results.json
  artifacts:
    paths: [jmh-results.json]
  rules:
    - changes: ['src/main/java/com/acme/crypto/**/*']
Perf tests on every PR add 2–5 minutes and flaky infra risk from shared staging. Nightly only delays feedback. Compromise: k6 smoke (20 VUs, 60s) on PR; full Gatling scenario nightly and before release tags.
k6 exits with code 99 when thresholds fail—wire that directly as job failure. Gatling generates simulation.log parseable to JUnit XML via plugins for unified dashboards.
Load tests against production without approval are indistinguishable from DDoS—block prod URLs in CI config validators. Rate-limit staging endpoints; use dedicated load-test service accounts with no PII in seed data.
Test Reporting
Tests without visible reports are noise in a log stream. Unified reporting—JUnit XML, coverage uploads, Allure HTML, GitHub PR annotations—turns failures into actionable feedback and gives security auditors evidence that gates ran on every merge.
Threat model
When test output lives only in ephemeral CI logs, developers skip reading them; managers assume a green checkbox means tested; auditors cannot prove regression suites ran before a SOC2 sampling period. Missing reports also hide skipped security tests: a job that exits 0 because zero tests executed still looks successful.
| Format | GitHub Actions | GitLab CI | Best for |
|---|---|---|---|
| JUnit XML | actions/upload-artifact + check run publishers | Native artifacts:reports:junit | All JVM, pytest, Go test exporters |
| Cobertura / JaCoCo | Codecov, SonarCloud | Native coverage_report widget | MR diff coverage comments |
| Allure | Artifact + Pages deploy | Pages or artifact browser | Rich steps, attachments, history |
| Istanbul lcov | Codecov PR comment | Cobertura conversion | Frontend JS/TS projects |
JUnit XML: the lingua franca
pytest (--junitxml), Jest (jest-junit), Go (-json → converter), and Maven Surefire all emit JUnit-compatible XML. GitLab ingests natively into MR Test widget; GitHub uses third-party or dorny/test-reporter for PR annotations.
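Because the format is shared, one small guard can protect every stack against the "zero tests executed" case. The sketch below parses a JUnit-style report with the standard library and refuses to pass when nothing ran. It assumes a flat report (no nested testsuite elements) and the usual `tests`, `failures`, `errors`, and `skipped` attributes; the sample XML strings are invented.

```python
import xml.etree.ElementTree as ET


def summarize(junit_xml):
    """Sum the counters across all <testsuite> elements of a JUnit-style report."""
    root = ET.fromstring(junit_xml)
    suites = [root] if root.tag == "testsuite" else list(root.iter("testsuite"))
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals


def gate(totals):
    """Pass only if at least one test actually ran and none failed or errored."""
    executed = totals["tests"] - totals["skipped"]
    return executed > 0 and totals["failures"] == 0 and totals["errors"] == 0


# Hypothetical reports: one where nothing ran, one healthy.
EMPTY = '<testsuites><testsuite name="security" tests="0" failures="0" errors="0" skipped="0"/></testsuites>'
HEALTHY = '<testsuites><testsuite name="unit" tests="3" failures="0" errors="0" skipped="1"/></testsuites>'

empty_ok = gate(summarize(EMPTY))
healthy_ok = gate(summarize(HEALTHY))
```

Run as a final pipeline step, a check like this turns a silently empty suite into a red job instead of a misleading green one.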
GitHub annotations
Publish test failures inline on PR diffs—developers fix without opening CI logs. Pair with actions/upload-artifact for Playwright traces and Allure ZIP on failure.
# GitHub Actions
name: Test Report
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    # dorny/test-reporter creates a check run, which needs write access to checks.
    permissions: { contents: read, checks: write }
    steps:
      - uses: actions/checkout@v4
      - run: |
          npm ci
          npm run test:ci -- --reporters=default --reporters=jest-junit
        env:
          JEST_JUNIT_OUTPUT_DIR: ./reports
          JEST_JUNIT_OUTPUT_NAME: junit.xml
      - name: Publish test report
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Unit Tests
          path: reports/junit.xml
          reporter: jest-junit
      - name: Upload Allure
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: allure-results
          path: allure-results/
      - name: Coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          # Assumes test:ci runs Jest with --coverage; a Java job would upload
          # target/site/jacoco/jacoco.xml instead.
          files: ./coverage/lcov.info
          token: ${{ secrets.CODECOV_TOKEN }}
# GitLab CI
test:
  stage: test
  script:
    # The coverage regex reads the terminal report, so emit term as well as xml.
    - pytest --junitxml=report.xml --cov=app --cov-report=term --cov-report=xml
    - generate-allure-report || true
  coverage: '/TOTAL.*\s+(\d+%)$/'
  artifacts:
    when: always
    expire_in: 14 days
    paths:
      - report.xml
      - coverage.xml
      - allure-report/
    reports:
      junit: report.xml
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml
Allure for narrative debugging
Allure attaches screenshots, HTTP logs, and step hierarchies, which are essential for E2E failure triage. Generate on every run; publish HTML to GitLab Pages or S3 on main. Do not keep Allure output from every PR forever: set expire_in: 7 days on artifacts.
JaCoCo & Istanbul in merge gates
Upload coverage to SonarQube or Codecov for diff coverage comments. Block merge when overall or new-code coverage drops. JaCoCo XML path: target/site/jacoco/jacoco.xml; Istanbul: coverage/lcov.info or Cobertura export for GitLab native widget.
$ pytest --junitxml=report.xml --cov=app --cov-report=html
$ allure serve allure-results/
$ open target/site/jacoco/index.html
$ npx nyc report --reporter=lcov
Set artifacts:when: always in GitLab so failed test XML still uploads; otherwise you debug blind on red jobs. On GitHub, mirror with if: always() on upload steps.
Enterprise audit packs bundle JUnit XML + JaCoCo + pipeline URL per release tag. Automate the export in the release job; auditors should not SSH into Jenkins history from 2019.
Explain how you'd prove to a CISO that tests ran: immutable CI logs, signed artifacts, JUnit XML retention policy, and branch protection requiring specific check names—not "we usually run tests."
Flaky Tests
A flaky test passes and fails on the same commit—destroying trust in CI faster than no tests at all. When developers habitually re-run pipelines until green, security gates become optional. Treat flakiness as an incident: quarantine, root-cause, fix or delete—never normalize retries.
Threat model: trust erosion
Flaky tests are a supply-chain vulnerability for process integrity. The attack does not require a hacker—teams internalize "CI is broken again" and merge with admin override, skip failing jobs, or disable checks in branch protection. A real SQL injection finding in SAST gets lumped with another random E2E timeout and ignored. Measuring flake rate (failures that pass on retry without code change) is as important as coverage percentage.
| Root cause | Example | Fix | Layer |
|---|---|---|---|
| Timing / async | Assert before React re-render | Playwright auto-wait; await expect.poll | E2E |
| Shared state | Global counter; static clock | Isolate fixtures; inject dependencies | Unit / integration |
| Order dependence | Test B assumes Test A ran first | Randomized order (pytest-randomly); independent setup | All |
| Infra / resource | Docker pull timeout; port collision | Pre-pull images; Testcontainers reuse | Integration |
| External dependency | Calls api.weather.com in test | WireMock; contract stubs | Integration / E2E |
| Data collision | Two tests create a user with the same hard-coded email address | UUID suffix factories; transactional rollback | Integration |
Quarantine workflow
- Detect — track retry-success rate per test ID in CI analytics (Buildkite Test Engine, GitLab Flaky Test report).
- Quarantine — move to @quarantine tag; exclude from merge-blocking suite; file ticket with owner.
- SLA — fix within 5 business days or delete the test (no infinite quarantine).
- Verify — run quarantined tests 50× in isolation before returning to blocking suite.
- Postmortem — if flake rate > 2% of runs, pause feature work for stability sprint.
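The detect step can be approximated from raw CI history. The sketch below treats a test as flaky on a commit when it both passed and failed there, and reports the share of commits on which each test flaked. The input shape and test names are assumptions; CI analytics products compute this from their own data.

```python
from collections import defaultdict


def flake_report(runs):
    """Per-test flake rate from execution records.

    runs: iterable of (test_id, commit_sha, passed) tuples, one per execution.
    A test counts as flaky on a commit if it both passed and failed on it.
    Returns {test_id: flaky commits / commits on which the test ran}.
    """
    outcomes = defaultdict(set)
    for test_id, sha, passed in runs:
        outcomes[(test_id, sha)].add(passed)

    per_test = defaultdict(lambda: [0, 0])  # test_id -> [flaky commits, total commits]
    for (test_id, _sha), seen in outcomes.items():
        per_test[test_id][1] += 1
        if seen == {True, False}:
            per_test[test_id][0] += 1
    return {t: flaky / total for t, (flaky, total) in per_test.items()}


# Hypothetical history: test_login failed then passed on retry for commit "abc".
runs = [
    ("test_login", "abc", False),
    ("test_login", "abc", True),
    ("test_login", "def", True),
    ("test_sum", "abc", True),
    ("test_sum", "def", True),
]

rates = flake_report(runs)
quarantine_candidates = sorted(t for t, rate in rates.items() if rate > 0)
```

A consistently failing test never shows up here, which is intended: it is a real failure to fix, not a quarantine candidate.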
flowchart TD
F[Flaky failure detected] --> R{Retry passed?}
R -->|Yes| Q[Quarantine test\nnon-blocking]
R -->|No| B[Blocking failure\nfix normally]
Q --> T[Jira ticket + owner]
T --> FIX[Root cause fix]
FIX --> SOAK[50x soak run]
SOAK -->|pass| UNQ[Restore to blocking]
SOAK -->|fail| Q
Q --> SLA{SLA exceeded?}
SLA -->|Yes| DEL[Delete or rewrite test]
Retry anti-pattern
CI-level --retries=3 on every test hides flakiness and multiplies pipeline duration. It is acceptable only for known infra blips (container pull) with metrics exported, never as permanent policy. Jest's jest.retryTimes() and Playwright's retries setting should stay at zero on merge-blocking smoke; use a separate quarantine job with retries for investigation.
| Approach | Short-term effect | Long-term effect |
|---|---|---|
| Blind retries | Green pipeline | Trust collapse; real bugs masked |
| Quarantine + SLA | Occasional yellow quarantine job | Stable blocking suite; accountable owners |
| Delete flaky E2E | Less coverage | Forces replacement with unit/integration |
| Stability sprint | Slower features 1 week | DORA recovery; security gates respected again |
Pipeline integration: detect without normalizing
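# GitHub Actions
name: Flake Detection
on:
  schedule:
    - cron: '0 4 * * 1-5'
jobs:
  soak-unit:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false   # let all ten runs finish so intermittent failures stay visible
      matrix: { run: [1,2,3,4,5,6,7,8,9,10] }
    steps:
      - uses: actions/checkout@v4
      # No retry configuration: a failure in any of the ten runs is a flake signal.
      - run: npm ci && npm run test:unit
  blocking-smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npx playwright install --with-deps chromium
      # No retries: failures are real or quarantine candidates.
      - run: npx playwright test --grep @smoke --grep-invert @quarantine --retries=0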
# GitLab CI
test-blocking:
  stage: test
  script:
    # No rerun plugin flags here: a failure is either real or a quarantine candidate.
    - pytest -m "not quarantine" --junitxml=report.xml
  artifacts:
    when: always
    reports:
      junit: report.xml
test-quarantine:
  stage: test
  allow_failure: true
  script:
    - pytest -m quarantine -v
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
# GitLab 16+ flaky test detection surfaces in MR widget automatically
# when junit history shows inconsistent results
Branch protection bypass and allow_failure: true on security test jobs are equivalent risks: attackers time merges when teams routinely bypass red CI. Flaky security tests must be quarantined and fixed at P1, not left failing.
Marking tests @quarantine without allow_failure separation means they still block merges while flaking—worst of both worlds. Split jobs: blocking (no quarantine tag) vs investigative (quarantine only).
GitLab compares JUnit outcomes across pipelines for the same commit SHA to flag flaky tests. Buildkite Test Analytics hashes test name + file for trend lines. Export the same JUnit XML to either—format is the integration point.
Teams that recovered from "CI is a joke" culture instituted a rule: third flake in a week → test deleted or rewritten same sprint. E2E count dropped 40%; unit count rose; incident rate fell. Security scan failures started getting fixed again.
Zero tolerance for flakiness slows feature velocity while stability work clears the backlog. Retries everywhere ship faster until a real regression and a real CVE both slip through on the same re-run green build. Choose discipline.
Learning path: CI Pipelines → Testing Strategy (this page) → Security Scanning. Green functional tests do not replace SAST—both gates must be trusted.