End-to-end tests that pass locally and fail in CI are one of the most frustrating failure modes in test automation. The test logic looks correct, the app works in the browser, and the failure seems to appear only after a pipeline runs in a clean environment. That pattern is common enough that many teams stop trusting their suites, then start treating test failures as noise instead of signals.

The better approach is to separate the problem into a few categories: timing issues, test data problems, browser differences, and CI environment drift. Once you can classify the failure, you can debug it quickly and decide whether to fix the test, fix the app, or harden the pipeline.

This guide is a practical checklist for teams trying to answer a specific question: why do E2E tests fail only in CI? It focuses on reproducible triage, not vague advice like “add waits” or “make tests stable.” If you want a broader refresher on the discipline behind these checks, see software testing, test automation, and continuous integration.

Start with a triage rule, not a guess

When a test fails only in CI, resist the urge to immediately change selectors, add sleeps, or rerun the job until it turns green. Those actions often hide the underlying issue and make the suite harder to trust.

Use this triage rule instead:

First determine whether the failure is about time, state, browser behavior, or environment differences. Do not patch the symptom until the category is known.

That sounds simple, but it changes the workflow. Instead of reading the stack trace as a single event, inspect the failure as a sequence:

  1. Did the application render the expected UI?
  2. Did the test interact with the element too early or too late?
  3. Did the test operate on the wrong data or stale state?
  4. Did the browser or runtime differ from local development?
  5. Did the CI environment change network, filesystem, timezone, locale, or resource limits?

If you can answer those questions, you can usually isolate the root cause.
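One way to make the triage rule concrete is to record what you observed as a handful of flags and map them to the four categories. The sketch below is illustrative only: the `TriageSignals` shape, the category names, and the ordering (timing first, environment last, mirroring the order of the steps in this guide) are assumptions, not part of any test runner.

```typescript
// Hypothetical triage helper: field names and ordering are assumptions for illustration.
type FailureCategory = "timing" | "data" | "browser" | "environment" | "unknown";

interface TriageSignals {
  interactionMistimed: boolean; // element used too early or too late
  wrongOrStaleData: boolean;    // wrong record, stale state, leftover data
  browserDiffers: boolean;      // version, headless mode, viewport, fonts
  environmentDiffers: boolean;  // network, filesystem, timezone, locale, limits
}

// Return every category worth investigating, in the order this guide checks them.
function candidateCategories(signals: TriageSignals): FailureCategory[] {
  const candidates: FailureCategory[] = [];
  if (signals.interactionMistimed) candidates.push("timing");
  if (signals.wrongOrStaleData) candidates.push("data");
  if (signals.browserDiffers) candidates.push("browser");
  if (signals.environmentDiffers) candidates.push("environment");
  // No signal at all means the evidence is insufficient, not that the test is fine.
  return candidates.length > 0 ? candidates : ["unknown"];
}
```

An `"unknown"` result is a prompt to gather more evidence before touching the test, which is the point of the rule.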

Build a reproducible failure packet

Before debugging the test itself, capture enough evidence to compare local and CI runs. Without that, you end up chasing a moving target.

A useful failure packet includes:

  • The exact test name and retry count
  • The CI job ID and commit SHA
  • Browser name and version
  • Test runner version
  • Screenshot at failure time
  • Video, if your runner supports it
  • Console logs from the browser
  • Network logs, especially failed requests and status codes
  • Application logs correlated to the test window
  • Environment variables relevant to the test

If the failure is intermittent, add timestamps to each significant test step. A log entry like “clicked submit” is less useful than “clicked submit at 12:04:31.218, then waited for confirmation”.
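A timestamped step log can be as small as the following sketch. `createStepLogger` is a hypothetical helper, not a runner feature; the clock is injected so the log format can be checked deterministically.

```typescript
// Hypothetical helper: `createStepLogger` is not part of any test runner.
type Clock = () => Date;

function createStepLogger(testName: string, clock: Clock = () => new Date()) {
  const entries: string[] = [];
  return {
    // Record a step with an ISO-8601 UTC timestamp (millisecond precision).
    step(message: string): void {
      const line = `[${clock().toISOString()}] ${testName}: ${message}`;
      entries.push(line);
      console.log(line);
    },
    // Return a copy of everything logged so far, e.g. to attach to a failure report.
    lines(): string[] {
      return entries.slice();
    },
  };
}
```

Calling `log.step("clicked submit")` right before the click and `log.step("confirmation visible")` right after the assertion gives two timestamps you can line up against server logs from the same window.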

For browser-based tests, trace artifacts are often more useful than screenshots because they show order of operations, not just the final state. In Playwright, for example, tracing can expose whether a locator resolved too early or whether a navigation happened between steps.

import { test, expect } from '@playwright/test';
test('checkout flow', async ({ page }) => {
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Place order' }).click();
  await expect(page.getByText('Order confirmed')).toBeVisible();
});

That test looks fine, but if CI is slower than local or a modal animates into place, the click may fail or target the wrong element. Evidence matters more than intuition.
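To collect the trace, screenshot, and video artifacts described above only when they are needed, Playwright exposes `trace`, `screenshot`, and `video` options under `use` in its config. The following is a sketch of one reasonable setup; the retry count and the reliance on a `CI` environment variable are assumptions about your pipeline.

```typescript
// playwright.config.ts (sketch; adjust values to your pipeline)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Assumption: the CI provider sets the CI environment variable.
  retries: process.env.CI ? 2 : 0,
  use: {
    trace: 'on-first-retry',       // record a trace when a failed test is retried
    screenshot: 'only-on-failure', // keep a screenshot only for failing tests
    video: 'retain-on-failure',    // record video, keep it only when the test fails
  },
});
```

A recorded trace can then be opened with `npx playwright show-trace` followed by the path to the trace archive saved by the CI job.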

Step 1: rule out timing issues first

Timing problems are the most common reason E2E tests fail only in CI. CI runners are usually slower, less predictable, and more resource-constrained than a local laptop. Parallel jobs compete for CPU, shared runners have noisy neighbors, and containers may have lower memory or different startup characteristics.

Look for these patterns:

  • The element exists, but is not yet visible or actionable
  • The test passes on rerun without code changes
  • The failure happens around navigation, animations, or async UI updates
  • A request finishes after the assertion runs
  • A loading spinner appears in CI but not locally
  • A modal, toast, or delayed render changes the DOM between steps

The key is to distinguish between fixed waits and condition-based waits. Fixed waits are brittle because they encode assumptions about how long the app should take. Condition-based waits express what the test actually needs.

Good waits, bad waits

Bad pattern:

await page.waitForTimeout(3000);
await page.getByRole('button', { name: 'Continue' }).click();

Better pattern:

await expect(page.getByRole('button', { name: 'Continue' })).toBeVisible();
await page.getByRole('button', { name: 'Continue' }).click();

Even better, wait for the business state that matters:

await expect(page.getByText('Payment method saved')).toBeVisible();

The difference is important. Visibility is a UI condition. Confirmation text is a domain condition. Domain conditions usually make tests more stable because they reflect when the app is actually ready.

Watch for race conditions in test setup

Timing bugs often start before the assertion. Common examples include:

  • Seeding data asynchronously, then starting the test too early
  • Logging in through the UI and immediately navigating to a protected page
  • Waiting for a network response that completes before the listener is attached
  • Creating test records, then querying them before the backend transaction is committed

If the setup is not deterministic, the rest of the test becomes a coin flip. For API-driven setup, prefer explicit API calls and verify the response before moving on.

import requests

resp = requests.post(
    "https://example.test/api/users",
    json={"email": "qa@example.com", "role": "admin"},
    timeout=10,
)
assert resp.status_code == 201
user_id = resp.json()["id"]

When setup and assertions are both asynchronous, the failure can look like a UI bug even though the real problem is test orchestration.
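When setup depends on something that becomes true eventually, such as a record that is only queryable after the backend commits, a bounded poll makes the wait explicit instead of hoping the timing works out. The helper below is a minimal sketch; `pollUntil` and its default timeout and interval are assumptions, not a library API.

```typescript
// Hypothetical helper: poll an async check until it yields a value or time runs out.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  { timeoutMs = 10000, intervalMs = 250 }: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await check();
    // Any defined value counts as "ready"; undefined means "try again".
    if (result !== undefined) return result;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a setup step this might wrap a lookup of the record that was just created, where the lookup function is your own and returns `undefined` until the record exists. The timeout turns a silent race into a clear setup failure.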

Step 2: verify test data is truly isolated

The next major cause is test data problems. CI environments usually run from a clean slate, or at least a differently shaped slate than your local machine. That can expose hidden dependencies on seeded records, cached users, or state left behind by earlier tests.

Common data problems in CI

  • A test assumes a user already exists
  • Two tests reuse the same email address or account name
  • Cleanup is incomplete, so the second test sees records from the first
  • Fixtures differ between local and CI
  • Database seeding runs in a different order in CI
  • A shared environment has stale data from another pipeline

If a test passes locally after manual setup but fails in CI, the local setup probably filled in missing state that the test itself never created.

A stable e2e test should be able to create or request all the data it needs, then verify the result without depending on hidden history.

Make data unique and traceable

Use unique identifiers for each run, especially for email addresses, usernames, invoices, or project names. If the application requires unique values, include the CI run ID or timestamp in a controlled way.

const runId = process.env.CI_RUN_ID ?? Date.now().toString();
const email = `qa+${runId}@example.com`;

This does not solve every issue, but it prevents a common class of collisions. Better still, expose a test-only API or fixture system that creates the exact data shape required by the scenario.
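A run ID or timestamp alone can still collide when parallel workers create data within the same run. One possible refinement is a small factory that also mixes in a worker index and a per-worker sequence number. The names and the email format here are illustrative assumptions.

```typescript
// Hypothetical factory: names and email format are illustrative assumptions.
interface TestUser {
  email: string;
  username: string;
}

function createUserFactory(runId: string, workerIndex: number) {
  let sequence = 0;
  return function nextUser(label: string): TestUser {
    sequence += 1;
    // The run ID separates pipelines, the worker index separates parallel workers,
    // and the sequence separates users created within one worker.
    const suffix = `${runId}-w${workerIndex}-${sequence}`;
    return {
      email: `qa+${label}-${suffix}@example.com`,
      username: `${label}-${suffix}`,
    };
  };
}
```

Because the suffix appears in every generated value, a leftover record found in a shared environment can be traced back to the run and worker that created it. In Playwright, `testInfo.workerIndex` is one possible source for the worker index.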

Verify cleanup and test ordering

Data problems often become visible only when tests run in parallel. A test suite that passes serially can still be broken if two tests touch the same user, product, queue, or cart.

Check for:

  • Shared accounts used across tests
  • Mutable global config created by one test and consumed by another
  • Cleanup steps that run conditionally or not at all on failure
  • Order-dependent tests hidden by a lucky local execution order

If you see this, the fix is usually to make each test self-contained, or to move the shared setup to a dedicated fixture that is created once and never mutated.
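For cleanup that must run even when the test fails, a `try`/`finally` wrapper is one simple option. The sketch below assumes you supply your own `create` and `destroy` functions; `withResource` itself is a hypothetical helper, and many runners offer fixtures that serve the same purpose.

```typescript
// Hypothetical helper: guarantees teardown runs whether the test body passes or throws.
async function withResource<R, T>(
  create: () => Promise<R>,
  destroy: (resource: R) => Promise<void>,
  body: (resource: R) => Promise<T>,
): Promise<T> {
  const resource = await create();
  try {
    return await body(resource);
  } finally {
    // Runs on success and on failure, so a failed assertion cannot leak records.
    await destroy(resource);
  }
}
```

Each test that owns its record through a wrapper like this neither depends on earlier tests nor leaves state behind for later ones.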

Step 3: compare browser behavior, not just browser name

Many teams say “it works locally in Chrome” and assume the browser is the same in CI. Often it is not. The browser family may match, but the version, flags, GPU availability, viewport, fonts, or headless mode may differ. Common differences include:

  • Different browser version in local and CI
  • Headless mode versus headed mode differences
  • Mobile emulation or viewport mismatch
  • Font rendering differences causing layout shift
  • Locale affecting text wrapping or date formatting
  • Disabled GPU or hardware acceleration in CI
  • Security settings, certificate trust, or popup handling differences

This matters because e2e tests frequently depend on layout and interaction geometry. A button that is clickable locally may be below the fold in CI. A text label may wrap differently and change a locator match. An overlay may render because animation timing differs in headless mode.

Make browser conditions explicit

Lock the browser and runner versions where possible, and make viewport and locale settings explicit in test config.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 },
    locale: 'en-US',
    timezoneId: 'UTC',
  },
});

If a test is sensitive to browser layout, this kind of explicitness reduces unknowns. It does not eliminate rendering differences, but it makes the variance easier to reason about.

Prefer resilient locators

If your test uses XPath based on text position, CSS selectors tied to layout, or brittle nth-child rules, CI may expose fragility that local runs mask. Prefer semantic locators when possible, such as roles, labels, test IDs, or stable data attributes.

For example, avoid selecting the third button in a toolbar if the actual target is a button named “Save draft”.
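The toolbar case can be illustrated with a toy model. This is deliberately not a browser API: the toolbar is a plain array, and the extra “Undo” button in CI is an invented scenario (for instance, a feature flag that is enabled only there).

```typescript
// A toy model of a toolbar, NOT a browser API: it only illustrates why
// position-based lookups break when the UI gains or reorders elements.
interface ToolbarButton {
  role: "button";
  name: string;
}

// Brittle: depends on where the button happens to sit.
function byPosition(toolbar: ToolbarButton[], index: number): ToolbarButton | undefined {
  return toolbar[index];
}

// Resilient: depends on what the button is, like a role-plus-name locator.
function byName(toolbar: ToolbarButton[], name: string): ToolbarButton | undefined {
  return toolbar.find((button) => button.name === name);
}

const button = (name: string): ToolbarButton => ({ role: "button", name });
const localToolbar = [button("Bold"), button("Italic"), button("Save draft")];
// Invented scenario: CI renders one extra button at the start of the toolbar.
const ciToolbar = [button("Undo"), button("Bold"), button("Italic"), button("Save draft")];

console.log(byPosition(localToolbar, 2)?.name);  // Save draft
console.log(byPosition(ciToolbar, 2)?.name);     // Italic (wrong target)
console.log(byName(ciToolbar, "Save draft")?.name); // Save draft
```

In Playwright the resilient form corresponds to a locator such as `page.getByRole('button', { name: 'Save draft' })`.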

Step 4: inspect CI environment drift

CI environment drift means the test runs in a setup that is different enough from local development to change behavior. It is broader than browser drift. It includes runtime, network, filesystem, container image, secrets, and OS-level details.

Typical sources of CI drift

  • Node, Python, Java, or browser runtime versions differ
  • Docker image changes unexpectedly
  • Missing fonts, locales, or certificate bundles
  • Network latency or restricted outbound access
  • Lower CPU and memory limits
  • Different filesystem case sensitivity
  • Different default timezone
  • Secrets or environment variables not injected as expected
  • Third-party services rate limiting CI IP ranges

A test can fail because the application is broken under those conditions, or because the test assumes a richer environment than CI provides. Both are useful findings, but they require different fixes.

Reproduce CI locally as closely as possible

The fastest way to reduce drift is to run the same container image or the same runtime version locally. If CI uses Docker, try executing the test in that container on your machine.

docker run --rm -it -v "$PWD":/app -w /app node:20 bash -lc "npm ci && npm test"

If the failure disappears locally, compare environment variables, image layers, installed dependencies, and file permissions. If it persists, the problem is likely deterministic and easier to isolate.
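Comparing environments is easier when both sides emit the same small fingerprint. The sketch below captures the default timezone and locale through the standard `Intl` API and diffs two fingerprints. The helper names and the choice of fields are assumptions, and the list is a starting point rather than a complete inventory.

```typescript
// Hypothetical helpers: the fingerprint fields are a starting point, not a complete list.
type Fingerprint = Record<string, string>;

// Capture runtime defaults plus any extra values the caller supplies
// (for example the Node version, browser version, or image tag).
function captureFingerprint(extra: Fingerprint = {}): Fingerprint {
  const intl = Intl.DateTimeFormat().resolvedOptions();
  return { timeZone: intl.timeZone, locale: intl.locale, ...extra };
}

// List every key whose value differs between the two environments.
function diffFingerprints(local: Fingerprint, ci: Fingerprint): string[] {
  const keys = Object.keys({ ...local, ...ci }).sort();
  return keys
    .filter((key) => local[key] !== ci[key])
    .map((key) => `${key}: local=${local[key] ?? "<unset>"} ci=${ci[key] ?? "<unset>"}`);
}
```

Printing the fingerprint at suite startup in both environments and diffing the two outputs narrows the search to the values that actually differ.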

Check for hidden assumptions in configuration

A lot of CI-only failures come from assumptions no one wrote down:

  • The app reads a file path that exists on laptops but not in containers
  • A test expects localhost callbacks that are not reachable from CI
  • A background job relies on a service account only available in one environment
  • A test uses a clock dependent on local timezone rather than UTC

If the test interacts with dates or scheduling, make time explicit. CI should not depend on the timezone of the runner.
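Making time explicit mostly means never letting the runner's default timezone decide a calendar date. A minimal sketch, using only the standard `Date` and `Intl.DateTimeFormat` APIs; the function names are illustrative.

```typescript
// Derive a calendar date from an instant without consulting the runner's timezone.
function utcDateKey(instant: Date): string {
  return instant.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
}

// Format an instant for a named zone; the zone is an argument, never a machine default.
function formatInZone(instant: Date, timeZone: string): string {
  return new Intl.DateTimeFormat("en-US", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).format(instant);
}

const instant = new Date("2024-01-01T00:30:00Z");
console.log(utcDateKey(instant));                          // 2024-01-01
console.log(formatInZone(instant, "UTC"));                 // 01/01/2024
console.log(formatInZone(instant, "America/Los_Angeles")); // 12/31/2023
```

The same instant falls on different calendar days depending on the zone, which is exactly how a date assertion can pass on a laptop and fail on a runner set to UTC. Methods such as `getDate()` read the machine's local zone and are best kept out of assertions.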

A reproducible debugging sequence

When a failure happens, use the same sequence every time. That prevents random guesswork.

1. Re-run the single failing test with trace artifacts

Do not rerun the whole suite first. Reproduce the exact case.

2. Compare local and CI runtime details

Record browser version, environment variables, runner image, and time of failure.

3. Determine whether the failure is before or after the user action

If the click fails, the problem may be visibility or overlay timing. If the click succeeds but the assertion fails, the problem may be backend processing or data state.

4. Check request and response timing

Look for API calls that complete after the UI assertion. If needed, assert on network completion or server-side state before checking the DOM.

5. Inspect the test data path

Confirm the test created or selected the exact record it intended to use.

6. Verify CI environment parity

Check browser version, container image, locale, timezone, and CPU limits.

7. Reproduce in a container or runner-like environment

If local execution passes, run the same image or pipeline step to narrow the drift.

8. Decide whether the test or the product is wrong

Sometimes CI reveals a real defect, not a flaky test. For example, an ordering bug in the app only appears under slower startup conditions. That is still valuable. The goal is not to make the test green at any cost, but to identify the real failure mode.

How to distinguish test bugs from product bugs

A common trap is to assume every CI failure means the test is unstable. That is not always true.

The test is probably the problem if:

  • It depends on arbitrary timing or fixed sleeps
  • It uses brittle selectors
  • It assumes hidden state
  • It fails only when run in parallel
  • It requires manual preparation not encoded in the test

The product is probably the problem if:

  • The UI becomes inconsistent under slower rendering
  • Requests are not idempotent and duplicate submissions occur
  • Race conditions appear between frontend and backend events
  • The app cannot recover from normal latency or background job delays
  • State transitions are not synchronized correctly

In other words, some CI-only failures are test design issues, and some are production reliability issues exposed by realistic conditions. Both deserve attention.

Practical fixes that usually help

Once you know the category, choose a fix that matches it.

For timing issues

  • Replace sleeps with condition-based waits
  • Wait for domain events, not just DOM changes
  • Assert that navigation or API completion actually happened
  • Reduce reliance on animations in critical flows
  • Use test-only hooks sparingly, and only when they make synchronization observable

For test data problems

  • Generate unique data per run
  • Create fixtures through APIs when possible
  • Reset or isolate databases between tests
  • Avoid shared mutable accounts
  • Remove ordering dependencies

For browser differences

  • Pin browser and runner versions
  • Standardize viewport, locale, and timezone
  • Use semantic locators
  • Avoid assuming pixel-perfect layout
  • Test in headed and headless modes when behavior differs

For environment drift

  • Run tests in the same container image as CI
  • Make all important config explicit
  • Log key environment variables at test startup
  • Normalize time, locale, and filesystem assumptions
  • Add health checks before the suite begins

Example: a flaky checkout test and its likely causes

Imagine a checkout test that passes locally but fails in CI on the assertion that the order confirmation page is visible.

A useful diagnosis might go like this:

  • If the click on “Place order” fails, the button may be covered by a toast or animation.
  • If the click works but the confirmation never appears, the backend may be slower in CI or a test record may be invalid.
  • If the confirmation appears in screenshots but the assertion fails, the selector may be brittle or the page may render differently in headless mode.
  • If the request returns 500 only in CI, the environment may be missing a secret, a payment stub, or a database migration.

That sequence is more actionable than simply calling the test flaky.

A compact CI failure checklist

Use this checklist when a test fails only in CI:

  • Does the failure disappear on rerun without code changes?
  • Is there any fixed sleep that could be replaced with a condition?
  • Does the test rely on shared state or pre-existing data?
  • Are the CI browser version and local browser version identical?
  • Are viewport, locale, and timezone explicit?
  • Is the test running in a container or environment similar to CI?
  • Are network calls, background jobs, or async UI transitions still in flight when the assertion runs?
  • Does the test use stable selectors and data-independent setup?
  • Is the failure actually exposing an application race condition?

If you cannot answer one of these, gather more evidence before changing code.

When to invest in better tooling

If your team spends a lot of time debugging flaky tests in CI, the problem may be partly tooling. Look for support in your test runner for trace capture, parallel isolation, retries with artifact collection, and easy container execution. For teams managing many suites across browsers and services, test management and CI observability become part of the reliability budget, not an afterthought.

The best tooling does not eliminate discipline. It makes the failure mode visible faster.

Final takeaway

When you ask why E2E tests fail only in CI, the answer is usually not one thing. It is typically a combination of slower timing, weaker data isolation, browser differences, and hidden environment assumptions. The fastest path to a fix is to classify the failure first, then debug the specific category with reproducible evidence.

If your team adopts one habit from this guide, make it this: collect enough data to compare local and CI runs step by step. That habit turns flaky tests in CI from a guessing game into a tractable engineering problem.