Flaky end-to-end tests are one of the fastest ways to erode trust in a CI pipeline. A suite that passes locally, fails on a rerun, and occasionally goes green after a minor code change is not just annoying; it trains teams to ignore signal. In GitHub Actions, browser tests can fail for reasons that never show up on a developer laptop: slower CPU, different fonts, missing system packages, network variance, viewport differences, timing issues, or test data that is not isolated enough.

If your goal is to stabilize flaky E2E tests in GitHub Actions, the fix is rarely a single retry flag. You need a disciplined process for isolating the failure mode, collecting enough evidence from CI, and then tightening the test and environment until the suite behaves deterministically. This guide walks through that process with practical steps for Playwright, GitHub Actions, and browser-heavy workflows.

A flaky test is usually a symptom, not the root cause. Treat it like a debugging problem, not a tolerance problem.

Why tests pass locally but fail in GitHub Actions

Local runs and CI runs are different execution environments, even when the code is identical. A browser test can be sensitive to small shifts in timing or rendering because it depends on asynchronous UI state, network responses, animation frames, hydration, and DOM updates. In CI, those dependencies often become less predictable.

Common differences include:

  • CPU and memory contention on hosted runners
  • Headless browser behavior versus local headed mode
  • Different screen sizes and device scale factors
  • Missing fonts or OS packages
  • Slower startup of app servers and test fixtures
  • Authentication flows that depend on external services
  • Test order dependence or shared state
  • Timeouts that are reasonable on a workstation but too tight for CI

If the same test passes locally and fails only in GitHub Actions, do not assume GitHub Actions is broken. Usually the test is making an assumption that local execution accidentally satisfies.

Start with the failure pattern, not the retry count

A common mistake in flaky test debugging is to add retries immediately. Retries can reduce noise, but they also hide the pattern. Before changing the test, answer three questions:

  1. Does the failure happen on the first run, only under load, or only on specific branches?
  2. Is the failure always the same assertion, or does it move around?
  3. Is the browser failing to find an element, timing out, or asserting the wrong state?

Those answers tell you whether the issue is more likely a selector problem, a synchronization problem, a data problem, or an environment problem.

For example:

  • "Timeout 30000ms exceeded while waiting for selector" usually points to timing, visibility, or routing issues
  • "Expected text to be ... but received ..." often points to stale state, race conditions, or cross-test contamination
  • "page crashed" or an exiting browser process often points to resource constraints or missing system dependencies

In other words, debug the failure class first, then decide whether a retry is acceptable as a temporary mitigation.
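
As a rough sketch of that triage step, the function below sorts an error message into one of the failure classes described above. The class names, the rule table, and the regular expressions are illustrative assumptions based on the message shapes quoted in this section. They are not a Playwright API, and real messages vary by version.

```typescript
// Failure classes taken from the discussion above; the names are illustrative.
type FailureClass = 'synchronization' | 'state' | 'environment' | 'unknown';

interface FailureRule {
  pattern: RegExp;
  failureClass: FailureClass;
}

// Hypothetical rule table: the first matching pattern wins.
const failureRules: FailureRule[] = [
  { pattern: /timeout \d+ms exceeded/i, failureClass: 'synchronization' },
  { pattern: /expected[\s\S]*received/i, failureClass: 'state' },
  {
    pattern: /page crashed|browser has been closed|browser process exit/i,
    failureClass: 'environment',
  },
];

// Returns the class of the first rule whose pattern matches, or 'unknown'.
function classifyFailure(message: string): FailureClass {
  for (const rule of failureRules) {
    if (rule.pattern.test(message)) {
      return rule.failureClass;
    }
  }
  return 'unknown';
}
```

Tagging each CI failure this way, even by hand in a spreadsheet, makes it easier to see whether one class dominates before you reach for a retry.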

Make GitHub Actions tell you more

The default CI output is usually not enough for browser test failures. You want logs, traces, screenshots, videos, and the exact browser context that failed.

With Playwright, turn on the artifacts that let you reconstruct the failure later:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Keep traces, screenshots, and videos only for failing tests
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

Then make sure GitHub Actions uploads those artifacts even when the job fails:

- name: Upload Playwright artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: playwright-artifacts
    path: |
      playwright-report/
      test-results/

That combination is often enough to answer questions like:

  • Did the element exist but remain hidden?
  • Was the app still loading data when the assertion ran?
  • Did navigation happen before the click completed?
  • Did the wrong environment variable point to a staging API instead of a mocked endpoint?

If you only keep one debugging habit, make it artifact collection.

Inspect the CI runtime as part of the test

Many teams assume the runner is a neutral execution box. It is not. The browser and the app run inside a specific operating system image with specific packages and browser versions.

Capture the runtime details early in the job:

- name: Show environment details
  run: |
    node --version
    npm --version
    npx playwright --version
    uname -a
    cat /etc/os-release

This helps when failures correlate with a browser update, an image refresh, or a package mismatch. It also helps when a test fails because a dependency such as a font renderer, shared library, or media package is missing.

If your application relies on specific browser dependencies, install them explicitly rather than assuming the runner image contains everything you need. On GitHub-hosted Linux runners, follow Playwright's official install guidance (npx playwright install --with-deps installs the browsers together with their OS packages) and avoid ad hoc system changes unless you know why they are needed.

Align local and CI browser settings

A large class of flaky E2E tests comes from environmental drift. The test is not really flaky; it is under-specified. It passes in one browser context and fails in another.

Check these settings first:

  • viewport size
  • locale and timezone
  • device scale factor
  • headless versus headed mode
  • browser channel and version
  • permissions, geolocation, and storage state

For example, a responsive layout can change button position or hide controls in a narrower viewport. A date picker can render different days if the timezone differs. A tooltip assertion can fail when the app is rendered with a different font stack.

A useful pattern is to set explicit defaults in the test config rather than relying on local browser state:

use: {
  viewport: { width: 1440, height: 900 },
  locale: 'en-US',
  timezoneId: 'UTC'
}

This does not eliminate every issue, but it reduces accidental variability and makes failures reproducible.

Prefer explicit waits over sleep, but do not overuse waits either

Blind delays are one of the most common sources of CI unreliability. A waitForTimeout(2000) may hide a race locally while still failing on a slower runner. The opposite mistake is just as bad: a chain of overly specific waits that wait for the wrong thing.

The goal is to wait for the application state that actually matters.

Instead of waiting for a fixed delay, wait for:

  • a visible element that signals the page is ready
  • a network request or response that must complete
  • a URL change that confirms navigation succeeded
  • a stable text value that reflects actual data load

Example:

await page.getByRole('button', { name: 'Save' }).click();
await expect(page.getByText('Saved successfully')).toBeVisible();

This is better than sleeping because the assertion itself becomes the synchronization point.

If a wait is not tied to observable app state, it is usually guessing.

Fix selectors before you tune retries

A flaky locator often looks like a timing issue. In reality, the test is selecting the wrong node or a node that changes too often. If your selectors are brittle, CI makes that brittleness easier to see.

Prefer stable locators in this order:

  1. Accessible roles and names
  2. Test IDs that are intentionally added for automation
  3. Data attributes that do not change with UI redesigns
  4. Text selectors only when the text is stable

For Playwright, role-based locators are usually the first option:

await page.getByRole('button', { name: 'Checkout' }).click();

Avoid selectors tied to CSS structure, deeply nested DOM paths, or generated class names. Those tend to break when layout, component libraries, or rendering strategies change.

When a selector fails in CI, ask whether the element is absent, duplicated, hidden, or replaced by another state. The failure mode matters. A good debugging habit is to inspect the trace and DOM snapshot before changing the test code.

Reduce shared state between tests

Shared state is a frequent source of test nondeterminism. A suite can appear stable until GitHub Actions executes tests in parallel, in a different order, or with a different worker count.

Common shared-state problems include:

  • reused accounts or seeded records
  • the same email address or username across tests
  • app state stored in localStorage or cookies across tests
  • backend records created in one test and cleaned up too late
  • test data reset that depends on eventual consistency

To stabilize the suite, make each test own its data:

  • create a unique user per test run
  • seed data through an API fixture
  • isolate storage state per spec or per worker
  • clean up resources by ID instead of by query broad enough to affect other runs

If your app has a backend API, create test setup and teardown paths that are deterministic. If you cannot fully isolate the environment, at least namespace the data by run ID or worker ID.
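
One way to namespace data is sketched below, assuming you build test identities from a run ID, a worker index, and the test title. The function names, the "e2e-" prefix, and the example.test domain are hypothetical. In a real suite the run ID could come from the GITHUB_RUN_ID environment variable that GitHub Actions sets, and the worker index and title from Playwright's testInfo.workerIndex and testInfo.title.

```typescript
interface TestIdentity {
  username: string;
  email: string;
}

// Lowercases the value and collapses anything that is not a-z or 0-9 into single dashes.
function slugify(value: string): string {
  return value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Builds an identity that is unique per CI run, per worker, and per test title,
// so parallel workers and overlapping runs never share an account.
function makeTestIdentity(runId: string, workerIndex: number, testTitle: string): TestIdentity {
  const username = `e2e-${slugify(runId)}-w${workerIndex}-${slugify(testTitle)}`;
  // "example.test" is a placeholder domain chosen for this sketch.
  return { username, email: `${username}@example.test` };
}
```

Because the run ID and worker index are embedded in the name, cleanup can delete records by that exact prefix instead of using a broad query that might touch another run's data.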

Use retries as a signal, not a permanent crutch

GitHub Actions can rerun failed jobs, and Playwright can retry failed tests. Those features are useful, but they should be treated differently.

Playwright retries help you identify whether the failure is transient. If a test passes on retry, you still need to know why it failed first. A CI job retry, on the other hand, may confirm that the environment had a temporary problem, but it does not tell you whether the test itself is robust.

A reasonable approach is:

  • use a small number of retries while investigating
  • collect traces and screenshots on first failure
  • track which tests need retries repeatedly
  • remove retries once the root cause is fixed

Here is a minimal Playwright retry setting:

retries: process.env.CI ? 2 : 0

This can keep the pipeline usable while you debug, but it should not become the long-term definition of success. If a test needs retries every week, it is still flaky.
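
To track which tests lean on retries, you can aggregate outcomes across runs. The sketch below assumes a simplified, hypothetical record shape (title, attempts, passed) that you would populate from your own reporter output; it is not the schema of any Playwright reporter.

```typescript
// Hypothetical record: one entry per test per CI run.
interface TestOutcome {
  title: string;
  attempts: number; // total attempts, including the first
  passed: boolean; // final status after all attempts
}

// Counts, per test title, how many runs passed only after at least one retry.
function countRetriedPasses(outcomes: TestOutcome[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const outcome of outcomes) {
    if (outcome.passed && outcome.attempts > 1) {
      counts[outcome.title] = (counts[outcome.title] || 0) + 1;
    }
  }
  return counts;
}

// Returns the sorted titles that needed a retry in at least `threshold` runs.
function repeatOffenders(outcomes: TestOutcome[], threshold: number): string[] {
  const counts = countRetriedPasses(outcomes);
  return Object.keys(counts)
    .filter((title) => counts[title] >= threshold)
    .sort();
}
```

A test that shows up in this list week after week is a candidate for a root-cause fix, not a larger retry count.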

Watch for browser timing issues in modern frontends

Browser timing issues are more common in component-heavy apps that use hydration, transitions, client-side routing, optimistic UI, or lazy loading. In CI, these can all stretch just enough to expose race conditions.

Typical examples:

  • clicking before a button is truly enabled
  • asserting text before the request completes
  • reading a value before a debounced update lands
  • waiting for a route change when the app uses SPA navigation without a full reload
  • testing animation state instead of final state

A safer pattern is to wait for the semantic end state, not the transition itself. For example, if a button click triggers a save, verify that the save completes by waiting for the success indicator or updated record, not by assuming the click is enough.

If animations interfere with stability, consider disabling them in test mode. That is especially useful when the visual transition is not part of the behavior under test.

Make GitHub Actions jobs easier to reproduce locally

A reliable CI setup is one you can reproduce on your own machine. If a failure only exists in the cloud, you will spend more time guessing.

Make your workflow explicit and close to local commands:

name: e2e

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e

This makes the CI path obvious, and it doubles as a local reproduction checklist:

  • run the same Node version
  • install dependencies with npm ci
  • install the same browsers
  • use the same test command
  • match the same environment variables

If a failure occurs only in GitHub Actions, compare the local and CI commands line by line. Differences in package installation, browser install flags, or environment variables are often the real cause.
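
For the environment-variable part of that comparison, a small helper can do the diff for you. This is a minimal sketch: the function name and the idea of passing two captured environments plus a list of keys are assumptions for illustration, and how you capture the CI values (for example, from a logged step) is left to you.

```typescript
type EnvSnapshot = Record<string, string | undefined>;

interface EnvDifference {
  key: string;
  local: string | undefined;
  ci: string | undefined;
}

// Returns one entry for every listed key whose value differs between the two
// snapshots, including keys that are set on one side and missing on the other.
function diffEnvironments(local: EnvSnapshot, ci: EnvSnapshot, keys: string[]): EnvDifference[] {
  return keys
    .filter((key) => local[key] !== ci[key])
    .map((key) => ({ key, local: local[key], ci: ci[key] }));
}
```

Restricting the diff to an explicit key list keeps secrets and unrelated shell variables out of the output.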

Add job-level diagnostics for network and app startup

Sometimes the browser is not the problem. The app under test starts too slowly, the API is unavailable, or the test runs before a service is ready.

Useful diagnostics include:

  • app server health checks before launching tests
  • logs from the frontend dev server or production container
  • API response codes from test setup calls
  • startup time for seeded services or mocks

If the application runs in a separate step or container, add a readiness check before E2E begins. For example, wait for the server to respond before launching browser tests.

for i in {1..30}; do
  curl -fsS http://localhost:3000/health && exit 0
  sleep 2
done
exit 1

That is preferable to hoping the browser waits long enough while the app boots. It also makes the failure point clearer when the app never becomes ready.
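
If Playwright itself starts the app, its webServer setting can perform the same readiness check. The sketch below assumes a start script named "start" and a /health endpoint on port 3000, matching the shell loop above; adjust both to your project.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  webServer: {
    // Assumed start command; replace with whatever boots your app.
    command: 'npm run start',
    // Playwright waits for this URL to respond before running tests.
    url: 'http://localhost:3000/health',
    // Give slower CI runners up to two minutes to boot the app.
    timeout: 120_000,
    // Reuse an already-running dev server locally, but always start fresh in CI.
    reuseExistingServer: !process.env.CI,
  },
  use: {
    baseURL: 'http://localhost:3000',
  },
});
```

With this in place, a server that never becomes ready fails the run at startup with a clear message, not as a scattering of test timeouts.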

Use artifacts to debug without rerunning blindly

Artifact retention changes how quickly you can diagnose flaky tests. Without artifacts, every failure becomes a rerun. With artifacts, the first failure is often enough.

The most useful artifacts are:

  • Playwright trace files
  • screenshots on failure
  • videos for complex interaction bugs
  • test reports with timing information
  • job logs that include environment details

If a test intermittently fails on a visual assertion, compare screenshots from passing and failing runs. If a test fails after a click, open the trace and step through the exact browser state. If a test fails only in CI and not locally, inspect whether the DOM changed because of hidden overlays, loading spinners, or race conditions from network calls.

Decide when a retry is acceptable

Retries are not always wrong. The question is whether a retry masks a true product issue, a test issue, or an external dependency issue.

Retries are usually acceptable when:

  • a third-party service occasionally flakes and you cannot control it
  • the test validates a noncritical path and the retry rate is low
  • the retry is temporary while a root cause is being fixed

Retries are usually not acceptable when:

  • the same spec needs retries often
  • the failure is deterministic in a specific environment
  • the assertion is checking the wrong state
  • the suite depends on shared mutable data

A retry policy should be visible, documented, and reviewed. Otherwise a CI system can drift into silent instability where the pipeline appears green while actually hiding a recurring defect.

A practical stabilization checklist

If you need to stabilize flaky E2E tests in GitHub Actions quickly, use this checklist in order:

  1. Reproduce the failure in CI with full artifacts enabled
  2. Capture browser traces, screenshots, and videos on failure
  3. Log environment details, browser version, and runtime image
  4. Replace sleep-based waits with state-based assertions
  5. Tighten selectors to roles, labels, or stable test IDs
  6. Isolate test data and remove shared state
  7. Align viewport, locale, timezone, and browser versions
  8. Add a small retry budget only as a temporary guardrail
  9. Verify app readiness before browser tests begin
  10. Remove or reduce retries once the underlying issue is fixed

This order matters because retries and timeouts are the easiest knobs to turn, but they are not usually the first knobs that should be turned.

What a stable E2E suite looks like

A stable browser test suite is not one that never fails. It is one that fails for understandable reasons, surfaces good diagnostics, and behaves consistently across local and CI environments.

You know you are getting closer when:

  • failures point to a specific selector, request, or state transition
  • traces explain what the browser saw before the assertion failed
  • the same test fails the same way across reruns when a bug exists
  • the suite no longer depends on runner speed or lucky timing
  • retries become rare instead of routine

That is the real goal of CI reliability: not to eliminate every source of variance, but to make variance visible, bounded, and actionable.

Final thoughts

When browser tests pass locally and fail in GitHub Actions, the issue is usually some combination of timing, environment drift, test data isolation, and missing diagnostics. The fastest path to stability is not adding more retries. It is building enough observability to understand the failure, then tightening the test so it waits on the right thing and asserts the right behavior.

If you approach flaky E2E tests as a system problem, you will usually find that the fix is practical: better selectors, clearer readiness checks, fewer shared dependencies, and CI artifacts that tell the full story. That is how teams move from reactive reruns to dependable automation.

For deeper background on the execution model behind these pipelines, the official docs for GitHub Actions and the broader concepts of continuous integration and test automation are useful references.