Deployment Troubleshooting Checklist

A reusable deployment troubleshooting checklist for failed releases, broken CI/CD runs, DNS issues, and rollback decisions.

When a release fails, the team usually loses time in two places: guessing and switching context. This checklist is designed to reduce both. Use it as a repeatable troubleshooting guide for failed launches, broken CI/CD runs, partial rollouts, and emergency rollbacks. Instead of starting from symptoms alone, work from the top down: confirm what changed, identify where the failure occurred, isolate the blast radius, and verify the supporting layers around your app, including DNS, networking, secrets, build artifacts, runtime configuration, and health checks. The goal is not just to fix one failed deployment, but to make the next incident faster to diagnose.

Overview

This article gives you a reusable checklist for failed releases that you can return to whenever production, staging, or preview deployments break. The checklist is organized in the order that tends to save the most time during incident response.

Before changing anything, pause and answer four basic questions:

What changed? Identify the commit, image tag, workflow run, infrastructure change, DNS update, or secret rotation associated with the failure.
Where did it fail? Was the failure in build, test, artifact publishing, deployment, startup, routing, DNS, SSL, or application runtime?
Who is affected? Confirm whether the problem impacts all traffic, one environment, one region, one hostname, or only background workers.
What is the safest next move? Decide whether to roll back immediately, freeze changes, route traffic away, or continue investigating with the current release in place.

A useful mental model is to separate deployment problems into layers:

Pipeline layer: CI jobs, test steps, build steps, packaging, artifact upload, registry access.
Infrastructure layer: compute instances, containers, orchestration, storage, permissions, networking, reverse proxy, load balancer.
Delivery layer: DNS, CDN, SSL, routing rules, health checks, traffic shifting.
Application layer: environment variables, migrations, startup commands, runtime dependencies, external APIs, feature flags.

If you classify the failure early, you avoid wasting time debugging the wrong system.
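
As a minimal sketch of that early classification, the lookup below maps the failure points named in the questions above to the layer to inspect first. The names LAYER_BY_STAGE and classify_failure are hypothetical, and placing "deployment" under infrastructure and "startup" under application is an assumption you may want to adjust for your own stack.

```python
# Hypothetical mapping from where a failure shows up to the layer to debug first.
# Stage names come from the "Where did it fail?" question; the grouping is an assumption.
LAYER_BY_STAGE = {
    "build": "pipeline",
    "test": "pipeline",
    "artifact publishing": "pipeline",
    "deployment": "infrastructure",
    "routing": "delivery",
    "dns": "delivery",
    "ssl": "delivery",
    "startup": "application",
    "application runtime": "application",
}


def classify_failure(stage: str) -> str:
    """Return the layer to investigate first, or 'unknown' for unmapped stages."""
    return LAYER_BY_STAGE.get(stage.strip().lower(), "unknown")


print(classify_failure("DNS"))
```

An "unknown" result is a useful signal in itself: it usually means the failure point has not been pinned down yet.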

Checklist by scenario

Use the scenario that matches the symptom you see first. If multiple symptoms overlap, start with the earliest failure point in the chain.

1. The CI/CD pipeline failed before deployment started

Open the exact workflow run, not a later rerun, and identify the first failing step.
Confirm whether the failure is deterministic or intermittent by checking recent runs on the same branch or commit.
Review changes to build tooling, lockfiles, package manager versions, Docker base images, and CI runner images.
Check whether required secrets or environment variables are available in that environment and branch scope.
Verify repository permissions for package registries, artifact stores, and deployment tokens.
Confirm that caches did not hide a dependency issue. Retry with caches disabled if needed.
Check for expired credentials used to publish images, packages, or static assets.
Compare the failing workflow with the last successful run to spot configuration drift.
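
One way to compare a failing run with the last successful one is to capture the same facts from each (tool versions, base images, runner labels) and diff them. The sketch below assumes you have already collected those facts into plain dictionaries; the function name, keys, and values are invented for illustration.

```python
def config_drift(last_good: dict, failing: dict) -> dict:
    """Return {key: (last_good_value, failing_value)} for every key that differs.

    A key missing from one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(last_good) | set(failing)):
        before, after = last_good.get(key), failing.get(key)
        if before != after:
            drift[key] = (before, after)
    return drift


# Hypothetical snapshots of two workflow runs.
last_good = {"node": "20.11.0", "base_image": "node:20-slim", "runner": "ubuntu-22.04"}
failing = {
    "node": "20.11.0",
    "base_image": "node:22-slim",
    "runner": "ubuntu-24.04",
    "cache": "restored",
}

for key, (before, after) in config_drift(last_good, failing).items():
    print(f"{key}: {before} -> {after}")
```

Keeping the output limited to differences makes it easier to spot a single changed base image or runner among dozens of unchanged settings.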

If your deployment process uses GitHub Actions, it helps to keep the workflow definition small and version-pinned. For a more structured build-and-release flow, see GitHub Actions Deployment Guide: Build, Test, and Deploy Web Apps Reliably.

2. The deployment completed, but the app will not start

Check container logs, process logs, or service manager logs for startup exceptions.
Confirm the startup command or entrypoint has not changed unexpectedly.
Verify environment variables are present and correctly named. A typo in one secret can look like a broader outage.
Check runtime compatibility: language version, binary architecture, system libraries, or container image mismatch.
Verify application ports. Confirm the app is listening on the port expected by the platform, proxy, or container runtime.
Check whether the release depends on a migration or background service that failed earlier.
Validate file permissions and mounted paths for uploads, temp files, certificates, or compiled assets.
Review memory and CPU limits. Some apps fail only after deployment because they cannot initialize within resource limits.
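
For the port check in this list, a quick probe run from the same host or network as the proxy can confirm whether anything accepts connections on the expected port. This is a sketch using only the Python standard library; the host and port in the usage line are placeholders, and a successful TCP connect does not prove that the app answers requests correctly.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the port your platform or proxy expects.
    print(is_port_open("127.0.0.1", 8080))
```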

If your app runs behind Nginx, trace both the upstream process and the proxy rules. This is a common place for partial failures. See Nginx Reverse Proxy Setup Guide for Node.js, Docker, and SSL.

3. The app is running, but users get 502, 503, or timeout errors

Confirm whether the proxy or load balancer can reach healthy upstream instances.
Check health check paths, expected status codes, and timeout thresholds.
Verify service discovery or upstream hostnames still resolve correctly.
Inspect application response times after startup. A deployment may be technically live but too slow to pass health checks.
Check database connection pool limits and downstream API rate limits.
Look for long-running migrations or cold starts blocking request handling.
Review reverse proxy buffer limits, header sizes, and websocket settings if relevant.
Compare error rates by endpoint. A broad timeout pattern often points to infrastructure; one route failing often points to app logic.
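
The endpoint comparison in the last item can be sketched as a small aggregation. The example assumes you can extract (endpoint, status code) pairs from access logs or a metrics export; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def error_rate_by_endpoint(requests):
    """Map each endpoint to the fraction of its requests that returned a 5xx status.

    `requests` is an iterable of (endpoint, status_code) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for endpoint, status in requests:
        totals[endpoint] += 1
        if 500 <= status <= 599:
            errors[endpoint] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


# Invented sample: one route is failing while the others are healthy.
sample = [
    ("/api/orders", 502),
    ("/api/orders", 200),
    ("/api/users", 200),
    ("/health", 200),
]
print(error_rate_by_endpoint(sample))
```

If every endpoint shows a similar error rate, look at the proxy, network, or shared dependencies first; if one endpoint stands out, start with the code path behind it.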

4. The new release is live, but the wrong version is being served

Confirm image tags, release IDs, and build artifacts actually match the intended commit.
Check whether the deployment platform reused a previous artifact because of caching or an incomplete rebuild.
Verify traffic routing rules across environments. Staging and production aliases are easy to mix up.
Check CDN and edge cache behavior for static assets and HTML responses.
Confirm that cache-busting file names changed as expected after the build.
Review blue-green or canary routing settings. Some traffic may still be pinned to an older target.
Check feature flags. Users may report an old experience when the code is current but the flag state is not.
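
Matching a deployed build to the intended commit often means comparing a short SHA from an image tag against a full SHA from the repository. The helper below is a minimal sketch of that comparison; the function name and the seven-character minimum are assumptions, and the SHAs are made up.

```python
def same_commit(deployed_sha: str, intended_sha: str, min_len: int = 7) -> bool:
    """Return True if one SHA is a prefix of the other (case-insensitive).

    SHAs shorter than min_len are rejected as too ambiguous to compare.
    """
    deployed = deployed_sha.strip().lower()
    intended = intended_sha.strip().lower()
    if min(len(deployed), len(intended)) < min_len:
        return False
    return deployed.startswith(intended) or intended.startswith(deployed)


# Made-up values: a short SHA from an image tag and a full SHA from the repository.
print(same_commit("3f2a9c1", "3f2a9c1d8e4b7a6c5d4e3f2a1b0c9d8e7f6a5b4c"))
```

This only confirms that the labels agree. Whether the artifact was actually built from that commit still has to be verified against the build record.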

5. The app works internally, but the domain does not resolve or routes incorrectly

Confirm the domain points to the correct nameservers and provider.
Verify A, AAAA, CNAME, and TXT records in the authoritative zone, not just in a local DNS cache.
Check whether the hostname is proxied, flattened, or otherwise transformed by the DNS provider.
Review TTL values and recent DNS changes that may still be propagating.
Confirm that subdomains and root domains are configured separately where required.
Check for typo-level mistakes: truncated or incomplete record values, duplicate records, or records created in the wrong zone.
Validate that the hosting platform expects a CNAME, A record, or nameserver delegation for the specific setup.
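
As a first-pass check, you can compare what the current machine resolves against the addresses you expect. The sketch below uses the operating system's resolver through Python's socket module, so it reflects local caches and local resolver settings; it complements a query against the authoritative zone rather than replacing one. The function names and the example address are hypothetical.

```python
import socket


def resolved_addresses(hostname: str) -> set[str]:
    """Return the IP addresses the local resolver gives for hostname (empty if none)."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry ends with a sockaddr tuple whose first element is the address.
    return {info[4][0] for info in infos}


def points_to(hostname: str, expected: set[str]) -> bool:
    """Return True if hostname resolves and every resolved address is expected."""
    resolved = resolved_addresses(hostname)
    return bool(resolved) and resolved <= expected


if __name__ == "__main__":
    # Placeholder hostname and address: replace with your domain and server IPs.
    print(points_to("localhost", {"127.0.0.1", "::1"}))
```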

For domain-specific debugging, pair this checklist with How to Use Dig, Nslookup, and Whois to Troubleshoot Domain Problems and How to Point a Domain to a Server: A Record, CNAME, Nameservers, and TTL Explained.

6. The site resolves, but SSL or HTTPS breaks after release

Confirm the certificate covers the hostname being served.
Check whether the certificate was issued for the current DNS and proxy setup.
Verify HTTP to HTTPS redirects do not create loops between app, proxy, and CDN.
Confirm origin certificates, full-chain bundles, and key files are in the expected paths.
Review TLS mode settings if a CDN or proxy sits in front of the origin.
Check whether a new environment or subdomain was added without corresponding certificate automation.
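
To illustrate the hostname coverage check, the sketch below tests a hostname against the DNS names in a certificate's subjectAltName field. It assumes the certificate is a dictionary in the shape returned by Python's ssl.SSLSocket.getpeercert(), and it handles only exact names and a single leftmost wildcard label. It is a simplified illustration, not a full implementation of certificate name validation, and cert_covers is a hypothetical helper name.

```python
def cert_covers(hostname: str, cert: dict) -> bool:
    """Return True if hostname matches a DNS entry in cert['subjectAltName']."""
    host = hostname.lower().rstrip(".")
    for kind, name in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        name = name.lower()
        if name == host:
            return True
        if name.startswith("*.") and "." in host:
            # A wildcard stands for exactly one label:
            # "*.example.com" covers "a.example.com" but not "a.b.example.com".
            if host.split(".", 1)[1] == name[2:]:
                return True
    return False


# Hand-written certificate data for illustration only.
cert = {"subjectAltName": (("DNS", "example.com"), ("DNS", "*.example.com"))}
print(cert_covers("www.example.com", cert))
```

A new subdomain that sits two levels below a wildcard name is a common way for a release to end up with a hostname the certificate does not cover.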

If your failure appeared right after a DNS or hosting change, see How to Fix ERR_SSL_PROTOCOL_ERROR After DNS or Hosting Changes.

7. The deployment broke background jobs, queues, or scheduled tasks

Confirm worker processes deployed alongside the web app and are running the new code.
Check queue connection strings, credentials, and network access.
Review message schema changes that may be incompatible with older workers or producers.
Check cron or scheduler definitions that may still target old paths or commands.
Confirm that jobs are idempotent, so that retries after a rollout do not repeat side effects or fail again in a loop.
Review dead-letter queues and retry spikes for clues that the issue began before users noticed it.
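
The idempotency item can be illustrated with a small guard that skips jobs whose ID has already completed. This is a sketch under a strong assumption: the set of processed IDs lives in memory here, whereas a real worker would need a durable store shared by all workers. The names run_once and processed are hypothetical.

```python
def run_once(job_id: str, handler, processed: set) -> bool:
    """Run handler for job_id unless that ID has already completed.

    Returns True if the handler ran and False if the job was skipped as a
    duplicate. The ID is recorded only after the handler succeeds, so a failed
    attempt can still be retried.
    """
    if job_id in processed:
        return False
    handler()
    processed.add(job_id)
    return True


sent = []
done = set()
run_once("job-1", lambda: sent.append("email"), done)
run_once("job-1", lambda: sent.append("email"), done)  # duplicate delivery is skipped
print(sent)
```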

8. The deployment failed after a database change

Check whether migrations ran successfully and in the right environment.
Confirm the application version is compatible with both pre-migration and post-migration states if you use rolling deploys.
Review migration duration, locks, and timeouts.
Check for destructive changes such as dropped columns, renamed tables, or constraint updates that old code still expects.
Verify database user permissions for schema changes and runtime access.
Decide early whether rollback is safe. Not every schema change is easily reversible.
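
Before deciding on a rollback, it can help to scan the migration for statements that old code is unlikely to survive. The sketch below is a plain keyword scan, not a SQL parser: it splits on semicolons and looks for DROP and RENAME forms, so it will miss dialect-specific syntax and can be confused by semicolons inside string literals. The pattern, function name, and sample SQL are illustrative assumptions.

```python
import re

# Keyword patterns for changes that old application code may still depend on.
DESTRUCTIVE = re.compile(
    r"\b(?:DROP\s+(?:TABLE|COLUMN)|RENAME\s+(?:TO|COLUMN|TABLE))\b",
    re.IGNORECASE,
)


def destructive_statements(sql: str) -> list[str]:
    """Return the statements in sql that contain a DROP or RENAME keyword pattern."""
    return [stmt.strip() for stmt in sql.split(";") if DESTRUCTIVE.search(stmt)]


# Invented migration for illustration.
migration = """
ALTER TABLE users ADD COLUMN nickname text;
ALTER TABLE users DROP COLUMN legacy_id;
ALTER TABLE orders RENAME TO purchases;
"""

for stmt in destructive_statements(migration):
    print("review before rollback:", stmt)
```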

9. The release works in staging but fails in production

Diff environment variables between environments.
Compare network boundaries, IP allowlists, VPC access, and firewall rules.
Review production-only middleware, rate limits, CDN rules, or WAF behavior.
Confirm that production data volume is not triggering query plans, memory pressure, or timeouts that never appear in staging.
Check whether production uses different third-party credentials, callback URLs, or webhook endpoints.
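
Diffing environment variables is safer when the comparison reports names only and never prints values. The sketch below assumes each environment's variables have been exported into a dictionary; it flags names present on one side only, names that differ only by case, and production values that are empty. The function name and the sample data are hypothetical.

```python
def env_name_diff(staging: dict, production: dict) -> dict:
    """Compare variable names between two environments without exposing values."""
    only_staging = set(staging) - set(production)
    only_production = set(production) - set(staging)
    # Names that match except for letter case are a likely typo.
    case_mismatches = sorted(
        (s, p)
        for s in only_staging
        for p in only_production
        if s.lower() == p.lower()
    )
    return {
        "only_in_staging": sorted(only_staging),
        "only_in_production": sorted(only_production),
        "case_mismatches": case_mismatches,
        "empty_in_production": sorted(
            name for name, value in production.items() if not str(value).strip()
        ),
    }


# Invented variables for illustration.
staging = {"DATABASE_URL": "x", "Api_Key": "a", "SENTRY_DSN": "d"}
production = {"DATABASE_URL": "y", "API_KEY": "", "CDN_HOST": "c"}

for section, names in env_name_diff(staging, production).items():
    print(section, names)
```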

What to double-check

These are the items teams often assume are fine because they were fine yesterday. In deployment troubleshooting, assumptions are expensive.

Release identity

The commit SHA in the deployment matches the commit you intended to release.
The artifact or container image was built from that commit, not merely tagged with the same version label.
The environment received the latest artifact and did not fall back to an earlier one.

Secrets and configuration

Secrets exist in the target environment, are not empty or expired, and have not been rotated without a redeploy.
Variable names match exactly, including case.
Environment-specific values such as API base URLs, database hosts, and bucket names are correct.

Network and routing

Security groups, firewall rules, and internal routing still allow required traffic.
Reverse proxy upstreams point to the current service name and port.
Health check paths are lightweight and return expected codes quickly.

DNS and hostname setup

The domain is managed in the expected provider and zone.
Records exist at the authoritative source, not only in a local cache.
Recent provider changes, proxy settings, and TTL values are accounted for.

If DNS configuration is part of your release process, it helps to standardize changes as code. See Terraform DNS Records Guide: Manage Cloudflare and Route 53 as Code and, for provider selection context, Best DNS Providers for Developers: Cloudflare vs Route 53 vs Namecheap vs Others.

Observability

Logs are available for both old and new instances.
Metrics cover deployment time, startup time, error rate, latency, and health check failures.
Alerts are not only firing but also pointing to the right service and environment.

Rollback readiness

You know the last good version.
You know whether rollback is safe with the current database state.
You know what manual cleanup is required if the failed release partially applied changes.

Common mistakes

This section highlights patterns that repeatedly slow down teams trying to resolve failed deployments.

Changing multiple things at once during diagnosis. If you redeploy, edit DNS, rotate secrets, and restart services in one burst, you lose the ability to identify the real cause.
Debugging from user reports instead of the release record. Start with the deployment timeline. User symptoms often appear later than the actual failure.
Assuming the app is broken when the delivery layer is broken. A healthy app behind a bad DNS record, SSL mismatch, or proxy misroute can look fully down from the browser.
Ignoring the first failing event. Secondary errors pile up quickly. The earliest error in the chain is usually the most valuable one.
Trusting local cache behavior. DNS caches, browser caches, package caches, and container layers can all mask the current state.
Skipping environment parity checks. “Works in staging” often means staging is different in ways that matter more than the code itself.
Rolling back without checking migration compatibility. A fast rollback can turn a limited issue into a data problem.
Not documenting the fix path. Incidents repeat. If the resolution stays in chat logs only, the next failure starts from zero again.

For container-based apps, keeping deployment steps predictable matters as much as keeping them simple. If you need a baseline production workflow, Docker Deployment Tutorial for Small Production Apps is a useful companion.

When to revisit

This checklist is most valuable when you treat it as a living operational document. Revisit and update it before the next high-risk release, not after the next outage.

Review this checklist when any of the following changes happen:

You adopt a new CI/CD platform, runner image, or deployment workflow.
You change DNS providers, hosting platforms, reverse proxies, or load balancers.
You add new environments, regions, subdomains, or SSL termination points.
You introduce database migration tooling, background job systems, or feature flag platforms.
You move secrets management to a new system or rotate critical credentials.
You update on-call procedures or incident ownership.
You are preparing for a high-traffic period, planned launch, or seasonal release cycle.

A practical way to operationalize this article is to turn it into a one-page preflight and incident document:

Create a release template with fields for commit SHA, artifact ID, migration status, feature flags, DNS changes, and rollback target.
Store links to logs, dashboards, proxy configs, and workflow runs in one place.
Decide in advance who can approve rollback, who can change DNS, and who owns database actions.
After each incident, add one new check that would have shortened diagnosis.
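
The release template from the first item could start as a small record type that reports which fields are still blank. This is a sketch only: the class name, the field names, and the choice of which fields count as required are assumptions based on the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseRecord:
    """One release's identifying details, filled in before and during rollout."""

    commit_sha: str = ""
    artifact_id: str = ""
    migration_status: str = ""
    rollback_target: str = ""
    feature_flags: list[str] = field(default_factory=list)
    dns_changes: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Return the names of required fields that are still blank."""
        required = ("commit_sha", "artifact_id", "migration_status", "rollback_target")
        return [name for name in required if not getattr(self, name).strip()]


# Placeholder values for illustration.
record = ReleaseRecord(commit_sha="3f2a9c1", artifact_id="web-build-1")
print(record.missing())
```

A record that still has blank required fields is a reasonable signal to pause the release until someone fills them in.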

If your team repeatedly hits domain-related release issues, keep separate runbooks for DNS resolution, SSL errors, and proxy routing. Helpful references include How to Fix DNS_PROBE_FINISHED_NXDOMAIN for Websites, APIs, and Local Development and, if email delivery is affected by domain changes during a launch, How to Set Up MX Records for Custom Email Domains.

The main habit to build is simple: during a failed deployment, work the checklist in order, record what you verified, and avoid improvising changes until you know which layer failed. That discipline shortens incidents and improves every release after it.
