DevOps Interview Questions (2026): 40 Real Questions + What Interviewers Grade

DevOps interviews are deceptive. The job title sounds like "ops with automation," but senior interviewers are grading something harder: whether you think in systems, own outcomes end-to-end, and have operated under real production pressure — not just configured pipelines in a sandbox.

The 40 questions below are drawn from real DevOps loops at companies from seed-stage startups to FAANG. For each one, the "what the interviewer is actually grading" note tells you the hidden bar — the thing that separates a hire from a strong no-hire.

CI/CD and Pipelines

1. Walk me through a CI/CD pipeline you designed from scratch. What decisions did you make, and what would you change?

What the interviewer grades: Ownership and trade-off reasoning. They want to hear that you chose specific tools for specific reasons (not just "GitHub Actions because everyone uses it"), that you made deliberate decisions about caching, parallelism, and artifact management, and that you can identify real weaknesses in your own design. Candidates who describe a pipeline they inherited but claim full design credit are caught quickly.

2. How do you handle secrets in a CI/CD pipeline?

What the interviewer grades: Security posture and production-readiness thinking. The passing bar: no plaintext secrets in env vars or repo; secrets injected at runtime via a secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager) or the CI system's encrypted secret store. They're also listening for "and we audit secret access" — that's senior signal.

3. A deployment to production takes 45 minutes. How would you reduce it?

What the interviewer grades: Structured diagnosis. Strong candidates immediately ask: "Where does the time go?" before proposing solutions. They think in phases (test, build, push, deploy) and identify the real bottleneck before prescribing layer caching, parallel test splitting, incremental builds, or selective deploys. Proposing random optimizations without profiling first is a red flag.
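The "where does the time go?" step can be sketched as a tiny script over per-phase timings. This is a minimal illustration, assuming you can export a duration per phase from your CI system; the function name `rank_phases` and the numbers are hypothetical, not measurements from a real pipeline.

```python
from typing import Dict, List, Tuple


def rank_phases(durations_min: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (phase, minutes, share_of_total) tuples, slowest phase first."""
    total = sum(durations_min.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    ranked = sorted(durations_min.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, minutes, minutes / total) for name, minutes in ranked]


# Hypothetical timings for a 45-minute deploy (illustrative only).
example = {"test": 27.0, "build": 9.0, "push": 4.0, "deploy": 5.0}

for name, minutes, share in rank_phases(example):
    print(f"{name:<7} {minutes:5.1f} min  {share:6.1%}")
```

With numbers like these, the test phase is the obvious first target, so parallel test splitting would come before any work on layer caching.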

4. What's your strategy for zero-downtime deployments?

What the interviewer grades: Breadth of actual implementation experience. They expect you to name and distinguish blue/green, rolling, and canary — but the real bar is nuance: can you articulate when each approach fails? (Blue/green doubles your resource cost; rolling makes rollback slow if a bad build propagates; canary requires your traffic routing to support it.) Bonus: mentioning feature flags as a deployment primitive.

5. How do you prevent "works on CI, fails in prod" bugs?

What the interviewer grades: Environment parity thinking. Expected answers include ephemeral environments that mirror prod config, environment-specific config injection (never hardcoded), contract testing between services, and post-deploy smoke tests against the real environment. "Run more tests" without mentioning parity is weak.

6. Describe how you'd implement progressive delivery with feature flags in a microservices system.

What the interviewer grades: Experience deploying to real production with real risk controls. They want to hear about a feature flag service (LaunchDarkly, Unleash, custom), flag evaluation at the service boundary (not just the frontend), the strategy for targeting (% of users, specific cohorts), and how you define "done" for progressive rollout (metric stable → full ship vs. metric degraded → kill switch).
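Percentage targeting with a kill switch can be sketched in a few lines. This is a minimal illustration of deterministic bucketing, not how LaunchDarkly or Unleash implement it; the names `bucket` and `is_enabled` are hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Deterministic percentage rollout; the kill switch always wins."""
    if kill_switch:
        return False
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return bucket(flag_key, user_id) < rollout_percent
```

Because the bucket depends only on the flag key and user ID, every service that evaluates the flag gives the same user the same answer, and raising the percentage only ever adds users to the enabled cohort.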

Kubernetes and Containers

7. Explain the difference between a Deployment, StatefulSet, and DaemonSet. When do you use each?

What the interviewer grades: Not terminology, but operational judgment. Deployment = stateless workloads; StatefulSet = stateful workloads where pod identity and persistent storage matter (databases, Kafka); DaemonSet = one pod per node (log collectors, monitoring agents). The bar is knowing WHY, not just WHAT — and a StatefulSet gotcha: ordering guarantees during rolling updates can slow things dramatically.

8. How does Kubernetes scheduling work, and how have you influenced it?

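What the interviewer grades: Whether you've actually operated clusters under resource pressure. They expect: the scheduler first filters out nodes that cannot run the pod (feasibility, historically called predicates), then scores the remaining nodes (optimality, historically called priorities) and binds the pod to the best one; you influence it via resource requests, nodeSelectors, affinity/anti-affinity rules, and taints/tolerations. Limits and PodDisruptionBudgets belong in the same conversation but act later: limits are enforced at runtime, and PDBs cap voluntary evictions rather than steer placement. Candidates who've only done default scheduling haven't operated at scale.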

9. A pod is stuck in CrashLoopBackOff. Walk me through your debugging process.

What the interviewer grades: Systematic debugging discipline. The sequence: kubectl describe pod → events section (OOMKilled? Liveness probe failing? Image pull error?); kubectl logs <pod> --previous (last crash's logs); check resource limits (OOM); check the liveness/readiness probe config. Strong candidates narrate as they debug, not after — they show their work.

10. How do you handle configuration management across multiple Kubernetes clusters?

What the interviewer grades: Operational maturity. Expected tools: Helm with environment-specific values files, or Kustomize overlays, or GitOps (ArgoCD/Flux) with environment branches. The key insight interviewers listen for: config drift is the enemy — every cluster should be reconciled from source of truth, not manually patched.

11. What's your approach to Kubernetes resource requests and limits, and what happens when you get them wrong?

What the interviewer grades: Production scars. Right sizing: requests = what the pod needs in normal operation; limits = the ceiling. Getting it wrong: under-requesting = overcommitted nodes, noisy neighbors, and evictions or OOM kills under memory pressure; over-requesting = poor bin packing, wasted cost. No memory limit at all = one runaway pod can exhaust the node and take its neighbors down with it. They want to hear that you monitor actual resource usage (metrics-server, VPA recommendations) and iterate.

12. How do you implement network policies in Kubernetes, and why do most teams skip them?

What the interviewer grades: Security depth. Most teams skip NetworkPolicy because the default is "allow all" and adding policy requires thinking about every service-to-service dependency — which is painful but necessary for zero-trust. Expected answer: default-deny-all ingress + explicit allow rules per workload, enforced by a CNI plugin that supports it (Calico, Cilium — not Flannel). Mentioning that policy is useless without a supporting CNI is senior signal.

Infrastructure as Code

13. Terraform vs. Pulumi vs. CDK — how do you choose?

What the interviewer grades: Opinionated thinking, not tool evangelism. Expected: Terraform = mature, declarative, language-agnostic, massive provider ecosystem, but HCL is limited for complex logic; Pulumi/CDK = real programming languages (loops, functions, types), better for teams that live in code and need complex abstractions. The real question is team skills + ecosystem needs, not which tool is "best."

14. How do you manage Terraform state safely in a team environment?

What the interviewer grades: Whether you've been burned by state corruption. Expected: remote state (S3 + DynamoDB locking, Terraform Cloud), never local state in repos, state locking to prevent concurrent applies, workspace or directory-per-env isolation (not one giant state file for everything), and periodic drift checks with terraform plan -refresh-only (the standalone terraform refresh command is deprecated).

15. Describe a time IaC changes caused a production incident. What happened?

What the interviewer grades: Honesty + incident ownership. "It's never happened" is a red flag at senior level — it means you haven't operated at scale. They want: what changed, what broke, how you diagnosed and rolled back or forward, and what the process change was. Candidates who blame others or minimize impact don't clear senior bar.

16. How do you test infrastructure code before applying it to production?

What the interviewer grades: IaC quality discipline. Passing bar: terraform plan reviewed by a second engineer, terraform validate, policy-as-code (OPA/Conftest or Sentinel for Terraform Cloud), and ideally a non-prod apply before prod. Strong candidates mention Terratest or LocalStack for automated tests of module behavior. "I just look at the plan" is insufficient.

Monitoring, Observability, and SRE

17. Explain the difference between metrics, logs, and traces. When do you need all three?

What the interviewer grades: Observability fluency. Metrics = aggregated numbers over time (throughput, error rate, latency p50/p99); logs = discrete events (errors, audit trail); traces = distributed request paths (what called what, and how long each hop took). You need all three for complex microservices incidents: metrics tell you something's wrong, logs tell you what error, traces tell you where in the call graph. Single-service apps often survive on metrics + logs alone.

18. How do you define and implement SLIs, SLOs, and error budgets?

What the interviewer grades: SRE thinking, not just terminology. SLI = the metric you measure (e.g., % of requests completing in <200ms); SLO = the target (e.g., 99.5% over 28 days); error budget = the 0.5% you're allowed to spend on incidents or risky deploys. The bar: error budgets should influence deploy decisions — if you're burning fast, you slow down releases. Teams that set SLOs but never consult the error budget haven't operationalized SRE.
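The arithmetic behind an error budget is short enough to sketch. This is an illustration using the 99.5% over 28 days example above; the function names and request counts are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = (1.0 - slo_target) * total_requests
    return 1.0 - bad_requests / allowed_bad


# 99.5% over 28 days leaves roughly 201.6 minutes of budget.
print(round(allowed_downtime_minutes(0.995, 28), 1))
# 2,000 bad requests out of 1,000,000 spends 40% of the budget.
print(round(budget_remaining(0.995, 1_000_000, 2_000), 3))
```

A release policy can then key off the second number: ship freely while plenty of budget remains, slow down as it approaches zero.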

19. Walk me through how you'd respond to a P1 incident where latency spiked 10x.

What the interviewer grades: Incident command discipline. Strong structure: (1) acknowledge + take incident commander role; (2) check recent changes (deploys, config, infra events) — most latency spikes correlate to a change; (3) check resource saturation (CPU, memory, DB connections, downstream service latency); (4) isolate scope (one region? one pod? one customer?); (5) mitigate first (rollback, scale, kill switch), investigate root cause after. Candidates who investigate before mitigating lose production minutes.

20. How do you build alerting that doesn't produce alert fatigue?

What the interviewer grades: Alert quality thinking. Alert fatigue root causes: alerting on signals that aren't actionable, thresholds set too tight, no routing by severity, no runbooks. Fix: alert on symptoms, not causes ("users are affected," not "CPU is high"), use multi-burn-rate alerts tied to SLO consumption, and give every alert a runbook link and an on-call owner. Bonus: mention an alert review cadence (flapping alerts get fixed, silenced, or deleted).
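A burn-rate alert can be sketched as two small functions. This is a simplified illustration: the threshold is a parameter you would tune, and the 14.4 used in the example is the value commonly cited for a 1-hour/5-minute window pair against a 30-day SLO, so treat it as an assumption rather than a rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window
    shows it is still happening, which avoids paging after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Sustained 8-9% errors against a 99.5% SLO: page.
print(should_page(0.08, 0.09, slo_target=0.995, threshold=14.4))
# The long window is still hot but the short window has recovered: no page.
print(should_page(0.08, 0.001, slo_target=0.995, threshold=14.4))
```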

21. What's the difference between availability and reliability? How do you measure each?

What the interviewer grades: Precision of language. Availability = the service is up and reachable (measured as uptime %); reliability = the service does what it's supposed to do correctly and consistently (measured as error rate + correctness). A service can be 100% available but unreliable (always returns 200 but with corrupt data). Interviewers at SRE-focused companies care about this distinction.
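The distinction is easy to see when both numbers are computed from the same probe results. This is a minimal sketch with a hypothetical `Probe` record; real measurements would come from synthetic checks or request logs with a correctness validation step.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    reachable: bool  # did the service answer at all?
    correct: bool    # was the answer right when validated?


def availability(probes: Sequence[Probe]) -> float:
    """Fraction of probes that got any answer."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(p.reachable for p in probes) / len(probes)


def reliability(probes: Sequence[Probe]) -> float:
    """Fraction of probes that got a correct answer."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(p.reachable and p.correct for p in probes) / len(probes)


# Always answers, but half the answers are wrong.
sample = [Probe(True, True), Probe(True, False), Probe(True, True), Probe(True, False)]
print(availability(sample), reliability(sample))
```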

22. How do you do post-incident reviews, and how do you make them blameless?

What the interviewer grades: Organizational maturity awareness. Expected: write a timeline (not a blame narrative), identify contributing factors (not root causes, since complex systems rarely have single root causes), extract action items with owners and deadlines. Blameless = focus on system/process fixes, not people. Red flag: "we figure out who caused it and retrain them." Senior signal: "we ask what system properties allowed this failure to reach users."

Networking and Security

23. How does service mesh work, and when is it worth the operational overhead?

What the interviewer grades: Whether you've evaluated the trade-off honestly. Service mesh (Istio, Linkerd) = sidecar proxies that give you mTLS, traffic management, observability at the network layer without app code changes. Worth it when: you have many services that need mutual TLS, fine-grained traffic control (A/B at mesh level), or L7 observability without instrumentation. NOT worth it for: small teams, few services, or teams without the ops maturity to debug proxy-level issues.

24. Explain how you'd implement least-privilege access in a cloud environment.

What the interviewer grades: IAM depth. Passing bar: service accounts/roles with only the permissions they need (not admin), no credentials in code or env vars (use IAM roles + IRSA or Workload Identity), VPC segmentation so services can't reach what they don't need, audit logging (CloudTrail, GCP Audit Logs) for all API calls. Strong signal: mention privilege escalation paths and how you prevent them.

25. What's your approach to secrets rotation, and how do you do it without downtime?

What the interviewer grades: Operational secrets hygiene. Rotation without downtime requires: application supports multiple valid secrets simultaneously during rollout (two-version key support), new secret written to Secrets Manager before old one expires, application re-reads the secret dynamically (not at startup only), old secret invalidated after rollout confirmed. Candidates who say "we rotate and restart pods" understand the mechanics but not the zero-downtime requirement.
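Two-version key support can be sketched with HMAC signatures. This is a minimal illustration, assuming the application can be handed the list of currently valid secrets; the function names and key values are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable


def sign(secret: bytes, message: bytes) -> str:
    """Sign a message with one secret using HMAC-SHA256."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, accepted_secrets: Iterable[bytes]) -> bool:
    """Accept a signature made with any currently valid secret."""
    return any(
        hmac.compare_digest(sign(secret, message), signature)
        for secret in accepted_secrets
    )


old_key, new_key = b"old-secret", b"new-secret"  # placeholder values
message = b"payload"

# During rollout both keys verify, so old and new signers coexist.
during_rollout = [new_key, old_key]
print(verify(message, sign(old_key, message), during_rollout))

# After rollout is confirmed, the old key is dropped and stops verifying.
after_rollout = [new_key]
print(verify(message, sign(old_key, message), after_rollout))
```

The window where both keys are accepted is what removes the need to restart everything at the same instant.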

Cloud Cost and Efficiency

26. Our cloud bill doubled in a month. How do you diagnose it?

What the interviewer grades: Cost debugging structure. Start with the billing console's "top cost drivers" breakdown (service, region, account). Look for: new services/resources (cost anomaly alerts help), forgotten resources (snapshots, old EBS volumes, idle NAT gateways), transfer costs (cross-AZ, cross-region, egress), and compute scaling (did autoscaling go wild?). Strong candidates mention cost anomaly alerts and budget alarms as prevention, not just diagnosis.
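The "top cost drivers" comparison can be sketched as a month-over-month diff. This is an illustration that assumes you have exported per-service totals from your billing data; the service names and dollar amounts are made up.

```python
from typing import Dict, List, Tuple


def top_cost_increases(previous: Dict[str, float], current: Dict[str, float],
                       limit: int = 3) -> List[Tuple[str, float]]:
    """Return the services whose spend grew the most, largest increase first."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


# Hypothetical bill that doubled from 50,000 to 100,000.
last_month = {"compute": 40000.0, "storage": 8000.0, "data_transfer": 2000.0}
this_month = {"compute": 42000.0, "storage": 9000.0,
              "data_transfer": 31000.0, "nat_gateway": 18000.0}

for service, delta in top_cost_increases(last_month, this_month):
    print(f"{service:<14} +{delta:,.0f}")
```

Services missing from the earlier month count as zero, so a brand-new line item surfaces as a full-size increase instead of being hidden.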

27. How do you right-size compute resources across a large fleet?

What the interviewer grades: Cost engineering maturity. Tools: CloudWatch/Cloud Monitoring utilization metrics + AWS Compute Optimizer / GCP Recommender, Datadog/Grafana resource utilization dashboards, VPA for Kubernetes. Strategy: establish a baseline (30-day p95 CPU/memory), size to p95 not p100, and use spot/preemptible for fault-tolerant workloads. Bonus: mention FinOps practice (shared cost visibility, team-level chargeback).
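Sizing to p95 instead of p100 can be sketched with a nearest-rank percentile. This is a minimal illustration; the `headroom` multiplier is a hypothetical safety margin added here, and the sample values are synthetic.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def recommend_request(samples: Sequence[float], pct: float = 95.0,
                      headroom: float = 1.2) -> float:
    """Suggest a resource request: the chosen percentile plus headroom."""
    return percentile(samples, pct) * headroom


# Synthetic CPU usage in millicores: steady load plus one spike.
usage = list(range(1, 100)) + [1000]
print(percentile(usage, 95), percentile(usage, 100))
```

The single spike dominates p100 but leaves p95 untouched, which is the argument for sizing to p95 and letting limits or autoscaling absorb the outliers.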

Systems Thinking and Architecture

28. How do you design for failure in a distributed system?

What the interviewer grades: The depth of failure-mode thinking. They expect: retries with exponential backoff and jitter (to avoid thundering herd), circuit breakers (to stop hitting a failing dependency), bulkheads (to isolate failures so one service's outage doesn't cascade), timeouts on every network call (many clients default to no timeout, which leaves goroutines or threads hung), and idempotency for safe retries. The test: can you give a concrete example of each from something you've built?
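Retries with exponential backoff and jitter can be sketched as below. This is one common variant ("full jitter"), not the only one; the function name and default values are hypothetical, and `sleep` and `rng` are injectable only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call `call` until it succeeds or attempts run out.

    Each wait is a random value between 0 and an exponentially growing
    ceiling, so many clients do not retry in lockstep.
    Only safe when `call` is idempotent.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

In real use, `retry_on` would be narrowed to errors that are actually transient (timeouts, connection resets) so that bugs and validation failures are not retried.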

29. How do you think about database migrations at scale with zero downtime?

What the interviewer grades: Production database operational experience. The expand-contract pattern: (1) add the new column as nullable; (2) deploy code that writes to both old and new; (3) backfill in batches; (4) make the new column not-null; (5) switch reads to the new column; (6) stop writing to the old column; (7) drop it. Any migration that holds a long exclusive lock on a large table risks an outage, so strong candidates know about online DDL, gh-ost for MySQL, and pg_repack for Postgres.
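The expand and backfill steps can be sketched against an in-memory SQLite database. SQLite is only a stand-in here so the example runs anywhere; locking behavior, batch sizes, and DDL cost differ on MySQL and Postgres. The table and column names are hypothetical.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Copy full_name into display_name a few rows at a time.

    Small batches keep each transaction short, so no single statement
    holds a lock for long. Returns the number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = full_name "
            "WHERE id IN ("
            "  SELECT id FROM users "
            "  WHERE display_name IS NULL AND full_name IS NOT NULL "
            "  LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (full_name) VALUES (?)",
    [("Ada",), ("Linus",), ("Grace",), ("Ken",), ("Barbara",)],
)

# Expand: add the new column as nullable so existing writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

updated = backfill_in_batches(conn, batch_size=2)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
print(updated, remaining)
```

Only once `remaining` is zero, and dual writes are confirmed, does it become safe to add the not-null constraint and begin the contract steps.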

30. A microservice is a shared dependency for 20 other services. How do you deploy changes to it safely?

What the interviewer grades: API contract thinking + blast radius awareness. Strategy: versioned APIs (don't break old consumers), consumer-driven contract tests (Pact), canary deploys with traffic splitting, observability on downstream error rates during rollout, rollback plan. Candidates who answer "just deploy and watch" haven't operated a true shared dependency.

Culture and Collaboration

31. How do you get developers to write better runbooks and playbooks?

What the interviewer grades: Organizational influence without authority. They want: runbook-as-code (stored with the service, updated with the code), making runbooks required for on-call rotation (no runbook = you're writing one at 3am), gamification/recognition, and building runbook creation into the post-incident action items. "Send a Slack message asking people to write runbooks" doesn't scale.

32. How do you build a culture where developers feel responsible for production?

What the interviewer grades: "You build it, you run it" philosophy. Concrete levers: developers are in the on-call rotation (even if tiered), deploy dashboards visible to the team, blameless post-mortems that don't punish risk-taking, error budgets that create shared incentives. The key insight: developers who never see their own outages don't build reliable software.

33. Describe how you'd introduce DevOps practices to a team that's still manually deploying.

What the interviewer grades: Change management pragmatism. The answer isn't "implement everything at once." Strong candidates: start with CI (automated testing on every PR — quick win, low risk), then CD to staging (builds confidence), then production CD. They identify the highest-risk manual step first and automate it. They also mention buy-in: showing the team a fast feedback loop (5-min test run vs. 30-min manual) creates its own momentum.

Interview Wildcards

34. What's the hardest on-call incident you've ever worked? What made it hard?

What the interviewer grades: Real production depth. They're not grading your war story — they're listening for: how you stayed calm under pressure, how you organized the response, how you communicated to stakeholders, and what you changed afterward. Candidates who describe simple incidents (one service crashed, I restarted it) haven't been through the hard ones. Candidates who blame others for hard incidents fail on ownership.

35. How do you stay current with DevOps tooling without chasing every new thing?

What the interviewer grades: Learning discipline and judgment. "I follow newsletters" is table stakes. Strong answer: filter by signal (what are reliability-focused teams at companies I respect adopting?), evaluate by problem fit (does this solve a real problem I have?), time-box proof-of-concepts, and only adopt when the switch cost is lower than the ongoing pain of the current tool.

36. How do you prioritize reliability work against feature velocity pressure?

What the interviewer grades: Stakeholder communication + conviction. The SRE answer: error budget policy (if budget is green, ship features; if red, reliability work gets priority — the budget makes it objective, not a negotiation). The non-SRE answer: frame reliability as a business risk ("the last incident cost us X hours of eng time and Y in revenue") to get engineering leadership buy-in. Candidates who always capitulate to "just ship" haven't advocated for reliability effectively.

37. What would you do differently in your current infrastructure if you could start over?

What the interviewer grades: Critical self-reflection and architectural wisdom. "Nothing, it's great" is a red flag. They want specific answers: "I'd have invested in a service mesh earlier," "I'd have built cost tagging from day one," "I'd have standardized on one logging format across all services instead of letting each team choose." This question reveals whether you've learned from your own decisions.

38. How do you handle a situation where you disagree with an engineering decision that affects reliability?

What the interviewer grades: Influence and conviction without hierarchy. Expected: you raise the concern with data (what's the risk, how do you quantify it), propose alternatives, escalate through appropriate channels if overruled on something high-stakes. What they don't want: "I just go along with it" (no backbone) or "I refuse to ship it until they change it" (not pragmatic).

39. Your team wants to adopt Kubernetes, but your ops maturity is low. Do you do it?

What the interviewer grades: Honesty about operational risk. The honest answer for most teams: Kubernetes carries significant operational overhead (especially stateful workloads, networking, upgrades). Managed services (EKS, GKE, AKS) reduce the burden but don't eliminate it. Strong candidates ask: what problem are we actually solving? Is it container orchestration, or scaling, or deployment consistency? Sometimes the right answer is ECS, Cloud Run, or even a managed PaaS rather than Kubernetes.

40. What does "done" mean for a DevOps team?

What the interviewer grades: Philosophy of the discipline. The answer isn't "when the pipeline passes." It's: code is deployed, running in production, monitored with alerts and dashboards, documented with runbooks, tested against failure modes, and the SLO is green. Strong candidates tie "done" to production reliability, not just successful deployment.

FAQ

What's the hardest part of a DevOps interview? The hardest part is demonstrating that you've operated systems under real pressure, not just configured tools in a safe environment. Interviewers can tell the difference because they ask follow-up questions: "What happened when X failed?" or "What would you have done differently?" Candidates without real production experience run out of specific answers quickly. The prep that matters most: review your actual incidents and document them as stories you can tell.

How long does it take to prepare for a DevOps interview? A focused 4-week plan: Week 1 — review CI/CD and IaC fundamentals; Week 2 — Kubernetes operations (scheduling, debugging, networking); Week 3 — observability, SRE concepts, and incident response; Week 4 — system design and architecture trade-offs. Run one mock interview per week where you talk through a real system you've built.

What's the difference between a DevOps engineer and an SRE? In practice, the roles overlap heavily at most companies. The classic distinction: DevOps engineers focus on CI/CD, tooling, and developer productivity; SREs focus on production reliability, error budgets, and service SLOs. At FAANG companies, SRE is a distinct role with a software engineering interview bar. At smaller companies, a "DevOps engineer" often does both. Prepare for SLO/error-budget questions regardless of the title.

Are Kubernetes questions always in DevOps interviews? At most companies hiring in 2026, yes — Kubernetes has become the default orchestration layer and is fair game for any DevOps or platform engineer role. The depth varies: a startup might ask operational basics; a FAANG SRE team might go deep on scheduler internals, custom controllers, or multi-cluster federation. Know the operational layer (scheduling, debugging, networking) thoroughly, and know the architecture conceptually.

What salary should I expect as a DevOps engineer in 2026? US ranges: mid-level DevOps engineer $130K–$180K; senior DevOps/Platform engineer $180K–$250K; Staff/Principal $250K–$350K+ at large tech companies. FAANG SRE roles at the top end include significant RSU grants. Practice salary negotiation alongside technical prep — the two compound.
