Infrastructure Engineer Interview Questions & Answers
Preparing for an Infrastructure Engineer interview can feel overwhelming—you’re not just walking into a conversation about your skills, you’re preparing to discuss how you’ll protect and scale the technological backbone of an organization. The questions you’ll face will span everything from your hands-on experience managing servers and networks to how you’d architect solutions for complex, real-world problems.
This guide gives you concrete infrastructure engineer interview questions and answers that reflect what you’ll actually encounter, plus strategies for thinking through tough technical scenarios, behavioral questions that let you showcase your problem-solving mindset, and questions you should ask to evaluate whether the role is right for you.
Common Infrastructure Engineer Interview Questions
What experience do you have with cloud platforms like AWS, Azure, or GCP?
Why they ask: Cloud expertise is now table stakes in infrastructure engineering. Hiring managers want to know which platforms you’ve worked with, how deeply, and whether you can navigate their specific environment.
Sample Answer: “I’ve spent the last three years working primarily with AWS. I manage EC2 instances, RDS databases, and S3 storage across multiple environments. In my last role, I orchestrated a migration of our on-premises infrastructure—about 50 VMs—into a hybrid setup, keeping some legacy systems on-prem while moving our web applications to AWS. I handled the networking piece, set up VPCs, security groups, and NAT gateways to keep traffic flowing securely between environments. I’ve also done some work with Azure when a client needed integration between their Microsoft stack and cloud resources, so I understand the conceptual overlaps but recognize each platform has its own quirks.”
Personalization Tip: Research the company’s cloud footprint before your interview. If they use AWS, dive into specifics—mention particular services you’ve used. If they’re multi-cloud, talk about how you’ve managed complexity across platforms.
How do you approach monitoring and alerting for infrastructure?
Why they ask: A reactive infrastructure engineer is constantly putting out fires. They want someone who prevents problems through proactive monitoring and escalates intelligently.
Sample Answer: “I use a multi-layer approach. For real-time metrics, I’ve implemented Prometheus to scrape system and application metrics, then visualize them in Grafana. The key is setting alerts that matter—not so sensitive you get alert fatigue, but sensitive enough to catch issues early. For example, I set CPU thresholds at 80% for gradual escalation and 95% for immediate alerts, and I monitor disk usage because running out of space is preventable but catastrophic. Beyond metrics, I integrate logs from applications using the ELK stack, which helps me spot patterns that raw metrics might miss. I also configure dependency tracking—if a database is down, I know immediately which services are affected rather than getting flooded with alerts from everything downstream.”
Personalization Tip: Mention specific tools you’ve actually used. If the company uses New Relic or Datadog instead of Prometheus, acknowledge you haven’t used their exact stack but explain your monitoring philosophy so it’s clear you’ll adapt quickly.
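To make the alerting philosophy above concrete, here is a minimal Python sketch of two ideas from the sample answer: tiered CPU thresholds and dependency-aware alert suppression. The 80/95 thresholds are the example figures from the answer, and the service topology is a hypothetical assumption, not a universal default:

```python
def cpu_alert_level(cpu_percent, warn=80.0, critical=95.0):
    """Map a CPU reading to a severity tier.

    Mirrors the two-tier thresholds from the sample answer:
    a warning tier for gradual escalation and a critical tier
    for immediate paging.
    """
    if cpu_percent >= critical:
        return "critical"
    if cpu_percent >= warn:
        return "warning"
    return "ok"


# Hypothetical dependency chain: web depends on api, api on db.
DEPENDS_ON = {"web": "api", "api": "db"}

def should_page(service, down_services):
    """Page only for the most upstream failing service.

    If a dependency is already down, alerts for everything
    downstream are suppressed so responders see the root cause
    instead of a flood.
    """
    upstream = DEPENDS_ON.get(service)
    while upstream is not None:
        if upstream in down_services:
            return False  # an upstream failure explains this one
        upstream = DEPENDS_ON.get(upstream)
    return service in down_services
```

With `db` down, only the `db` alert pages; the `api` and `web` alerts are suppressed, which is the dependency-tracking behavior described in the answer.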
Describe your experience with Infrastructure as Code (IaC). What tools have you used?
Why they ask: IaC is how modern infrastructure gets built consistently and at scale. They’re assessing whether you can code your infrastructure, manage it in version control, and automate deployments.
Sample Answer: “I primarily use Terraform for IaC. I define infrastructure declaratively—networks, compute instances, databases—all in code, which gets version controlled in Git alongside our application code. This gives us reproducibility and audit trails. I’ve used it to spin up entire environments from scratch, which has been invaluable for testing disaster recovery scenarios without manual toil. I also have experience with CloudFormation on AWS projects, though I generally prefer Terraform’s cloud-agnostic approach when we’re building hybrid environments. Beyond templating, I’ve automated deployments through GitOps workflows—code changes trigger infrastructure updates automatically, which reduces manual errors and speeds up iteration.”
Personalization Tip: If you’ve only used CloudFormation, be honest about that but show you understand the principles. Mention that you’re comfortable picking up Terraform or other IaC tools because you grasp the underlying concepts.
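The declarative principle behind these tools, define desired state and let the tool compute the changes, can be illustrated with a toy reconciliation function in Python. This is a conceptual sketch of the plan step, not how Terraform is actually implemented:

```python
def plan(desired, actual):
    """Diff desired state against actual state into a list of
    (action, resource) pairs, the core idea behind a declarative
    IaC tool's plan step."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)
```

Applying the plan and re-running the diff yields an empty change set, which is exactly the reproducibility property the sample answer highlights.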
How do you ensure high availability and disaster recovery?
Why they ask: Downtime costs money. This question reveals your philosophy on resilience and whether you’ve actually thought through “what if the entire data center fails?”
Sample Answer: “High availability and disaster recovery are different problems, so I tackle them separately. For HA, I use redundancy at every layer—multiple instances behind a load balancer, replicated databases, auto-scaling groups that spin up replacements if instances fail. I’ve deployed across multiple availability zones so a single zone’s failure doesn’t take us down. For disaster recovery, I establish RTO and RPO targets first—how quickly do we need to recover, and how much data can we afford to lose? Then I design backward from there. We run automated daily backups of databases and critical file systems, store them in geographically separate regions, and document the recovery procedures. The critical part: I actually test these recovery plans quarterly by doing disaster recovery drills. It’s revealed gaps every time, and it’s better to find them in a drill than during an actual outage.”
Personalization Tip: Use numbers where possible. “We achieve 99.95% uptime” is better than “high availability.” Mention specific RTO/RPO targets if you’ve worked with them.
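It helps to know the arithmetic behind those numbers: an availability target is really a downtime budget, and redundant replicas multiply failure probabilities together. A quick Python sketch, assuming independent failures:

```python
def downtime_budget_minutes(availability_pct, period_days=365):
    """Minutes of downtime a given availability target allows.

    99.95% over a year works out to roughly 263 minutes,
    about 4.4 hours.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)


def parallel_availability(a, n=2):
    """Availability of n redundant replicas that fail independently,
    each with availability a (an idealized model of running across
    multiple availability zones)."""
    return 1 - (1 - a) ** n
```

Two independent 99.9% zones in parallel give 99.9999% in this idealized model; real zones are not perfectly independent, which is part of why multi-region disaster recovery exists.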
Tell me about a time you had to troubleshoot a complex infrastructure issue. Walk me through your process.
Why they ask: This tests your analytical methodology and composure under pressure. They want to see logic, not luck.
Sample Answer: “Once, our application users started experiencing intermittent timeouts during peak traffic hours. I started by checking the obvious—was it the application itself? I reviewed app logs and didn’t see errors, so I looked at system metrics on the web servers. CPU and memory looked normal, so I dug into network metrics and noticed network throughput was occasionally spiking to near capacity. I traced it to the database server—queries were suddenly running slower, causing connection buildup. I checked database logs and found a query that used to run in milliseconds was now taking 30 seconds. It turned out a recent data migration had changed the table structure without updating the indexes. I added the missing indexes, and response times normalized. What I did right: I didn’t assume—I systematically isolated the problem layer by layer. What I learned: I now have automated index health checks running weekly.”
Personalization Tip: Pick a real example. If you’re early in your career, talk about smaller issues—the methodology matters more than the scale. Show how you’d approach similar issues in the future.
How do you handle security in your infrastructure?
Why they ask: Security breaches are expensive and reputation-damaging. They need to know you don’t treat security as an afterthought.
Sample Answer: “Security is layered—I don’t rely on any single control. At the network level, I use security groups and NACLs to implement least privilege access, only allowing the specific ports and protocols needed. I enable encryption in transit (TLS) and at rest for sensitive data. For access control, I’ve moved away from shared passwords toward SSH keys with short-lived credentials, and I implement MFA wherever possible. I also run vulnerability scans regularly and stay on top of patching. In my last role, I worked with our security team to implement a secrets management system using HashiCorp Vault so database credentials and API keys aren’t hardcoded in configuration files. I also maintain audit logs and review them for suspicious activity. The mindset is: assume things will go wrong, and make sure you can detect and respond quickly.”
Personalization Tip: Research the company’s likely security concerns before the interview. If they’re in healthcare or fintech, they’ll have compliance requirements you should show you understand.
Describe your experience with containerization and orchestration technologies like Docker and Kubernetes.
Why they ask: Containers have become foundational. They want to know if you can manage containerized workloads and orchestration platforms.
Sample Answer: “I’ve used Docker extensively for packaging applications consistently across environments. I build images with specific base operating systems and dependencies, which eliminates the ‘it works on my machine’ problem. For orchestration, I’ve managed small Kubernetes clusters—maybe 5-10 nodes for internal services and side projects. I can write YAML manifests for deployments, services, and persistent volumes, and I understand concepts like namespaces, labels, and selectors. That said, Kubernetes is deep, and I’d say I’m competent for small to medium clusters but not yet at the level where I’m designing multi-region Kubernetes infrastructure. I’m actively learning more through personal projects and online courses. Docker I feel very solid with—I’ve built many production images and optimized them for size and security.”
Personalization Tip: Be honest about your depth. Hiring managers respect humility, and claiming deep Kubernetes expertise when you don’t have it will come out in technical conversations. Show you’re learning.
How do you stay current with infrastructure trends and new technologies?
Why they ask: Infrastructure engineering changes constantly. They want people who are genuinely curious and committed to learning.
Sample Answer: “I read infrastructure-focused newsletters like Last Week in AWS and Hacker News, and I follow several engineers on Twitter who share industry insights. Beyond passive reading, I do hands-on learning—I set up a small homelab where I experiment with new technologies before deciding whether they’re worth adopting. Recently, I completed a course on infrastructure automation using Ansible, which led me to propose implementing Ansible playbooks at work for system hardening, saving us significant time. I also attend local meetups when I can and watch conference talks from events like KubeCon and re:Invent. The key for me is balancing breadth—knowing what’s emerging—with depth—really understanding the tools I actually use.”
Personalization Tip: Reference specific resources and recent things you’ve learned. Vague answers like “I read blogs” don’t stand out.
How do you manage configuration management and deployments?
Why they ask: Configuration management is how you scale infrastructure without chaos. They want to know if you’re using tools or doing manual work.
Sample Answer: “I’ve used Ansible for configuration management—it’s agent-less and integrates well with Terraform in an Infrastructure as Code workflow. I write playbooks to configure servers consistently: installing packages, setting up monitoring agents, configuring firewalls. I store these in Git with version history, so we know exactly what changed and when. For deployments, I’ve built CI/CD pipelines using Jenkins and GitLab CI that automatically run tests, build artifacts, and deploy to staging and production. The goal is making deployments repeatable and lowering the risk of manual errors. I’ve also worked with Puppet in a previous role, which was more declarative. Both have the same core value—you define desired state and the tool enforces it.”
Personalization Tip: Match the tools you mention to the company’s stack if possible. If you don’t have direct experience with their tools, show you understand the concepts they’re built on.
What’s your approach to capacity planning?
Why they ask: Running out of disk space or CPU capacity leads to outages. They want someone who forecasts and plans ahead.
Sample Answer: “I use historical data and growth trends to forecast capacity. I pull metrics from our monitoring system—CPU, memory, disk, network—over time, usually the past 12 months, and identify trends. If we’re growing 10% month-over-month, I project forward six months and determine when we’ll hit 80% capacity, which is my signal to act. I’ve also set up auto-scaling in AWS so non-critical services scale automatically during traffic spikes, which handles short-term bumps without permanently increasing infrastructure. For databases, capacity planning is more manual—storage can’t always be grown transparently. I work with the DBA to monitor growth and provision additional storage before we hit limits. I also use this data to push back on over-provisioning; if we provision for a worst-case that never happens, we’re wasting budget.”
Personalization Tip: If you haven’t done formal capacity planning, explain how you’d approach it. The thinking matters more than past experience.
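The projection described above is simple compounding growth. A small Python sketch of the calculation, using the 10% growth rate and 80% threshold from the sample answer as illustrative figures:

```python
import math

def months_until_threshold(current_util, monthly_growth, threshold=0.80):
    """Months until utilization crosses the action threshold,
    assuming compounding month-over-month growth."""
    if current_util >= threshold:
        return 0
    ratio = threshold / current_util
    return math.ceil(math.log(ratio) / math.log(1.0 + monthly_growth))
```

At 50% utilization and 10% monthly growth, the 80% threshold is about five months out, which is the lead time you have to provision capacity or push back on demand.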
Describe a time you had to implement a significant infrastructure change or upgrade. How did you minimize downtime?
Why they ask: Major changes are risky. They want to see your planning, communication, and risk mitigation skills.
Sample Answer: “We upgraded our database cluster from PostgreSQL 11 to 13. The database runs 24/7, so downtime was unacceptable. I planned a rolling upgrade: I took one replica offline, upgraded it, tested it, then failed over the application to the upgraded replica. Then I upgraded the original primary. Total downtime was under 30 seconds during the failover. Before touching production, I tested the entire process on a staging environment that mirrored production—same data volume, same queries. I also communicated a maintenance window to the team with clear expectations about what might happen and how to verify everything was working. After the upgrade, I monitored performance closely for a week, comparing query times and resource usage to the old version.”
Personalization Tip: Choose an upgrade or change you’ve actually done. Walk through the specific steps you took to minimize risk.
How do you approach working with other teams—developers, security, operations?
Why they ask: Infrastructure doesn’t exist in a vacuum. They want to know if you can collaborate effectively.
Sample Answer: “I see infrastructure as a support function for what developers are building. When a developer asks for a new database or wants to add a service, I don’t just say ‘no’ or hand them a form. I try to understand what they’re trying to achieve, suggest options based on our infrastructure and constraints, and help them implement it. I’ve also built strong relationships with security—they tell me what compliance or security requirements matter for our industry, and I make sure those are baked into infrastructure from the start rather than bolted on later. With ops and other infrastructure engineers, I believe in documentation and knowledge sharing. When I implement something new, I document it so others can maintain it. I also make time to help junior engineers debug issues.”
Personalization Tip: Specific collaboration examples land better than generic statements. Mention a time you worked with another team to solve a problem.
Tell me about your experience with load balancing and traffic management.
Why they ask: Most modern infrastructure needs to distribute traffic across multiple servers for reliability and performance.
Sample Answer: “I’ve configured multiple types of load balancers depending on the use case. For Layer 4 (network level) load balancing, I’ve used AWS Network Load Balancers to distribute TCP/UDP traffic with very low latency. For Layer 7 (application level), I’ve used Application Load Balancers and also Nginx as a reverse proxy. The choice depends on what you’re optimizing for—NLB when you need ultra-high throughput, ALB when you want to route based on hostnames or URL paths. I’ve also implemented health checks so failed backends are automatically removed from the pool, and I’ve configured sticky sessions where needed for stateful applications. One thing I’ve learned: load balancer configuration isn’t set-and-forget. You have to monitor connection counts and latency to know if you need to adjust timeouts or add more backends.”
Personalization Tip: Mention specific load balancers you’ve used and why you chose them for particular scenarios.
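The health-check behavior mentioned in the answer, removing failed backends from the pool, reduces to simple bookkeeping. A minimal Python sketch; the three-failure threshold is a common but illustrative default:

```python
def healthy_backends(pool, consecutive_failures, max_failures=3):
    """Backends still eligible for traffic.

    A backend is ejected once its consecutive health-check
    failures reach the threshold, and rejoins once a passing
    check resets its counter.
    """
    return [b for b in pool
            if consecutive_failures.get(b, 0) < max_failures]
```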
How do you document infrastructure, and why does it matter?
Why they ask: Undocumented infrastructure is fragile. One person leaves and nobody else knows how to operate it.
Sample Answer: “I document infrastructure in multiple ways depending on the audience. For other engineers, I maintain runbooks—step-by-step guides for common tasks like deploying a new service or responding to specific alerts. I keep these in a Git repo or wiki so they stay current. I also diagram our architecture at a high level—VPCs, databases, services, how they connect—so new team members can grasp the topology quickly. For code, I comment on non-obvious infrastructure decisions: why we chose this particular architecture, what we tried that didn’t work, what assumptions we’re making. The thing is, documentation tends to rot, so I’ve found the best approach is keeping it in the same repo as the code it describes, so it’s version controlled and updated together.”
Personalization Tip: If you’ve used specific tools for documentation (Confluence, Notion, diagrams.net), mention them. Show that you value the practice even if you don’t have perfect examples.
What’s your experience with backup and recovery procedures?
Why they ask: Data loss is catastrophic. They need to know you have serious backup strategies, not just hope.
Sample Answer: “Backup strategy depends on what you’re protecting and your RPO. For databases, I implement continuous replication to a standby database in another availability zone, so if the primary fails, we failover to the replica with minimal data loss. I also take daily snapshots to S3 in a separate AWS region, which protects against regional outages or accidental deletion. For configuration and code, that’s version controlled in Git with backups to multiple remote repositories. I’ve tested recovery procedures—actually restored from backups to a test environment to verify they work and measure how long recovery takes. I’ve found that backup systems that have never been tested don’t work when you need them. I also monitor backup jobs; if a backup fails silently, you only discover it during a disaster.”
Personalization Tip: Share a specific example of a backup that actually saved you. That resonates more than backup theory.
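Monitoring backup jobs for silent failure, the last point in the answer, is mostly a freshness check. A minimal Python sketch; the 26-hour window for a daily job is an illustrative choice (24 hours of schedule plus slack):

```python
def stale_backups(last_success_ts, now_ts, max_age_hours=26):
    """Backup jobs whose most recent success is older than the
    allowed window, so a silently failing daily job is flagged
    within roughly one cycle. Timestamps are epoch seconds."""
    limit = max_age_hours * 3600
    return sorted(job for job, ts in last_success_ts.items()
                  if now_ts - ts > limit)
```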
Behavioral Interview Questions for Infrastructure Engineers
Behavioral questions are best answered with the STAR method (Situation, Task, Action, Result), which structures how you describe real challenges you’ve handled. These questions assess soft skills—teamwork, communication, adaptability—that matter as much as technical chops.
Tell me about a time you had to work on a team to solve a critical infrastructure problem.
Why they ask: Infrastructure issues rarely happen in isolation. They want to see how you collaborate under pressure.
STAR Framework:
- Situation: Set the scene with specific details. When did this happen? What was the infrastructure context?
- Task: What was your responsibility in solving this?
- Action: What did you actually do? Who did you work with? What was your process?
- Result: What was the outcome? What did you learn?
Sample Answer: “Two years ago, our primary database server became unresponsive during a peak traffic period. As the Infrastructure Engineer on call, I had to coordinate with the DBA team and application engineering. I immediately started pulling system metrics and noticed disk I/O was maxed out. I communicated findings to the DBA—they found a runaway query from a recent deployment. While they worked on killing that query and optimizing it, I coordinated with app engineering to roll back the problematic code. During this, I kept the team in a shared Slack channel providing real-time updates. We restored service in about 45 minutes. Afterward, I helped create a monitoring alert for high disk I/O and a runbook for this specific scenario, so if it happened again, the response would be faster.”
Personalization Tip: Highlight your communication and coordination role, not just the technical fix. Show that you made others’ jobs easier.
Describe a time you made a mistake in infrastructure. How did you handle it?
Why they ask: Everyone makes mistakes. They want to see if you own them, learn from them, and communicate effectively.
STAR Framework:
- Situation: What happened? What mistake did you make?
- Task: What was at stake?
- Action: Did you catch it? How did you inform others? What did you do to fix it?
- Result: What was the impact? What did you learn?
Sample Answer: “I once deleted a security group rule thinking it wasn’t being used, which broke database connectivity for a staging environment. I realized it immediately when I started seeing connection errors in logs. I could have quietly recreated the rule, but instead I immediately notified the team that this was my error and gave an ETA for the fix. I restored the rule (took seconds), verified connectivity, then spent time tracing what actually used that security group to understand why it was there in the first place. It turned out the documentation was outdated, so I updated it. I also added a peer-review step for security group changes so another engineer reviews deletions before they happen. It was embarrassing, but treating it transparently rather than quietly fixing it built trust with the team.”
Personalization Tip: Pick a real mistake, not a contrived one. Honesty and the lessons you drew matter more than perfection.
Tell me about a time you had to learn a new technology quickly to solve a problem.
Why they ask: Technology moves fast. They want to see if you’re resourceful and can pick up new tools when needed.
STAR Framework:
- Situation: What tool or technology was unfamiliar to you?
- Task: Why did you need to learn it urgently?
- Action: How did you approach learning? What resources did you use?
- Result: Did you successfully implement it? What’s your comfort level now?
Sample Answer: “Our company decided to migrate to Kubernetes to handle container orchestration for our microservices, but I’d only used Docker before—no Kubernetes experience. We had a three-month timeline and I was responsible for building our initial cluster. I started with online courses on Udemy and Kubernetes documentation to understand core concepts—Pods, Services, Deployments. Then I built a test cluster in AWS using EKS, deployed a sample application, and broke things intentionally to understand how to fix them. I also attended a Kubernetes workshop at a local meetup. Three months later, I had designed and deployed our first production cluster with monitoring, logging, and auto-scaling. I’m not an expert, but I’m comfortable running and troubleshooting our Kubernetes infrastructure now. The key was not trying to learn everything at once—I focused on what mattered for our use case.”
Personalization Tip: Show your learning process—courses, hands-on labs, mentors. This reveals how you’ll approach the unfamiliar.
Describe a time you had to communicate complex technical information to non-technical stakeholders.
Why they ask: Infrastructure engineers often need to explain why an upgrade costs money or why we need to do maintenance. They want to see if you can be clear without oversimplifying.
STAR Framework:
- Situation: What technical concept did you need to explain?
- Task: Who were you explaining it to and why?
- Action: How did you break it down? What analogies or visuals did you use?
- Result: Did they understand? Did they make a decision or take action?
Sample Answer: “I had to explain to our CFO why we needed to spend $200K on a disaster recovery setup that we hopefully would never use. I could have talked about RTO and RPO, but instead I framed it as insurance. I showed data on how much an hour of downtime would cost us in lost revenue and customer impact, then explained that for $200K upfront plus ongoing costs, we could recover from a regional outage in minutes instead of hours. I walked her through a scenario: if our primary data center in one region went offline, here’s what customers would experience with our current setup, and here’s what they’d experience with DR in place. I also explained that this wasn’t theoretical—it happened to a competitor last year. She approved the budget.”
Personalization Tip: Emphasize that you translated technical concepts into business impact—revenue, risk, time.
Tell me about a time you had to deal with conflicting priorities or requests from different teams.
Why they ask: Infrastructure has many stakeholders. They want to see if you can navigate trade-offs and say no diplomatically.
STAR Framework:
- Situation: What were the conflicting requests?
- Task: Why did you have to choose between them?
- Action: How did you decide? Did you communicate with both teams?
- Result: How did you resolve it?
Sample Answer: “The development team wanted a new staging environment with high specs to test load scenarios, and the security team wanted us to implement a new vulnerability scanning process that required infrastructure changes. Both were urgent, both had merit, and both would consume my time. Instead of just picking one, I sat down with both teams. Development’s staging need was actually more flexible than they initially said—they could share resources with another team’s staging. Security’s scanning was genuinely important for compliance. I proposed a phased approach: implement the security process this sprint since it was on a compliance timeline, then tackle the staging expansion next sprint once we had breathing room. Both teams understood the reasoning, and we maintained credibility by delivering both within a reasonable timeframe.”
Personalization Tip: Show that you gathered information, communicated clearly, and found win-wins rather than simply choosing a side.
Tell me about a time you improved an infrastructure process or system. What was the impact?
Why they ask: They want people who don’t just maintain infrastructure but improve it. This reveals your initiative and impact thinking.
STAR Framework:
- Situation: What was inefficient or broken?
- Task: Why did it matter enough to prioritize?
- Action: What did you change? How did you implement it?
- Result: What was the measurable impact?
Sample Answer: “We had a manual runbook for server provisioning that took 2-3 hours—selecting instance types, configuring storage, installing monitoring agents, setting up backups. This was error-prone because people would skip steps or do them differently. I automated it using Terraform and Ansible. Now, provisioning a new server is a single command. I also added guardrails—the automation enforces our tagging standards, security group configurations, and monitoring setup. The impact: new servers get provisioned in 5 minutes, configuration is consistent, and junior engineers can provision servers without fear of missing something. We’ve also reclaimed countless hours previously spent on repetitive tasks.”
Personalization Tip: Quantify the impact if possible—time saved, errors reduced, faster deployments.
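One of the guardrails mentioned above, enforcing tagging standards, amounts to a simple pre-flight validation. A minimal Python sketch; the required tag set is a hypothetical standard, not a prescription:

```python
REQUIRED_TAGS = {"owner", "environment", "cost-center"}  # hypothetical standard

def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Tags a resource still needs before provisioning proceeds.

    Automation can fail the run with this list instead of letting
    an untagged server reach production.
    """
    return sorted(required - set(resource_tags))
```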
Technical Interview Questions for Infrastructure Engineers
These questions dig deeper into specific technical domains. Rather than memorizing answers, practice the thinking frameworks.
Walk me through how you would architect a highly available web application for a startup expecting to scale from 1,000 to 1 million users over the next two years.
Why they ask: This tests your systems thinking and ability to design for scale. There’s no single “right” answer—it’s about your reasoning.
Framework for Answering:
- Start with requirements: What does the app do? Is it read-heavy or write-heavy? What’s the acceptable downtime?
- Identify components: Load balancer, application servers, database, caching, static content, monitoring.
- Address high availability: Redundancy, failover, multiple availability zones.
- Plan for scale: How will each component handle 100x growth?
- Discuss trade-offs: Cost vs. complexity, over-provisioning vs. under-provisioning.
Sample Answer: “First, I’d understand the application requirements. Assuming it’s a typical web application, I’d start simple: single load balancer routing to multiple app servers behind it, a managed database like RDS, and a CDN for static content. This handles the first phase. As we scale, I’d move the database to a multi-AZ setup with read replicas for read-heavy queries. I’d implement caching with Redis to reduce database load. I’d set up auto-scaling groups so the app tier scales automatically. For observability, I’d implement centralized logging and monitoring from day one so I can see what’s breaking before it becomes a problem. I’d also plan for database growth—eventually we might need sharding if a single database can’t handle the write volume, but I’d cross that bridge when we get there. I’d design with cost in mind—not over-provisioning upfront, but building the ability to scale incrementally. Also critical: I’d architect so we can do deployments without downtime using rolling updates and health checks.”
Personalization Tip: Ask clarifying questions to show your thinking. “Is this a social media platform or a SaaS tool? That affects my recommendations.”
Design a disaster recovery solution for a database that currently has 2TB of data and processes 100K transactions per second. The company’s RTO is 1 hour and RPO is 15 minutes.
Why they ask: DR planning requires understanding trade-offs between consistency, availability, and cost.
Framework for Answering:
- Clarify the constraints: RTO of 1 hour means failover must be automated and practiced.
- RPO of 15 minutes means data loss up to 15 minutes is acceptable; continuous replication helps.
- Consider replication strategies: Synchronous (consistent but slower), asynchronous (faster but riskier).
- Evaluate failover automation: Manual, semi-automated, or fully automated?
- Discuss testing: DR only works if you’ve tested it.
Sample Answer: “For 2TB and 100K TPS with a 1-hour RTO and 15-minute RPO, I’d use continuous asynchronous replication to a standby database in another region. The primary database streams changes to the replica continuously. If the primary fails, we can failover to the replica in minutes, well within the 1-hour RTO. The 15-minute RPO is achievable with asynchronous replication—we might lose up to 15 minutes of transactions that hadn’t yet replicated, but that’s acceptable per requirements. I’d fully automate failover detection and triggering—if the primary stops responding, a monitoring system automatically fails over to the replica. I’d also run quarterly DR drills where we actually failover to the replica, verify it’s working, and fail back to the primary. This surfaces gaps before a real disaster. I’d also document the runbook. One critical thing: after failover, I need to ensure applications reconnect to the new primary, which usually requires DNS updates or connection string changes—I’d automate that too. The cost of this setup is significant—essentially paying for two full database instances—but given the RTO and RPO requirements, it’s justified.”
Personalization Tip: Acknowledge trade-offs. “Synchronous replication would give us zero RPO but would add latency to every transaction, which isn’t acceptable at 100K TPS.”
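Guarding the 15-minute RPO in practice means alerting on replication lag before the budget is spent. A minimal Python sketch; the 80% margin is an illustrative choice:

```python
def rpo_at_risk(replication_lag_s, rpo_s=15 * 60, margin=0.8):
    """True when replication lag has consumed most of the RPO budget.

    Alerting at a fraction of the budget (80% here) leaves
    headroom to react before a failover would actually violate
    the 15-minute RPO."""
    return replication_lag_s > rpo_s * margin
```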
You notice CPU utilization is consistently at 85% on your application servers during peak hours. Walk me through how you’d diagnose and fix it.
Why they ask: This is a real problem you’ll face. They want to see your troubleshooting methodology.
Framework for Answering:
- Gather context: Is this normal for the workload? Is it causing problems, or is it merely high?
- Isolate the cause: Is it the application code, a specific process, or system overhead?
- Check application-level metrics: Request rate, request latency, error rate.
- Check system-level metrics: Process-level CPU, context switches, I/O wait.
- Develop solutions: Scale horizontally, optimize code, adjust configurations.
Sample Answer: “First, I’d gather context. Is this new, or has it always been this way? Is it causing customer impact—slow response times or errors? If it’s not causing problems, maybe 85% is acceptable and we just need to make sure it doesn’t spike higher. Assuming it’s new and causing slowdowns, I’d drill down. I’d look at application metrics—has request volume increased, or is each request using more CPU? I’d check for runaway processes using top or ps to see which process is consuming CPU. I’d also check system metrics like context switches and I/O wait. If I/O wait is high, it might not actually be the application—the server might be waiting on disk or network. Let’s say I discover a recent code change caused an inefficient database query. I’d work with the developer to optimize that query or add caching. If it’s sustained traffic growth, I might scale horizontally—add more servers behind the load balancer to distribute load. I’d also set auto-scaling policies so if CPU stays above 75% for five minutes, new servers automatically spin up. This prevents us from firefighting every spike.”
Personalization Tip: Show that you distinguish between problems that need optimization and problems that need scaling.
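The auto-scaling rule mentioned in the sample answer (“if CPU stays above 75% for five minutes, new servers automatically spin up”) boils down to a sustained-window check. A minimal sketch, assuming one CPU sample per minute; the threshold and window are the values from the answer, not a recommendation:

```python
def should_scale_up(cpu_samples, threshold=75.0, sustained_samples=5):
    """Return True when CPU has stayed above `threshold` for the last
    `sustained_samples` readings (e.g., five one-minute averages).

    Requiring a sustained window rather than a single reading keeps
    brief spikes from triggering unnecessary scale-ups.
    """
    if len(cpu_samples) < sustained_samples:
        return False
    return all(s > threshold for s in cpu_samples[-sustained_samples:])
```

In practice this logic lives in the cloud provider’s auto-scaling service (e.g., an AWS target-tracking or step-scaling policy) rather than your own code, but being able to explain the sustained-window idea shows you understand why the policy is shaped that way.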
How would you implement a CI/CD pipeline for infrastructure changes (Infrastructure as Code)? Walk me through the process.
Why they ask: This reveals your approach to safe, automated infrastructure deployments.
Framework for Answering:
- Version control: Code in Git with pull requests.
- Testing: Validate syntax, security scanning, cost estimation.
- Staging: Deploy to a staging environment for testing.
- Approval: Human review before production.
- Deployment: Automated deployment to production.
- Rollback: Ability to quickly revert if needed.
Sample Answer: “I’d store all infrastructure code in Git. Developers create pull requests for infrastructure changes. In the PR, automated checks run: terraform validate checks syntax, TFLint checks for style and best practices, and Checkov scans for security issues. This catches obvious mistakes before review. Once the PR is approved by another engineer, it’s merged to the main branch. The merge triggers a CI/CD pipeline—terraform plan runs and generates a dry run of what will change. This plan is reviewed—nobody wants surprises when deploying infrastructure. Once approved, terraform apply is executed, which actually deploys the infrastructure. All of this is logged and audited. If something goes wrong, we can roll back by reverting the commit and running apply again with the previous code. I’d also add cost estimation so we know upfront whether a change will significantly increase AWS spend. This workflow makes infrastructure changes as controlled and auditable as code deployments.”
Personalization Tip: Mention specific tools—“Using GitLab CI” or “Jenkins with the Terraform plugin”—if you have them. Show you understand the workflow even if you haven’t done the exact implementation yourself.
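The gated workflow in the answer above can be modeled as a small state machine: each stage must pass before the next runs. A minimal Python sketch, with illustrative stage names not tied to any particular CI system:

```python
# Pipeline stages in order; each must pass before the next runs.
PIPELINE_STAGES = ["validate", "lint", "security_scan", "plan",
                   "manual_approval", "apply"]

def next_action(completed):
    """Decide the pipeline's next step.

    `completed` maps stage name -> bool (True = passed) for stages run
    so far. Returns the next stage to execute, "halt" if any completed
    stage failed, or "done" when everything has passed.
    """
    for stage in PIPELINE_STAGES:
        if stage not in completed:
            return stage
        if not completed[stage]:
            return "halt"
    return "done"
```

The point of the sketch is the ordering guarantee: apply can never run unless plan and the human approval gate both passed, which is exactly what the sample answer describes.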
Your team uses Terraform to manage infrastructure. You notice drift—what the Terraform state says exists doesn’t match what’s actually in AWS. How do you handle it?
Why they ask: Drift is a real problem in IaC environments. They want to know your recovery process.
Framework for Answering:
- Understand the cause: Manual changes? Different Terraform versions? A stale state file?
- Detect drift: Terraform plan shows unexpected changes.
- Decide on remediation: Bring Terraform code into sync with reality, or re-sync state from reality.
- Prevent recurrence: Enforce Terraform-only changes or implement drift detection.
Sample Answer: “Drift happens when infrastructure changes outside of Terraform—someone manually modifies a security group in the AWS console, or autoscaling replaces a crashed instance with a different instance type. When I detect drift, I have two options. One: update the Terraform code to match reality and apply it. Two: run terraform apply to revert the out-of-band change so infrastructure matches the code again. The choice depends on what changed and whether the manual change was legitimate. If someone fixed a security group by hand during an incident, I fold that change into the Terraform code—we want Terraform to be the source of truth. If the change was unintended, I let Terraform revert or recreate the resource. To prevent drift, I prevent manual changes. I restrict IAM permissions so engineers can’t manually change production infrastructure—they have to go through Terraform. I also run terraform plan regularly, maybe daily, to detect drift early. I’d also use remote state with locking, such as Terraform Cloud, to prevent concurrent runs from causing inconsistency.”
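The scheduled drift check described in the answer is easy to script, because terraform plan supports a -detailed-exitcode flag: exit code 0 means no changes, 1 means the plan errored, and 2 means changes are pending (possible drift). A minimal sketch; the check_drift wrapper assumes terraform is on PATH and the working directory is already initialized:

```python
import subprocess

def classify_plan(exit_code):
    """Interpret `terraform plan -detailed-exitcode` results:
    0 = state matches reality, 1 = plan errored,
    2 = plan succeeded and found pending changes (possible drift)."""
    return {0: "in_sync", 1: "error", 2: "drift_detected"}.get(exit_code, "unknown")

def check_drift(workdir):
    """Run a read-only plan against a Terraform directory and classify it."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan(result.returncode)
```

In a daily cron job or pipeline, a "drift_detected" result would page the team or open a ticket rather than auto-remediating, since, as the answer notes, the right fix depends on whether the manual change was legitimate.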
**Personalization