Senior Software Engineer, Infrastructure

Afresh

Remote

About The Position

As a Senior Software Engineer, Infrastructure, you will be a key member of Afresh’s Infrastructure engineering team. You will build and improve the infrastructure and tooling that helps our service-owning teams ship reliably, operate safely, and move quickly.

Requirements

5+ years of relevant software engineering experience (or equivalent experience)
Cloud infrastructure — You have operated and maintained mission-critical cloud infrastructure with high uptime. You can design and implement scalable infrastructure (Azure preferred, but AWS/GCP are also fine), including core cloud networking (VPC/VNet design, routing, DNS, load balancing, and connectivity), and you can build improvements that make it easier for service owners to manage their own systems.
Incident response / disaster recovery — You have led or played a key role in high-severity production incidents. You can troubleshoot complex issues, restore service, and communicate clearly with stakeholders. You write and maintain runbooks and playbooks to reduce MTTR and avoid reliance on individual subject-matter experts.
Proficiency in Terraform — Strong experience writing, maintaining, and operating production Terraform codebases.
Proficiency in at least one general-purpose programming language — Beyond IaC, you can solve problems by writing and maintaining production code (Python preferred, but others are fine).
Proficiency with Kubernetes — You can operate and troubleshoot workloads in a Kubernetes cluster and help maintain what we have in place.
AI-assisted development — You actively use AI coding assistants and are comfortable integrating LLM-based tooling into infrastructure workflows (code generation, log analysis, runbook automation). You stay curious about new AI capabilities and look for practical ways to apply them.
Startup mindset — You prioritize effectively, stay focused on impact, and are comfortable with ambiguity. You can make progress quickly while maintaining a high bar for quality and reliability.
Relentless delivery focus — You make commitments and deliver. You surface risks early, communicate clearly when tradeoffs are needed, and keep momentum without relying on personal heroics.
Collaborative teammate — You build strong working relationships, incorporate feedback, and help unblock others. You provide mentorship through code reviews, pairing, and sharing context, and you raise the bar through example.
Strong self-management — You invest in your own growth, maintain healthy boundaries, and use time off appropriately.
Project leadership — You can drive a project or well-scoped initiative: align with partners on requirements and success criteria, write or contribute to technical designs, and coordinate execution through launch.
Customer partnership — You communicate well with partner teams, seek to understand their needs, validate solutions early, and know when and how to push back constructively when tradeoffs are required.

Nice To Haves

Experience implementing automation to reduce manual intervention.

Responsibilities

Own and deliver infrastructure projects end-to-end, from problem definition and technical design through implementation, rollout, and iteration
Build and improve platform primitives that make it easier for service teams to deploy, operate, and debug their services
Improve observability and operational readiness so we can detect issues early, reduce time-to-recovery, and prevent repeat incidents
Identify and implement cost and performance improvements across our cloud infrastructure and developer tooling
Work closely with Security to implement practical security controls and protect sensitive data (for example, least-privilege access, secret management, and network controls)
Participate in our on-call rotation and continuously improve monitoring and alerting to maintain a low page rate
Stay current on infrastructure best practices and evaluate improvements with a pragmatic, impact-focused mindset
