IT Infrastructure Operations Engineer - Lead

Astreya • Atlanta, GA

About The Position

This role focuses on building and operating highly reliable infrastructure and automation that support physical security systems. The Infra Automation Engineer applies Site Reliability Engineering (SRE) principles such as SLIs, SLOs, error budgets, and toil reduction to improve system resilience and operational efficiency. The role works closely with Google leadership to deliver secure, scalable, and automated infrastructure.

Requirements

Experience: 8+ years in Site Reliability Engineering or Infrastructure Engineering, with a minimum of 3 years in a technical leadership role managing engineering teams
Technical Proficiency: Strong hands-on experience with Linux/Windows server administration, Infrastructure-as-Code tools (Terraform, Ansible, Chef, Puppet), and scripting languages (Python, Bash, PowerShell)
SRE Practices: Deep understanding of SRE principles, including SLIs, SLOs, error budgets, and toil elimination, plus experience implementing observability stacks (Prometheus, Grafana, ELK, or Datadog)
Incident Management: Proven track record in leading incident response, conducting blameless post-mortems, and driving systemic reliability improvements across complex infrastructure environments
Networking & Security: Solid understanding of networking fundamentals, Cisco device administration, and experience with network automation protocols (NETCONF/RESTCONF) and security compliance frameworks
Leadership & Communication: Excellent communication and stakeholder management skills with demonstrated ability to mentor teams, manage backlogs, and balance competing priorities in

Responsibilities

Lead, mentor, and manage a team of Automation Engineers, fostering a culture of ownership, collaboration, and continuous improvement
Partner with client IT leadership to define, implement, and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical infrastructure services
Manage error budgets to maintain an optimal balance between feature development velocity and system stability
Act as the primary escalation point for severe incidents (Sev 1/2) and ensure effective incident management, communication, and resolution
Facilitate blameless post-mortem analysis for all major incidents and drive systemic improvements in tooling and infrastructure resilience
Manage the team's project backlog, prioritize work, and ensure a balance between reliability engineering and toil reduction initiatives
Drive automation strategies to reduce manual operational tasks with measurable targets (e.g., 50% reduction in manual server build time)
Oversee Infrastructure-as-Code implementations using Ansible, Terraform, Puppet, or Chef for configuration management and drift remediation
Ensure robust observability through standardized monitoring, alerting, and centralized logging across all managed infrastructure
Manage 24x5 on-call rotations and ensure adequate team coverage for incident response and support
Collaborate with cross-functional stakeholders to gather requirements, define project scope, and integrate SRE practices into existing workflows
Drive Mean Time To Repair (MTTR) reduction through automation, improved runbooks, and proactive reliability engineering

Benefits

Medical provided through UHC (PPO, HSA, Surest options); Medical provided through Kaiser (HMO option only) for California employees only
Dental provided through UHC Nationwide
Vision provided by UHC
Flexible Spending Account for Health & Dependent Care
Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
Continuing Education and Professional Development via various integrated platforms, e.g., Udemy and Coursera
Corporate Wellness Program provided by Goomi Group
Employee Assistance Program
Wellness Days
401k Plan
Basic and Supplemental Life Insurance
Short Term & Long Term Disability
Critical Illness, Critical Hospital, and Voluntary Accident Insurance
Tuition Reimbursement (available 6 months after start date, capped)
Paid Time Off (accrued and prorated, maximum of 120 hours annually)
Paid Holidays
Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law