Become a part of our caring community and help us put health first

Humana is seeking an experienced Senior Tech Lead for our Site Reliability Engineering (SRE) team. This role will champion the reliability, scalability, and performance of our critical systems. The ideal candidate will demonstrate strong technical leadership, mentor team members, and collaborate across engineering and business units to drive best practices in reliability and DevOps.

Job Role + Responsibilities:
Lead SRE team initiatives focused on system reliability, automation, and operational excellence.
Architect and implement solutions to enhance the availability, performance, and scalability of cloud and on-premises services.
Oversee incident management processes, ensuring timely response and thorough root cause analysis.
Develop and refine monitoring, alerting, and reporting frameworks that provide actionable insights into service health.
Guide adoption of Infrastructure as Code (IaC) and CI/CD pipelines to streamline deployments and reduce risk.
Collaborate with software engineering and product teams to integrate reliability requirements into design and development.
Mentor engineers on SRE principles, fostering a culture of continuous improvement and operational rigor.
Establish service level objectives (SLOs), service level indicators (SLIs), and error budgets in partnership with stakeholders.
Manage on-call rotations, ensuring effective coverage and knowledge sharing.
Document architectural decisions, operational procedures, and incident retrospectives.
Operational Excellence for AI Systems: Identify AI/ML use cases, and influence and implement SRE best practices, including SLIs/SLOs for ML workloads, automated remediation, and capacity modeling.
Observability & Monitoring for ML: Define and implement monitoring strategies for model drift, data anomalies, pipeline failures, system performance, and user experience.
Key responsibilities of this role include:
Proactively identifying and mitigating risk during deployments to ensure system stability.
Ensuring long-term stability by managing technical debt.
Maintaining observability and performance of critical pharmacy applications.
Supporting timely restoration of services during outages, with 24/7 coverage to meet enterprise Service Level Agreements (SLAs).
Driving incident response and root cause analysis to prevent recurrence and improve system resilience.
Driving operational excellence for AI systems.

Use your skills to make an impact
Job Type: Full-time
Career Level: Mid Level