Senior Tech Lead – SRE

Humana

20h•Remote

About The Position

Become a part of our caring community and help us put health first.

Humana is seeking an experienced Senior Tech Lead for our Site Reliability Engineering (SRE) team. This role champions the reliability, scalability, and performance of our critical systems. The ideal candidate will demonstrate strong technical leadership, mentor team members, and collaborate across engineering and business units to drive best practices in reliability and DevOps.

Requirements

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
7+ years of relevant experience in SRE, DevOps, or software engineering, including 2+ years in a technical leadership role.
Minimum 5 years' relevant experience with Python, PySpark, Azure Databricks, Snowflake, SQL, Oracle, PostgreSQL, file transfer, REST APIs, and Kafka.
Proficiency with cloud platforms (AWS, Azure, GCP), container orchestration, and automation tools.
Strong scripting and programming skills (e.g., Python, Go, Bash).
Deep understanding of distributed systems, networking, and security principles.
Proven experience leading large-scale incident response and postmortem processes.
Excellent communication and stakeholder management skills.
Experience building automation around CI/CD (Azure DevOps (ADO) YAML pipelines), testing, and validation.

Nice To Haves

Experience in regulated industries (healthcare, finance, etc.).
Certifications in cloud or DevOps technologies.
Familiarity with observability tools (Datadog, Prometheus, Grafana, etc.).

Responsibilities

Lead SRE team initiatives focused on system reliability, automation, and operational excellence.
Architect and implement solutions to enhance availability, performance, and scalability of cloud and on-premises services.
Oversee incident management processes, ensuring timely response and thorough root cause analysis.
Develop and refine monitoring, alerting, and reporting frameworks; ensure actionable insights for service health.
Guide adoption of Infrastructure as Code (IaC) and CI/CD pipelines to streamline deployments and reduce risk.
Collaborate with software engineering and product teams to integrate reliability requirements into design and development.
Mentor engineers on SRE principles, fostering a culture of continuous improvement and operational rigor.
Establish service level objectives (SLOs), service level indicators (SLIs), and error budgets in partnership with stakeholders.
Manage on-call rotations, ensuring effective coverage and knowledge sharing.
Document architectural decisions, operational procedures, and incident retrospectives.
Operational excellence for AI systems: identify AI/ML use cases, and influence and implement SRE best practices, including SLIs/SLOs for ML workloads, automated remediation, and capacity modeling.
Observability and monitoring for ML: define and implement monitoring strategies for model drift, data anomalies, pipeline failures, system performance, and user experience.
Identify and mitigate risks proactively during deployments to ensure system stability.
Ensure long-term stability by managing technical debt.
Maintain observability and performance of critical pharmacy applications.
Support timely restoration of services during outages, with 24/7 coverage to meet enterprise Service Level Agreements (SLAs).
Drive incident response and root cause analysis to prevent recurrence and improve system resilience.

Benefits

Humana provides medical, dental and vision benefits, 401(k) retirement savings plan, time off (including paid time off, company and personal holidays, volunteer time off, paid parental and caregiver leave), short-term and long-term disability, life insurance and many other opportunities.
