About The Position

We are seeking a highly skilled and proactive Senior Specialist, Site Reliability Engineering (SRE) to help drive reliability, scalability, and performance across our critical platforms. This role is ideal for a senior-level engineer who combines deep technical expertise with a passion for automation, observability, and operational excellence. As a Senior Specialist, you'll work on complex reliability challenges, lead technical initiatives, and collaborate across engineering, product, and infrastructure teams to ensure our systems are resilient and efficient.

Requirements

  • 7+ years of experience in SRE, DevOps, or infrastructure engineering.
  • Deep expertise in cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
  • Strong proficiency in observability tools (e.g., Prometheus, Grafana, Splunk) and CI/CD pipelines.
  • Proven track record of solving complex reliability challenges in distributed systems.
  • Excellent communication and collaboration skills.
  • Experience with Python, PowerShell, or a similar scripting language.
  • Active use of artificial intelligence (AI) tools and platforms to streamline workflows, improve decision-making, and drive innovation across business functions.
  • Curiosity and adaptability in exploring emerging AI technologies, with a mindset for continuous learning and experimentation.

Nice To Haves

  • Experience in regulated or high-availability environments (e.g., financial services, healthcare).
  • Familiarity with chaos engineering, performance tuning, and capacity planning.
  • Background in software development with strong coding skills (e.g., Python, Go, Bash).

Responsibilities

  • Reliability Engineering
      • Architect and implement solutions to improve system reliability, scalability, and performance.
      • Define and manage SLIs/SLOs and error budgets across services.
      • Lead efforts to automate operational tasks and improve system observability.
  • Incident Management & Root Cause Analysis
      • Serve as a technical lead during major incidents and drive resolution.
      • Conduct deep root cause analyses and implement long-term fixes.
      • Champion blameless postmortems and continuous improvement.
  • Technical Leadership
      • Lead cross-functional reliability initiatives and mentor junior engineers.
      • Influence system design and architecture to embed reliability from the ground up.
      • Collaborate with software engineers to optimize deployment pipelines and infrastructure.
  • Monitoring & Tooling
      • Enhance observability through metrics, logging, and tracing.
      • Develop and maintain dashboards, alerts, and automated recovery systems.

Benefits

  • Competitive total rewards (base salary + bonus, if applicable)
  • Customizable benefits package (3 medical plans with Health Savings Account company match)
  • Generous paid time off: non-exempt team members start with 3 weeks plus 13 paid holidays (including 2 personal floating holidays); exempt team members receive flexible time off plus 13 paid holidays
  • Paid parental leave (including maternity + paternity leave)
  • Education assistance opportunities and free LinkedIn Learning access
  • Free mental health and family planning programs, including adoption assistance and fertility support
  • 401(k) program with company match
  • Pet insurance
  • Employee resource groups

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Professional, Scientific, and Technical Services
  • Education Level: No Education Listed
  • Number of Employees: 1,001-5,000
