Engineer, SRE GenAI

T-Mobile, Bellevue, WA
10d

About The Position

As an Engineer in Site Reliability Engineering (SRE) for AI Systems, you will help ensure the reliability, scalability, and performance of AI platforms. This role includes participating in on-call rotations, improving system observability, and supporting operations across cloud-native infrastructure. This is a hands-on role, ideal for someone with foundational SRE skills and a growth mindset who wants to expand into GenAI and LLM infrastructure operations. We pride ourselves on encouraging a culture of innovation, advocating for agile methodologies, and promoting transparency in all that we do. Join us in embodying the spirit of the 'Un-carrier' and make a tangible impact! Our team is dynamic, no two days are the same, and we are diverse, inclusive, and passionate about growth and transformation. If you're up to the challenge, apply today!

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field (Required)
  • 2–4 years of experience in DevOps, SRE, or cloud platform engineering.
  • Hands-on experience with monitoring/logging systems such as Prometheus, Grafana, Splunk, or OpenSearch.
  • Familiarity with cloud environments (preferably Azure; AWS/GCP a plus).
  • Experience in scripting or automation using Python, Bash, or PowerShell.
  • Basic understanding of containerization (Docker, Kubernetes) and CI/CD concepts.
  • Willingness to participate in an on-call schedule and incident resolution.
  • Strong problem-solving and root cause analysis skills.
  • Communication (Required)
  • Customer Service (Required)
  • Analytics (Required)
  • Technical Writing (Required)
  • At least 18 years of age
  • Legally authorized to work in the United States

Nice To Haves

  • Exposure to AI/ML infrastructure or LLM-based systems (e.g., OpenAI, ChatGPT, Azure OpenAI).
  • Experience with infrastructure-as-code tools like Terraform or ARM templates.
  • Familiarity with LLM observability or API token usage metrics.
  • Passion for learning AI reliability practices and collaborating with cross-functional teams.

Responsibilities

  • Participate in on-call rotations to support AI platforms and respond to production incidents with urgency and precision.
  • Monitor system health and performance using tools like Grafana, Splunk, and Power BI.
  • Support cloud-native infrastructure deployments, with a focus on Azure (primary), and exposure to AWS or GCP.
  • Implement runbooks and automate repetitive operational tasks to reduce toil.
  • Support CI/CD pipelines and IaC deployments using GitLab pipelines and Databricks.
  • Assist in the development and enforcement of Service Level Objectives (SLOs) and real-time alerts for AI APIs and services.
  • Collaborate with senior engineers to improve platform reliability and scale LLM-based applications.

Benefits

  • Employees enjoy multiple wealth-building opportunities through our annual stock grant, employee stock purchase plan, 401(k), and access to free, year-round money coaches.
  • Employees in regular, non-temporary roles are eligible for an annual bonus or periodic sales incentive or bonus, based on their role.
  • Most Corporate employees are eligible for a year-end bonus based on company and/or individual performance, set at a percentage of the employee’s eligible earnings in the prior year.
  • Benefits include medical, dental, and vision insurance, a flexible spending account, 401(k), employee stock grants, employee stock purchase plan, paid time off and up to 12 paid holidays (which total about 4 weeks for new full-time employees and about 2.5 weeks for new part-time employees annually), paid parental and family leave, family building benefits, back-up care, enhanced family support, childcare subsidy, tuition assistance, college coaching, short- and long-term disability, voluntary AD&D coverage, voluntary accident coverage, voluntary life insurance, voluntary disability insurance, and voluntary long-term care insurance.
  • Eligible employees can also receive mobile service and home internet discounts, pet insurance, and access to commuter and transit programs.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Number of Employees

5,001-10,000 employees