Lead Site Reliability Engineer - Data Platforms

Jobgether
71d
$125,000 - $162,000

About The Position

We are seeking a Lead Site Reliability Engineer to manage and optimize data platform operations, ensuring high availability, scalability, and performance. In this role, you will oversee cloud infrastructure, end-to-end data pipelines, and containerized applications while collaborating closely with data science, ML/GenAI, and development teams. You will implement Infrastructure as Code, monitor systems for observability, and troubleshoot complex issues to maintain operational excellence. The ideal candidate thrives in a fast-paced environment, embraces automation, and drives innovation across cloud and data platforms. This role combines hands-on technical expertise with strategic system design, delivering measurable impact on business-critical data workflows.

Requirements

  • 8+ years of experience with Big Data technologies, data pipelines, and Linux administration.
  • Strong scripting proficiency in Bash or Python.
  • 5+ years managing cloud platforms (AWS, Azure) with hands-on experience in ECS, EKS, AKS, Terraform, and Helm.
  • Experience with Infrastructure as Code, configuration management and CI/CD tools (Chef, Ansible, Jenkins), and version control systems (Git).
  • Familiarity with Generative AI platforms (SageMaker, Bedrock, Azure ML) and vector databases.
  • Solid knowledge of networking (DNS, load balancers), MySQL, Apache Spark, and BI/data lake platforms.
  • Excellent communication skills; self-driven and capable of independently resolving complex issues and delivering projects on time.
  • Strong interest in AI technologies and continuous improvement of operational practices.

Responsibilities

  • Manage cloud-based infrastructure, including AWS services (S3, EMR, Redshift) and containerized environments (ECS, Docker), to support data pipelines and ML/GenAI workloads.
  • Design, deploy, and maintain automated infrastructure using tools like Terraform, Chef, Ansible, and CI/CD pipelines.
  • Monitor and enhance observability across data systems, applications, and platforms.
  • Collaborate with engineering and ML teams to optimize the performance, reliability, and scalability of data and AI systems.
  • Participate in code/design reviews, troubleshoot complex system issues, and document root cause analyses (RCAs).
  • Support release planning, on-call rotation, and problem resolution to ensure uninterrupted data operations.

Benefits

  • Competitive salary ($125,000 - $162,000), based on experience, skills, and location.
  • Fully remote work with flexibility to balance personal and professional life.
  • Comprehensive healthcare and benefits package.
  • Opportunities to work with cutting-edge Big Data, ML, and GenAI technologies.
  • Professional growth through collaboration with cross-functional global teams.
  • Supportive, inclusive, and innovation-driven company culture.