About The Position

This role focuses on designing, building, and operating highly available cloud applications and distributed systems that power intelligent spaces worldwide. You will lead efforts to improve system reliability, scalability, and operational efficiency while enabling engineering teams to adopt cloud capabilities safely and effectively. The position emphasizes proactive reliability improvements, incident response leadership, and the development of reusable cloud patterns, automation, and guardrails. Working with product, platform, QA, and security teams, you will influence engineering practices and ensure observability, resilience, and performance. This is a remote role on a highly distributed team whose culture values trust, ownership, and customer impact. Occasional travel may be required to support business needs.

Requirements

  • 5+ years of experience in software engineering or SRE/DevOps supporting production systems.
  • Hands-on programming experience in C#, Python, Java, or Go for tooling and automation.
  • Deep expertise in at least one major cloud platform, with Azure strongly preferred.
  • Experience with Kubernetes and containerized workloads (AKS, Helm, Kustomize).
  • Proficiency with Infrastructure as Code using Bicep or Terraform in complex cloud environments.
  • Familiarity with CI/CD pipelines, GitOps workflows, and change/release management practices.
  • Strong knowledge of observability platforms and monitoring strategies (e.g., Datadog, Prometheus).
  • Proven ability to lead incident response, perform root cause analysis, and implement systemic fixes.
  • Excellent communication, collaboration, and problem-solving skills with the ability to influence engineering teams.
  • Curiosity, adaptability, and a habit of continuously improving systems.

Nice To Haves

  • Experience with IoT platforms or distributed device networks.
  • Knowledge of disaster recovery, security, and compliance.
  • Experience defining SLAs, SLOs, and SLIs with product teams.
  • Familiarity with FinOps principles and cloud cost optimization strategies.

Responsibilities

  • Design, deploy, and operate reliable, scalable, and secure cloud infrastructure to support distributed applications.
  • Monitor service availability, performance, and operability, ensuring compliance with defined SLAs and SLOs.
  • Lead incident response, blameless postmortems, and continuous reliability improvements across systems.
  • Partner with development teams to implement cloud standards, guardrails, and automation to improve engineering velocity.
  • Define and evolve observability practices, metrics, and monitoring strategies for proactive issue resolution.
  • Collaborate with cross-functional teams including product, platform, QA, and security to ensure seamless deployment and operation.
  • Advocate for reliability, resiliency, and best practices throughout the software lifecycle.

Benefits

  • Competitive annual salary: $120,800 – $161,000, depending on experience and location.
  • Comprehensive healthcare, dental, and vision coverage.
  • 401(k) retirement plan with company contributions.
  • Opportunity for incentive-based compensation depending on role.
  • Flexible remote work environment with a supportive, distributed team culture.
  • Professional development opportunities and exposure to cutting-edge cloud technologies.