Infrastructure Engineer

Sight Machine, San Francisco, CA

About The Position

Most infrastructure roles ask you to maintain what exists. This one asks you to rethink how it should work. Sight Machine runs AI and analytics systems on top of some of the most demanding industrial data on the planet: millions of sensor events per minute, dozens of interconnected machines, production environments where downtime has real consequences. The infrastructure holding all of that together is not a solved problem. It is a living system, and we are looking for an engineer who sees that as an opportunity rather than a burden.

As an Infrastructure Engineer, you will own the systems that deploy, monitor, and operate Sight Machine's cloud platform across a global customer base. You will be deep in Kubernetes, Terraform, and CI/CD pipelines, and you will bring AI into that work not as an afterthought but as a core part of how you reduce toil, accelerate automation, and stay ahead of failure. We are not looking for someone who will hold the line. We are looking for someone who will move it.

This is a role for an experienced infrastructure engineer who has made real mistakes in production, learned from them, and built the instincts that only come from that. If you have also been pushing AI tools into your workflow in ways that actually change what you can ship, we want to talk.

What You’ll Actually Work On

In your first year, you can expect to work on problems like these:

  • Owning and evolving our Kubernetes-based cloud infrastructure across Azure and other providers, including fleet management, networking, and cluster operations at scale.
  • Designing and implementing CI/CD pipelines that let the engineering team ship faster and with more confidence, including automated testing, progressive delivery, and rollback capability.
  • Building AI-assisted automation for operational tasks: runbook generation, anomaly triage, alerting logic, and anywhere else we can eliminate repetitive human intervention without sacrificing control.
  • Driving Infrastructure as Code discipline across the platform (Terraform, Helm, FluxCD) so that every environment is reproducible, auditable, and fast to recover.
  • Building and maintaining monitoring and observability infrastructure that gives the team real signal across our stack, from container health to database performance to customer-facing SLAs.
  • Participating in on-call rotation and using every incident as a forcing function to improve the system: better runbooks, better alerting, better automation.
  • Collaborating closely with Development Engineering to close the gap between what gets built and what gets operated well in production.

You will work across a mix of mature systems and active greenfield development. Both require care. We want engineers who can operate what exists reliably while finding the leverage points to make it better.

What We’re Looking For

We care more about what you have operated than where you have worked. Here is what actually matters:

Requirements

  • 5+ years of professional infrastructure or DevOps engineering experience, with at least some of that at meaningful scale in a cloud-native environment.
  • Deep hands-on experience with Kubernetes and Docker in at least one major cloud provider (Azure, GCP, AWS). You have run clusters in production and have the scars to prove it.
  • Strong IaC fluency with Terraform, Helm, FluxCD, or similar. You write infrastructure the way developers write code: versioned, reviewed, and tested.
  • Real fluency with AI development tools. Not just autocomplete. You have used AI to write automation scripts, draft runbooks, accelerate incident triage, or build internal tooling. Show us how it has actually changed your output.
  • Solid coding ability in at least one scripting or systems language (Python, Go, or similar). You write tools, not just configs.
  • Strong Linux fundamentals and a working knowledge of networking: TCP/IP, DNS, load balancing, and how things break when they should not.
  • Experience with monitoring and alerting stacks: Prometheus, Sentry, Opsgenie, or equivalent. You build observability that gives people real signal, not noise.
  • A track record of on-call participation and a philosophy around incident response that leads to improvement, not just resolution.
  • Clear, direct communication. You can write a postmortem, a runbook, or a design doc that people actually read.
  • A bias for action. You have made decisions under uncertainty, taken the risk, and adjusted when you were wrong. Endless planning is not your style.

Nice To Haves

  • Familiarity with our current stack: Kubernetes, FluxCD, Terraform, Helm, Prometheus, Elasticsearch, Kafka, PostgreSQL, Jenkins.
  • Experience with Python and Java in the context of platform tooling or automation.
  • Prior work in industrial IoT, manufacturing, or operational technology environments.
  • Experience managing infrastructure for multi-tenant SaaS platforms.
  • An active GitHub or open-source presence that shows how you approach technical problems when no one is watching.
