Senior CloudOps Engineer

CloudZero, Boston, MA

About The Position

CloudZero is growing fast. Our customer base is expanding, the data challenges we're solving are getting more complex, and the platform is scaling to match. As a CloudOps Engineer you'll be a force multiplier for our engineering organization, owning the performance, reliability, and observability of CloudZero's infrastructure and empowering teams to ship features that help customers understand and optimize their cloud spend.

This is real infrastructure work at real scale, not a ticket-closing role or a console-clicking job. CloudZero processes billions of events daily across AWS, Azure, and GCP. Our customers rely on real-time, accurate cost data to make business-critical decisions, and any instability in our system impacts their planning. Built entirely on a unique serverless architecture with no EC2 instances or containers, our platform demands infrastructure that scales gracefully, fails predictably, and recovers automatically.

If you thrive on hard operational problems, care deeply about reliability and performance, and want to see your work matter to customers in direct and measurable ways, this role was built for you.

Requirements

  • 3 to 5+ years of experience building and operating distributed systems in AWS
  • Strong skills in Python and Infrastructure as Code using Pulumi or Terraform
  • Hands-on experience with monitoring tools such as Prometheus or Datadog
  • Proven ability to debug production issues under pressure
  • Values thoughtful, reliable system design over reactive hero efforts
  • Strong documentation habits to support long-term team clarity and system stability
  • Ability to clearly explain complex technical issues to non-technical stakeholders
  • Excited to take ownership of infrastructure and solve operational challenges at scale

Nice To Haves

  • Experience with frontier AI models such as Claude, Codex, or Gemini

Responsibilities

  • Design and maintain Pulumi modules that provision reliable, cost-efficient cloud resources
  • Own infrastructure end to end with no clicking through consoles
  • Instrument systems so that failures surface quickly and debugging happens with data, not guesswork
  • Build observability into everything so you know about problems before customers do
  • Automate deployments, scaling, backups, and limit changes; if humans are doing it repeatedly, build a system to do it instead
  • Balance automation intelligently, building solutions to real problems rather than automating for its own sake
  • Help teams design resilient services, review architectures for operational complexity, and build deployment pipelines that enable safe and fast shipping
  • Optimize for cost and performance; CloudZero's business is helping others optimize cloud costs, and we should be exemplars of efficient cloud usage ourselves