About The Position

The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications. This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments. Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or related field; Master’s preferred
  • 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role
  • Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management
  • Proven experience managing production-scale environments supporting mission-critical applications and services
  • Strong proficiency in:
      • Infrastructure-as-code (Terraform, CloudFormation)
      • CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)
      • Container orchestration (Kubernetes, Docker)
      • Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)
      • Scripting and automation (Python, Bash, PowerShell)
  • Solid understanding of security, compliance, and networking principles in hybrid environments
  • Exceptional analytical, problem-solving, and incident management skills
  • Demonstrated ability to lead complex, cross-functional initiatives from concept to execution

Nice To Haves

  • Experience in high-availability SaaS or networking environments
  • Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks
  • Familiarity with Zero Trust, identity federation, and cloud access security models
  • Exposure to AI/ML infrastructure or data-driven pipelines

Responsibilities

  • Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines
  • Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments
  • Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency
  • Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals
  • Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems
  • Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response
  • Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews
  • Drive root cause analysis, performance tuning, and continuous improvement of production services
  • Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases
  • Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards
  • Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk
  • Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet
  • Establish and enforce operational best practices for monitoring, patching, and change management across production systems
  • Lead production readiness reviews for new releases and large-scale changes
  • Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements
  • Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution
  • Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence
  • Lead architectural reviews, design sessions, and capacity planning discussions
  • Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies