About The Position

The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications. This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments. Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or a related field; Master’s preferred.
  • 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role.
  • Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management.
  • Proven experience managing production-scale environments supporting mission-critical applications and services.
  • Strong proficiency in:
      • Infrastructure-as-code (Terraform, CloudFormation).
      • CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD).
      • Container orchestration (Kubernetes, Docker).
      • Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK).
      • Scripting and automation (Python, Bash, PowerShell).
  • Solid understanding of security, compliance, and networking principles in hybrid environments.
  • Exceptional analytical, problem-solving, and incident management skills.
  • Demonstrated ability to lead complex, cross-functional initiatives from concept to execution.

Nice To Haves

  • Experience in high-availability SaaS or networking environments.
  • Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks.
  • Familiarity with Zero Trust, identity federation, and cloud access security models.
  • Exposure to AI/ML infrastructure or data-driven pipelines.

Responsibilities

Cloud Architecture and Engineering

  • Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines.
  • Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments.
  • Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency.
  • Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals.

Production Operations and Reliability

  • Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems.
  • Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response.
  • Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews.
  • Drive root cause analysis, performance tuning, and continuous improvement of production services.

Automation and CI/CD Enablement

  • Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases.
  • Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards.
  • Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk.
  • Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet.

Operational Governance and Collaboration

  • Establish and enforce operational best practices for monitoring, patching, and change management across production systems.
  • Lead production readiness reviews for new releases and large-scale changes.
  • Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements.
  • Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution.

Leadership and Mentorship

  • Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence.
  • Lead architectural reviews, design sessions, and capacity planning discussions.
  • Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies.