Principal Cloud and Production Operations Engineer

Qode • California City, CA

11d

About The Position

The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications. This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments. Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Requirements

Bachelor’s degree in Computer Science, Information Systems, or a related field; Master’s preferred
10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role
Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management
Proven experience managing production-scale environments supporting mission-critical applications and services
Strong proficiency in:
    Infrastructure-as-code (Terraform, CloudFormation)
    CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)
    Container orchestration (Kubernetes, Docker)
    Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)
    Scripting and automation (Python, Bash, PowerShell)
Solid understanding of security, compliance, and networking principles in hybrid environments
Exceptional analytical, problem-solving, and incident management skills
Demonstrated ability to lead complex, cross-functional initiatives from concept to execution

Nice To Haves

Experience in high-availability SaaS or networking environments
Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks
Familiarity with Zero Trust, identity federation, and cloud access security models
Exposure to AI/ML infrastructure or data-driven pipelines

Responsibilities

Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines
Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments
Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency
Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals
Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems
Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response
Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews
Drive root cause analysis, performance tuning, and continuous improvement of production services
Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases
Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards
Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk
Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet
Establish and enforce operational best practices for monitoring, patching, and change management across production systems
Lead production readiness reviews for new releases and large-scale changes
Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements
Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution
Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence
Lead architectural reviews, design sessions, and capacity planning discussions
Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies