Principal DevOps Engineer

Iridium Satellite, LLC • Tempe, AZ

Hybrid

About The Position

We are seeking a highly skilled Principal DevOps Engineer to lead the strategy, design, and evolution of DevOps practices supporting our cloud-native Open RAN and 4G/5G Core network. In this role, you will set the technical direction for CI/CD, infrastructure-as-code, automation, and observability frameworks that enable reliable, scalable operations across Core, RAN, Transport, and Cloud domains. You will define and implement greenfield CI/CD pipelines, establish standardized automation and monitoring approaches, and create advanced telemetry, alerting, and automated remediation capabilities. Through close partnership with NOC Operations, Engineering, Cloud, Development, and Test teams, you will help drive operational excellence, reduce Mean Time to Repair (MTTR), and minimize alert fatigue. As a technical leader within the Gateway organization, you will provide governance, best practices, and hands-on expertise to teams across global time zones. The ideal candidate brings deep experience with cloud-native architectures, Kubernetes, CI/CD, telemetry pipelines, and infrastructure-as-code, along with familiarity with telecom network environments and Agile practices.

Requirements

Bachelor’s degree in Engineering, Computer Science, Telecommunications, or related field
10+ years of experience in DevOps, Site Reliability Engineering, or network automation roles supporting cloud‑native environments
Strong proficiency with CI/CD pipeline management, Infrastructure-as-Code frameworks, and containerized deployments
Hands-on experience with Kubernetes (EKS and on-prem K8s) and Docker-based cloud-native network functions (CNFs)
Proficiency with AWS cloud services
Advanced Python scripting skills, with additional experience in Bash or Go
Experience building Grafana dashboards, alerting logic, and observability workflows
Familiarity with Kafka-based event streaming architectures
Strong Linux system administration skills
Strong understanding of telecom architecture, including 4G EPC, 5G Core, IMS, and Open RAN
Experience integrating and operationalizing probe-based observability solutions (e.g., Viavi)
Deep understanding of monitoring concepts, including metrics, logs, traces, and APM
Excellent communication skills, with the ability to present products, deliverables, analyses, and issues clearly and confidently, and to adapt your communication style to different audiences
Ability to analyze a situation or problem, generate effective solutions, and see those solutions through to completion
Creativity and resourcefulness to make reliable decisions and determine methods on new assignments
Ability to thrive in a dynamic environment by handling multiple tasks and managing shifting priorities
Proactive approach to sharing what you have learned with others

Nice To Haves

Experience supporting Mavenir 4G/5G Core in production
Knowledge of the SIP, Diameter, GTP, HTTP/2, and PFCP protocols
Experience with Prometheus, ELK stack, or OpenTelemetry
CI/CD experience (GitLab, Jenkins, ArgoCD)
Kubernetes certification (CKA/CKAD)
AWS certifications
Experience building closed-loop automation for telecom NOCs

Responsibilities

Lead the design and implementation of CI/CD pipelines supporting cloud-native and G-RAN deployments
Manage Kubernetes environments (EKS and on-prem) by monitoring CNF health, automating scaling policies, and optimizing resource allocation
Implement Infrastructure-as-Code solutions using Terraform and Ansible to deploy and maintain monitoring and observability stacks
Integrate observability platforms and tools into operational workflows to strengthen visibility and diagnostic capabilities
Design and enhance observability frameworks, including Grafana dashboards and alert correlation, health checks and backups, Core CDR dashboards (IMS and Packet Core), Viavi probe integrations, and SolarWinds telemetry feeds
Build unified dashboards that provide national‑level visibility and real‑time health insights
Optimize alarm thresholds and event correlation to reduce false positives and alert storms
Implement structured logging, metrics, and distributed tracing for cloud‑native network functions
Develop automation using Python, Bash, or Go to auto-triage common alarms, perform health validations, and trigger corrective actions and workflows
Build event‑driven automation using Kafka feeds from Mavenir and Gatehouse OSS systems
Implement automated remediation for common failure scenarios (e.g., pod restarts, resource exhaustion, signaling retries) to reduce manual NOC intervention
Reduce manual NOC intervention through closed-loop automation
Implement Infrastructure as Code (Terraform/Ansible) for monitoring stack deployments
Integrate observability tools into DevSecOps workflows
Support Major Incident Management by providing telemetry insights, automated diagnostics, and post‑incident analyses
Perform post-incident analysis using logs, traces, and performance metrics
Drive improvements that reduce MTTD and MTTR
Partner with Core, RAN, Transport, and Cloud engineering teams to prevent recurring issues through root‑cause analysis
Mentor junior DevOps and NOC engineers in automation, observability, and DevOps best practices
Develop reusable automation frameworks and operational standards
Document playbooks, reference architectures, and best‑practice patterns to mature operations from reactive to predictive
Participate in on-call rotations for automation platform support
Support major incidents requiring automation troubleshooting
Travel up to 10% as needed