Director of Customer Reliability Engineering

Nexus Cognitive Technologies • Atlanta, GA

Onsite

About The Position

NexusOne is seeking a Director of Customer Reliability Engineering to build and own the company's customer reliability function from the ground up. This is a 0-to-1 build role: establishing the operational engine that keeps NX1's platform reliable for every customer.

The role involves building an AI-native support and production engineering model and managing two co-primary teams:

Customer Support: L1-L3 ticket flow, on-call, SLAs/SLOs, runbooks, knowledge management, and customer-facing incident response.
Production Engineering/SRE: technical execution for customer environments, platform upgrades, monitoring, alerting, automated response, incident response, and change management.

The position requires supporting both the modern NX1-native stack (Kubernetes, Spark on K8s, Trino, etc.) and the inherited Cloudera book (CDH/CDP, Impala, Hive, HBase). The ideal candidate will have a builder mindset, experience scaling a global team, and a strong understanding of modern data infrastructure and AI-native operating models. This role is a growth opportunity, with the potential to scale into a VP or Chief of Customer role within 2-3 years.

Requirements

8–12 years of experience, including 4+ years in customer reliability, support, or production engineering leadership.
Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company.
Has run a global team of at least 15 people across at least two geographies.
Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback.
Has owned 24x7 production support with hard SLAs and live enterprise customer escalations.
Has recent operational experience (within the last 3 years) with Kubernetes-native, modern open-source data infrastructure (Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration).
Is hands-on with modern observability and on-call tooling (OpenTelemetry, Prometheus, Grafana, PagerDuty / incident.io, structured logging, distributed tracing).
Has shipped AI or automation into a support or production-ops workflow with measurable outcomes.
High agency, builder over maintainer, founding mindset.
Bias toward written clarity.
Atlanta-based or genuinely willing to relocate.

Nice To Haves

Has operated as part of an MSP or managed service business.
Cloudera/CDP familiarity as a secondary asset.
Has built a support function more than once, or built one and scaled it to 50+.
Open-source community presence or contributions in the NX1 stack (Airflow, Trino, Iceberg, Spark).
Has been on the receiving end of vendor support as a customer and applied those lessons in design.

Responsibilities

Design and build the L1-L3 support model from scratch, including tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and an AI-native scaling layer.
Own 24x7 production support with hard SLAs across a growing enterprise customer base.
Build knowledge management and runbook infrastructure for both human support engineers and AI agents.
Select and configure the support tooling stack (ticketing, incident management, observability, status page).
Drive operational communications during incidents, including P1 cadence, postmortems, and executive escalations.
Provide credible L1-L3 support across both Cloudera and NX1-native stacks within a single operational frame.
Design support tiers, on-call rotations, and capacity models that handle both stacks.
Coordinate with Forward Deployed Engagement Managers (FDEMs) for Cloudera customer engagements.
Build the production operations function from scratch for customer environments NX1 operates.
Own platform upgrades end-to-end, including scheduling, testing, execution, and rollback discipline for both NX1 software and Cloudera/CDP.
Own the monitoring and alerting infrastructure, with AI-native automated response as the primary design goal.
Own production incident response, coordinating with FDEMs and platform engineering.
Build the automation backbone for runbook execution, deployment, upgrades, alert routing, and automated remediation.
Hire and develop a team of support engineers and managed service engineers across the US and India.
Build sustainable on-call practices with geo coverage and escalation hygiene.
Grow individual contributors into leads and managers as the organization scales.
Protect the team from pre-sales pull and build a handoff model with sales and customer success.
Feed reliability substance into QBRs and renewal conversations.
Build the voice-of-customer loop to convert reliability signals into product roadmap input.
Connect support and reliability performance to ARR, retention, and expansion.