Director of Customer Reliability Engineering

Nexus Cognitive Technologies • Atlanta, GA

Posted 9 days ago • Onsite

About The Position

NexusOne is seeking a Director of Customer Reliability Engineering to build and own the company's customer reliability function from the ground up. This is a 0-to-1 build role: it requires establishing a comprehensive reliability framework, including runbooks, ticket taxonomy, intake models, and incident management.

The role oversees two co-primary teams. Customer Support covers L1-L3 ticket flow, on-call, SLAs/SLOs, knowledge management, and customer-facing incident response. Production Engineering/SRE covers technical execution for customer environments: platform upgrades, monitoring, alerting, automated response, incident response, and change management. Both teams must support the inherited Cloudera stack as well as the modern NX1-native stack (Kubernetes, Spark on K8s, Trino, etc.).

The ideal candidate is a builder who is comfortable with hands-on work in the early stages and who will design an AI-native operating model from day one. This role is a growth opportunity, with the potential to scale into a VP or Chief of Customer role within 24-36 months.

Requirements

8-12 years of experience, with 4+ years in customer reliability, support, or production engineering leadership.
Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company.
Has run a global team of at least 15 people across at least two geographies.
Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback.
Has owned 24x7 production support with hard SLAs and live enterprise customer escalations.
Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure as the primary stack (Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration).
Hands-on with modern observability and on-call tooling (OpenTelemetry, Prometheus, Grafana, PagerDuty/incident.io, structured logging, distributed tracing).
Has shipped AI or automation into a support or production-ops workflow with measurable outcomes.
High agency, builder over maintainer, founding mindset.
Bias toward written clarity.
Atlanta-based or genuinely willing to relocate.

Nice To Haves

Has operated as part of an MSP or managed service business.
Cloudera/CDP familiarity as a secondary asset.
Has built a support function more than once, or built one and scaled it to 50+.
Open-source community presence or contributions in the NX1 stack (Airflow, Trino, Iceberg, Spark).
Has been on the receiving end of vendor support as a customer and applied those lessons in design.

Responsibilities

Design and build the L1-L3 support model from scratch, including tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and an AI-native scaling layer.
Own 24x7 production support with hard SLAs across a growing enterprise customer base.
Build the knowledge management and runbook infrastructure for both support engineers and AI agents.
Choose and configure the support tooling stack (ticketing, incident management, observability, status page).
Drive operational communications during incidents, including P1 update cadence, customer-readable postmortems, and the substance of executive escalations.
Provide credible L1-L3 support across both Cloudera and NX1-native stacks.
Design support tiers, on-call rotations, and capacity models that handle both stacks within a single operational frame.
Coordinate with Forward Deployed Engagement Managers (FDEMs) for Cloudera customer engagements, ensuring clear delineation of support vs. operational management.
Build the production operations function from scratch for customer environments, including upgrade discipline, monitoring infrastructure, and automated response.
Own platform upgrades end-to-end, including scheduling, testing, execution, rollback, and change management hygiene for both NX1 software and Cloudera/CDP.
Own the monitoring and alerting infrastructure, designing an AI-native model with automated response as the primary goal.
Own production incident response, coordinating with FDEMs and platform engineering.
Write operational postmortems and drive root-cause fixes with platform engineering.
Build the automation backbone for runbook execution, deployment/upgrades, alert routing, escalation logic, and automated remediation.
Hire and develop a team of support and managed service engineers across the US and India.
Build sustainable on-call practices with proper geo coverage, comp-time discipline, and escalation hygiene.
Grow individual contributors into leads and managers as the organization scales.
Protect the team from pre-sales pull and build a handoff model with sales and customer success.
Supply operational evidence on reliability to QBRs and renewal conversations.
Build the voice-of-customer loop to convert reliability signals into product roadmap input.
Connect support and reliability performance to ARR, retention, and expansion metrics.