Director of Customer Reliability Engineering

Nexus Cognitive Technologies
Atlanta, GA
Onsite

About The Position

NexusOne is seeking a Director of Customer Reliability Engineering to build and own the company's customer reliability function from the ground up. This is a 0-to-1 build role, requiring the establishment of a comprehensive reliability framework, including runbooks, ticket taxonomy, intake models, and incident management.

The role will oversee two co-primary teams: Customer Support (L1-L3 ticket flow, on-call, SLAs/SLOs, knowledge management, customer-facing incident response) and Production Engineering/SRE (technical execution for customer environments, platform upgrades, monitoring, alerting, automated response, incident response, change management). The position requires supporting both the inherited Cloudera stack and the modern NX1-native stack (Kubernetes, Spark on K8s, Trino, etc.).

The ideal candidate will be a builder, comfortable with hands-on work in the early stages, and will design an AI-native operating model from day one. This role is a growth opportunity, with the potential to scale into a VP or Chief of Customer role within 24-36 months.

Requirements

  • 8-12 years of experience, with 4+ years in customer reliability, support, OR production engineering leadership.
  • Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company.
  • Has run a global team of at least 15 people across at least two geographies.
  • Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback.
  • Has owned 24x7 production support with hard SLAs and live enterprise customer escalations.
  • Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure as the primary stack (Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration).
  • Hands-on with modern observability and on-call tooling (OpenTelemetry, Prometheus, Grafana, PagerDuty/incident.io, structured logging, distributed tracing).
  • Has shipped AI or automation into a support OR production-ops workflow with measurable outcomes.
  • High agency, builder over maintainer, founding mindset.
  • Bias toward written clarity.
  • Atlanta-based or genuinely willing to relocate.

Nice To Haves

  • Has operated as part of an MSP or managed service business.
  • Cloudera/CDP familiarity as a secondary asset.
  • Has built a support function more than once, or built one once and scaled it to 50+.
  • Open-source community presence or contributions in the NX1 stack (Airflow, Trino, Iceberg, Spark).
  • Has been on the receiving end of vendor support as a customer and applied those lessons in design.

Responsibilities

  • Design and build the L1-L3 support model from scratch, including tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and an AI-native scaling layer.
  • Own 24x7 production support with hard SLAs across a growing enterprise customer base.
  • Build the knowledge management and runbook infrastructure for both support engineers and AI agents.
  • Choose and configure the support tooling stack (ticketing, incident management, observability, status page).
  • Drive operational communications during incidents, including P1 cadence, customer-readable postmortems, and executive escalation substance.
  • Provide credible L1-L3 support across both Cloudera and NX1-native stacks.
  • Design support tiers, on-call rotations, and capacity models that handle both stacks within a single operational frame.
  • Coordinate with Forward Deployed Engagement Managers (FDEMs) for Cloudera customer engagements, ensuring clear delineation of support vs. operational management.
  • Build the production operations function from scratch for customer environments, including upgrade discipline, monitoring infrastructure, and automated response.
  • Own platform upgrades end-to-end, including scheduling, testing, execution, rollback, and change management hygiene for both NX1 software and Cloudera/CDP.
  • Own the monitoring and alerting infrastructure, designing an AI-native model with automated response as the primary goal.
  • Own production incident response, coordinating with FDEMs and platform engineering.
  • Write operational postmortems and drive root-cause fixes with platform engineering.
  • Build the automation backbone for runbook execution, deployment/upgrades, alert routing, escalation logic, and automated remediation.
  • Hire and develop a team of support and managed service engineers across the US and India.
  • Build sustainable on-call practices with proper geo coverage, comp-time discipline, and escalation hygiene.
  • Grow individual contributors into leads and managers as the organization scales.
  • Protect the team from pre-sales pull and build a handoff model with sales and customer success.
  • Feed reliability substance into QBRs and renewal conversations, supplying operational evidence.
  • Build the voice-of-customer loop to convert reliability signals into product roadmap input.
  • Connect support and reliability performance to ARR, retention, and expansion metrics.

Benefits

  • A collaborative team culture built on curiosity and respect
  • Challenging work where your contributions clearly matter
  • A leadership team that invests in learning and development
  • The opportunity to work at the intersection of cloud, data, and AI innovation