Director of Customer Reliability Engineering

Nexus Cognitive Technologies
Atlanta, GA
Onsite

About The Position

NexusOne is seeking a Director of Customer Reliability Engineering to build and own the company's customer reliability function from the ground up. This is a 0-to-1 build role, establishing the operational engine that ensures NX1's platform is reliable for every customer.

The role involves building an AI-native support and production engineering model and managing two co-primary teams:

  • Customer Support: L1-L3 ticket flow, on-call, SLAs/SLOs, runbooks, knowledge management, and customer-facing incident response.
  • Production Engineering/SRE: technical execution for customer environments, platform upgrades, monitoring, alerting, automated response, incident response, and change management.

The position requires supporting both the modern NX1-native stack (Kubernetes, Spark on K8s, Trino, etc.) and the inherited Cloudera book (CDH/CDP, Impala, Hive, HBase). The ideal candidate will have a builder mindset, experience scaling a global team, and a strong understanding of modern data infrastructure and AI-native operating models. This role is a growth opportunity, with the potential to scale into a VP or Chief of Customer role within 2-3 years.

Requirements

  • 8–12 years of experience, including 4+ years in customer reliability, support, or production engineering leadership.
  • Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company.
  • Has run a global team of at least 15 people across at least two geographies.
  • Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback.
  • Has owned 24x7 production support with hard SLAs and live enterprise customer escalations.
  • Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure (Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration).
  • Hands-on experience with modern observability and on-call tooling (OpenTelemetry, Prometheus, Grafana, PagerDuty / incident.io, structured logging, distributed tracing).
  • Has shipped AI or automation into a support or production-ops workflow with measurable outcomes.
  • High agency, builder over maintainer, founding mindset.
  • Bias toward written clarity.
  • Atlanta-based or genuinely willing to relocate.

Nice To Haves

  • Has operated as part of an MSP or managed service business.
  • Cloudera/CDP familiarity as a secondary asset.
  • Built the support function more than once, or built once and scaled to 50+.
  • Open-source community presence or contributions in the NX1 stack (Airflow, Trino, Iceberg, Spark).
  • Has been on the receiving end of vendor support as a customer and applied those lessons in design.

Responsibilities

  • Design and build the L1-L3 support model from scratch, including tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and an AI-native scaling layer.
  • Own 24x7 production support with hard SLAs across a growing enterprise customer base.
  • Build knowledge management and runbook infrastructure for both human support engineers and AI agents.
  • Select and configure the support tooling stack (ticketing, incident management, observability, status page).
  • Drive operational communications during incidents, including P1 cadence, postmortems, and executive escalations.
  • Provide credible L1-L3 support across both Cloudera and NX1-native stacks within a single operational frame.
  • Design support tiers, on-call rotations, and capacity models that handle both stacks.
  • Coordinate with Forward Deployed Engagement Managers (FDEMs) for Cloudera customer engagements.
  • Build the production operations function from scratch for customer environments NX1 operates.
  • Own platform upgrades end-to-end, including scheduling, testing, execution, and rollback discipline for both NX1 software and Cloudera/CDP.
  • Own the monitoring and alerting infrastructure, with AI-native automated response as the primary design goal.
  • Own production incident response, coordinating with FDEMs and platform engineering.
  • Build the automation backbone for runbook execution, deployment, upgrades, alert routing, and automated remediation.
  • Hire and develop a team of support engineers and managed service engineers across the US and India.
  • Build sustainable on-call practices with geo coverage and escalation hygiene.
  • Grow individual contributors into leads and managers as the organization scales.
  • Protect the team from pre-sales pull and build a handoff model with sales and customer success.
  • Feed reliability substance into QBRs and renewal conversations.
  • Build the voice-of-customer loop to convert reliability signals into product roadmap input.
  • Connect support and reliability performance to ARR, retention, and expansion.

Benefits

  • A collaborative team culture built on curiosity and respect
  • Challenging work where your contributions clearly matter
  • A leadership team that invests in learning and development
  • The opportunity to work at the intersection of cloud, data, and AI innovation