Head of Reliability - Blueflame AI

Datasite
New York City, NY

About The Position

Blueflame AI for Datasite is looking for a Head of Reliability to own reliability, quality, and release assurance across the entire Blueflame AI platform. This is not a support role; it is a technical leadership position that combines QA and platform reliability ownership to ensure that every feature shipped is tested, stable, and trustworthy. You’ll manage the reliability roadmap, set quality standards, and work closely with our engineering and product teams to make reliability a priority in everything we build.

Requirements

  • 8+ years in reliability, QA, or platform engineering roles, including 1+ years in a management role.
  • Strong experience designing and running QA and automated testing frameworks within CI/CD pipelines.
  • Hands-on experience with AWS cloud infrastructure and observability tools including Datadog and ELK stack.
  • Track record of improving uptime, release quality, and user trust in production environments.
  • Excellent collaboration skills, with the ability to work across Product, Engineering, and Security functions.

Nice To Haves

  • Familiarity with LLM or AI-driven systems, especially testing non-deterministic or probabilistic outputs.

Responsibilities

Quality Assurance (QA) Ownership

  • Lead the QA function, defining frameworks, tooling, and processes for automated and manual testing.
  • Ensure every release meets strict reliability and data integrity standards.
  • Work with engineering to build and maintain CI/CD-integrated test automation for frontend, backend, and model workflows.
  • Partner with product managers to define acceptance criteria, regression suites, and go/no-go release thresholds.

Reliability & Platform Resilience

  • Define and own Blueflame’s reliability strategy: uptime, latency, and system integrity across core services (API, search, context engine, data integrations).
  • Establish and manage SLOs/SLIs with engineering squads, ensuring proactive monitoring and error budgeting.
  • Review architectural designs for resilience, scalability, and recoverability.
  • Implement and manage monitoring and alerting across our platform, including within AWS.
  • Oversee the observability stack and monitoring pipelines (logs, metrics, traces, dashboards).
  • Establish real-time performance insights and alerting mechanisms.

Release Assurance & Continuous Improvement

  • Implement consistent release and rollback processes across environments.
  • Manage release readiness reviews and reliability audits.
  • Work with the support team on post-incident reviews and the implementation of long-term fixes.

Leadership & Culture

  • Build and lead a small, high-impact reliability engineering and QA team.
  • Champion quality-by-design principles within all engineering squads.
  • Assist with SOC-2 readiness.

Benefits

  • health insurance (medical, dental, vision)
  • a retirement savings plan
  • paid time off
  • other employee benefits

What This Job Offers

  • Job Type: Full-time
  • Career Level: Manager
  • Education Level: No Education Listed
  • Number of Employees: 1,001-5,000 employees
