Within Platform Engineering and Enterprise AI Services, the Senior Engineer, Enterprise AI Services is responsible for designing, building, and operating observability capabilities for the AI and LLM workloads that power TR's AI-driven products. This role ensures our AI systems, from classical ML to generative AI, are observable, debuggable, reliable, and continuously improving across TR's multi-cloud footprint (AWS, Azure, GCP) and internal Kubernetes platform. You will own the end-to-end observability stack for AI (with Braintrust as the foundational platform), enabling product teams, data scientists, and AI engineers to understand model behavior in production, detect issues early, and make data-driven improvements to model quality, latency, and cost. The successful candidate will help build the next generation of TR's AI observability and evaluation platform, working alongside cloud engineering, data engineering, Enterprise AI Services, and product teams.

In this opportunity as a Senior Engineer, Enterprise AI Services, you will:
Serve as the Kubernetes expert for AI services, defining and operating deployment standards for scalability, resilience, security, and performance.
Own the AI observability platform, implementing tools such as Braintrust and Langfuse to support tracing, evaluation, analytics, and monitoring of LLM/ML workloads.
Define and standardize telemetry across AI products, including traces, metrics, logs, evaluations, and feedback, while ensuring governance, privacy, and auditability requirements are met.
Build telemetry pipelines, dashboards, and reporting that provide clear visibility into model performance, quality, safety, reliability, and cost.
Establish monitoring, alerting, SLOs/SLIs, and incident response practices for AI systems, including root cause analysis and continuous improvement.
Integrate observability and evaluation into CI/CD so new models, prompts, and workflows are automatically enrolled in monitoring and quality controls.
Partner with Product, Data Science, AI Engineering, SRE, Platform, and Cloud teams to onboard new AI use cases, support experimentation and drift detection, and implement guardrails and policy enforcement.
Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed