Principal Engineer - Python API Development

Fidelity Investments•Jersey City, NJ

Hybrid

About The Position

As a Principal Engineer on the Enterprise AI/ML Platform team, you will tackle the most complex technical challenges involved in delivering machine learning at enterprise scale. You will design, build, and evolve reliable, secure, and cost‑efficient platform capabilities—from model packaging and serving to observability and lifecycle management—working closely with multiple teams to ensure these capabilities are practical, robust, and widely usable in production. You will take a hands‑on role across enterprise repositories, improving shared services, CI/CD workflows, and infrastructure patterns where they have the greatest impact. This includes deep technical investigation of performance and scalability issues, such as tracking down bottlenecks in web services, analyzing system and application metrics, and optimizing GPU utilization, throughput, and resource efficiency across ML workloads.

Requirements

Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a closely related engineering discipline; 8+ years (typically 10+) building and operating production platforms and services at scale.
Deep software engineering expertise in Python and distributed systems, with a track record of building production‑grade services, libraries, and internal platforms.
Engineering excellence demonstrated through clean designs, automated testing, and maintainable abstractions; Linux fluency and scripting are required.
Cloud platform leadership (AWS)—hands‑on with S3, Lambda, Batch, Step Functions, EventBridge, CloudWatch, and SNS/SQS—and experience shaping platform patterns that other teams adopt.
DevOps and CI/CD at scale, owning standards for automated build/test/deploy (e.g., Jenkins, Git‑based workflows), containerization (Docker), release governance, and multi‑environment promotion for ML‑enabled workloads.
Infrastructure as Code (CloudFormation, Terraform/OpenTofu) and platform reliability engineering (SLOs/error budgets, capacity planning, cost observability, incident response, and post‑mortems) for ML serving and data/feature pipelines.
ML enablement in production: model packaging, deployment strategies (batch/online/streaming), inference routing, traffic management, performance tuning, observability, and controls for responsible use—without a research or modeling focus.
Cross‑org technical leadership: you mentor junior and senior engineers, anchor code review across repositories, and routinely consider impacts on upstream/downstream systems when proposing changes.

Nice To Haves

Familiarity with Java or Groovy is a plus.
Knowledge of or experience with GenAI gateways or LiteLLM is a big plus.
Exposure to Azure or GCP is beneficial.
Experience enabling managed ML services (e.g., SageMaker) as part of broader platform capabilities.

Responsibilities

Set platform strategy and standards for ML packaging, deployment, serving, and observability—driving consistent adoption across squads and business units.
Partner with Data Scientists to package, scale, and operationalize models; define the APIs, guardrails, and automation that take work from experimentation to reliable production.
Enable secure, scalable access to traditional and generative models by collaborating with platform and application engineers to integrate through enterprise gateways and services.
Advance model/data observability: tooling for data and feature drift detection, prediction‑quality monitoring and uncertainty signals, and automated diagnostics/explainability.
Lead cross‑platform incident response and post‑mortems, drive systemic fixes, and evolve standards to prevent recurrence—across applications and the platform.
Uplevel engineering velocity by introducing reusable frameworks, paved paths, and CI/CD templates that simplify integration, reduce toil, and improve reliability at scale.
Reduce cost and complexity across the ML ecosystem through pragmatic technology choices, clear abstractions, and a long‑term platform roadmap.