Senior Software Engineer - Python APIs

Fidelity Investments • Jersey City, NJ

Hybrid

About The Position

As a Senior Software Engineer on the Enterprise AI/ML Platform team, you will design, build, and operate production-grade software systems that enable machine learning at scale across the organization. You will focus on developing robust web services, libraries, and automation that support model training, evaluation, deployment, and lifecycle management in highly regulated, high-availability environments. This role is ideal for engineers who enjoy working on distributed systems, developer platforms, and infrastructure-heavy codebases, and who have applied software engineering best practices to systems that support AI/ML workloads.

Requirements

Bachelor’s or Master’s degree in a technology-focused discipline such as Computer Science, Software Engineering, or a closely related field.
Strong software engineering experience in Python, with a proven ability to design, implement, test, and maintain production-quality libraries, services, and internal platforms.
Comfortable applying object-oriented and functional programming principles in Linux-based environments; scripting and automation experience required.
5 years of professional experience developing Python-based cloud applications or internal platforms, with demonstrated ownership of non-trivial systems in production.
Experience building and operating cloud-native systems on AWS, including services such as S3, Lambda, Batch, Step Functions, EventBridge, CloudWatch, and SNS/SQS.
Experience supporting managed ML services (e.g., SageMaker or equivalent) as part of a broader platform.
Strong DevOps and CI/CD experience, including automated build, test, and deployment pipelines using tools such as Jenkins and Git-based workflows.
Hands-on experience with containerization (Docker) and deploying containerized workloads in scalable environments.
Hands-on experience supporting ML-enabled systems in production, including model packaging, deployment, inference workflows, monitoring, and operational measurement.
A focus on system reliability, observability, and maintainability rather than on model experimentation or research.
Familiarity with applied machine learning concepts and data workflows, including feature pipelines and working with structured, semi-structured, and unstructured data, sufficient to design and support scalable ML platforms and integrations.
Strong understanding of scalable and distributed system design, with experience building or operating systems that handle high-throughput workloads, asynchronous processing, and fault tolerance using open-source technologies.
Proven experience supporting business-critical applications, including troubleshooting production issues, performing root cause analysis, and driving improvements to system stability and performance.
Excellent communication and collaboration skills, with the ability to clearly document systems, communicate technical tradeoffs, and work effectively across engineering, data, and business teams.
Ability to operate effectively in ambiguous, fast-paced environments, adapting to evolving business priorities and technology changes within a broader AI and data ecosystem.

Nice To Haves

Familiarity with Java or Groovy.
Familiarity with LiteLLM or GenAI gateways.
Exposure to Azure or GCP (beneficial but not required).
Infrastructure-as-Code expertise with tools such as AWS CloudFormation and Terraform/OpenTofu, used to provision, manage, and evolve cloud infrastructure in a repeatable and auditable manner.

Responsibilities

Partner with Data Scientists to package, scale, and operationalize models, providing the platforms and tooling required to move from experimentation to reliable production use.
Collaborate with application and platform engineers to integrate ML capabilities with enterprise gateways and services, enabling secure and scalable access to both traditional and generative models.
Operationalize ML-enabled systems at enterprise scale, designing and supporting services capable of serving predictions to tens of millions of customers with high reliability and performance.
Build platform tooling for model and data observability, including detection of data and feature drift, monitoring prediction quality and uncertainty, and automating diagnostics and explainability workflows.
Continuously evaluate and adopt emerging technologies, applying sound engineering judgment to simplify the data and ML ecosystem while improving developer experience and operational stability.
Drive innovation through pragmatic, forward-looking solutions, balancing future capabilities with production readiness and long-term maintainability.
Improve team agility and productivity by introducing reusable frameworks, automation, and clear abstractions that reduce friction for downstream consumers.
Resolve technical roadblocks and mitigate platform risks, proactively addressing scalability, reliability, and integration challenges.
Increase delivery velocity and system reliability through automation, including the design and maintenance of robust CI/CD pipelines and operational workflows.