Senior Machine Learning Engineer, AI Platform

Mozilla

4d • $139,000–$218,000 • Remote

About The Position

The AI Platform team at Mozilla develops the core infrastructure that supports intelligent features across Mozilla's product suite. This includes model training pipelines, high-throughput inference services, GPU orchestration, and secure, privacy-conscious AI systems designed for global-scale reliability. Mozilla is seeking a Machine Learning Engineer with a strong platform orientation to help design, develop, and operate this platform. The role sits at the intersection of machine learning, distributed systems, and production infrastructure, and is responsible for making model training, deployment, and serving efficient, secure, and scalable. Close collaboration with product, infrastructure, and security teams is essential to enable rapid iteration while meeting stringent performance and privacy standards.

Requirements

Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience.
Strong experience developing in Python for machine learning systems, backend services, or distributed data processing.
Proven experience deploying and operating ML workloads in cloud environments with production-grade infrastructure.
Solid understanding of model serving architectures, inference pipelines, and performance trade-offs (latency, throughput, cost, scaling strategies).
Hands-on experience with GPU-based workloads and accelerated computing in production settings.
Experience designing CI/CD pipelines and development workflows for reliable ML system deployment.
Ability to independently scope and drive technical initiatives while balancing product and operational priorities.
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems.
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams.

Nice To Haves

Experience implementing inference optimization strategies such as batching, quantization, compilation, model conversion, or hardware-specific tuning.
Familiarity with containerization and orchestration systems (e.g., Docker, Kubernetes) in production environments.
Experience designing observability systems for distributed services, including metrics strategy and performance profiling.
Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design.
Contributions to open-source ML infrastructure projects or leadership in building reusable internal ML tooling.

Responsibilities

Design, build, and operate core AI platform components for training, deploying, and serving machine learning models in production.
Manage end-to-end model serving and inference workflows, focusing on improvements in reliability, scalability, performance, and operational excellence.
Lead optimization efforts for inference systems to enhance throughput, reduce latency, and improve cost efficiency across CPU and GPU workloads.
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization.
Oversee and enhance critical aspects of the model lifecycle, such as packaging, versioning, testing, validation, and deployment automation.
Implement and refine observability practices (metrics, logging, tracing, alerting) to improve visibility into ML services and pipelines and strengthen their operational resilience.
Collaborate with product, infrastructure, security, and data teams to architect scalable platform capabilities for AI-powered features.
Participate in technical design discussions, propose architectural enhancements, and mentor junior engineers through code reviews and knowledge sharing.
Participate in and improve operational processes, including incident response, on-call duties, and post-incident reviews.