Senior ML Ops Engineer

Mimecast•Columbus, OH

1d•$148,000 - $222,000•Hybrid

About The Position

As a Senior ML Ops Engineer at Mimecast, you will be a technical leader on the AI Enablement Platform (AIP) team, responsible for ensuring that machine learning models and AI agents are deployed, scaled, observed, and maintained reliably across production environments. The AI Enablement Platform serves billions of requests per month across multiple regions, powering AI-driven capabilities in email security, insider risk, data loss prevention, and collaboration security for Mimecast's Human Risk Management platform.

This role sits at the intersection of infrastructure engineering and machine learning. You will own the design and implementation of self-service deployment tooling, platform resilience and scaling infrastructure, and operational best practices that enable ML Engineers and Data Scientists to ship models and agents independently, with confidence. You will also be responsible for building and maintaining the developer platform that accelerates the work of ML practitioners across the organization.

This is a senior individual contributor role. You are expected to drive architectural decisions, mentor other engineers, define standards, and operate with a high degree of autonomy. You will collaborate closely with ML Engineers, Software Engineers, SRE, and Cloud Platform teams.

Requirements

Strong experience with AWS, particularly SageMaker (endpoints), EC2, ECS/EKS, SQS, S3, CloudWatch, and IAM.
Proficiency in Python, Java (Spring Boot), and Bash scripting.
Deep expertise with Infrastructure as Code (Terraform) and containerization (Docker, Kubernetes), including scaling policies, ConfigMaps, and resource management.
Strong experience designing and maintaining CI/CD pipelines (Jenkins or GitHub Actions preferred), including lifecycle management, automated testing, and deployment gates for ML workflows.
Demonstrated ability to build and tune autoscaling, rate limiting, and traffic management systems for high-throughput, latency-sensitive services.
Solid understanding of observability tooling and practices (Grafana, CloudWatch, OpenTelemetry) and experience building monitoring for ML model performance in production.
Familiarity with ML frameworks (PyTorch), ML lifecycle tools (MLflow, SageMaker), and model serving patterns (real-time inference, batch transform, async processing).
Experience working in multi-region, production-grade environments handling high request volumes.
Experience and enthusiasm using AI-assisted development tooling to accelerate your own work and the work of ML Engineers.
Comfortable operating as a technical authority across ML Engineering, Software Engineering, SRE, and product teams, influencing outcomes through expertise and trust rather than org chart position.

Nice To Haves

Experience building self-service developer platforms or internal tooling for ML/data teams.
Exposure to LLM serving infrastructure (model hosting, prompt management, token-level observability) and agentic AI deployment patterns.
Experience with cost allocation, FinOps, or cloud cost optimization for ML workloads.
Background in cybersecurity or experience operating AI systems in regulated or security-sensitive environments.
Familiarity with the Model Context Protocol (MCP) or similar agent-tool integration standards.
Experience with Triton and ONNX.

Responsibilities

Design and build config-driven, validated workflows that enable ML Engineers to deploy models to AIP infrastructure without requiring hands-on ML Ops involvement for each release. This includes automated validation pipelines, standardized configuration schemas, endpoint provisioning, and de-risked rollout patterns (canary, blue-green, rollback).
Own the reliability and scalability of ML inference infrastructure. Design and tune autoscaling policies against real production traffic patterns, implement rate limiting and backpressure mechanisms (HTTP 429, Retry-After) at the API layer, and build request prioritization frameworks (real-time vs. batch) so the platform protects itself under load without manual intervention or consumer-side changes.
Develop and maintain the platform's observability stack (metrics, logging, tracing, alerting) so that monitoring is wired in by default for every deployed model and agent. Continuously monitor model performance, data drift, latency, error rates, and system health. Build dashboards and alerting that give both the AIP team and consuming teams visibility into their workloads.
Design, implement, and maintain robust CI/CD pipelines for ML model and infrastructure deployments. Automate testing (functional, integration, performance) as pre-deployment gates that ML Engineers can trigger themselves, with clear pass/fail criteria.
Manage all AIP infrastructure through Terraform and configuration management tooling. Maintain multi-region deployment capabilities and ensure infrastructure changes are reviewable, repeatable, and auditable.
Implement and enforce cost tagging and allocation at deployment time. Optimize ML inference endpoints for cost-effectiveness, including right-sizing instance types, managing reserved capacity, and providing opinionated endpoint configuration recommendations based on model characteristics.
Support the deployment and operational management of AI agents and LLM-based capabilities within the AIP's templatized agent framework. This includes infrastructure for agent hosting, tool access configuration, and observability for agentic workloads.
Ensure ML systems adhere to security best practices, including input validation, authentication, network exposure controls, and automated security scanning for model configurations. Support compliance with regulatory requirements relevant to AI systems in the cybersecurity domain.
Mentor ML Ops and ML Engineers on operational best practices. Participate in architectural reviews, contribute to platform governance, and drive engineering standards through documentation, code reviews, and design discussions. Represent ML Ops in cross-functional planning with SRE, Cloud Platform, and consuming product teams.

Benefits

Formal and on-the-job learning opportunities
Comprehensive benefits package that helps our employees and their family members to sustain a healthy lifestyle
Working in cross-functional teams to build your knowledge
Hybrid working model
Flexibility to live balanced, healthy lives
