About The Position

At Algolia, we’re proud to be a pioneer and market leader in AI Search, empowering 17,000+ businesses to deliver blazing-fast, predictive search and browse experiences at internet scale. Every week, we power over 30 billion search requests — four times more than Microsoft Bing, Yahoo, Baidu, Yandex, and DuckDuckGo combined. In 2021, we raised $150 million in Series D funding, quadrupling our valuation to $2.25 billion. This strong foundation enables us to keep investing in our market-leading platform and serving incredible customers like Under Armour, PetSmart, Stripe, Gymshark, and Walgreens.

THE MISSION

We are building the next generation of AI-powered search products. We make AI explainable, and we help customers make data-driven decisions. You will work with the product function to guide product development through analytics and experimentation, and you will be an integral part of building the future of AI search. If you’re passionate about turning product data into actionable insights and driving product success, we’d love to hear from you.

THE OPPORTUNITY

We are seeking a skilled Senior AI/ML Ops Engineer to enable our Data Scientists to move faster and our customers to receive smarter search and discovery experiences by turning prototypes into robust, scalable, and observable AI services. You will own the end-to-end engineering life cycle: packaging, deploying, operating, and continuously improving the machine-learning models that power search ranking, recommendations, and related information-retrieval features on our e-commerce platform.

Requirements

  • Spend 1-2 days per week in a local coworking space to collaborate with your teammates in person.
  • 5+ years of experience in software engineering with 2+ years focused on deploying ML/AI systems at scale.
  • Strong coding skills in Python (preferred) and at least one statically typed language (Go preferred).
  • Hands-on expertise with containerization (Docker), orchestration (Kubernetes/EKS/GKE/AKS), and cloud platforms (AWS, GCP, or Azure).
  • Proven record of building CI/CD pipelines and automated testing frameworks for data or ML workloads.
  • Deep understanding of REST/gRPC APIs, message queues (Kafka, Kinesis, Pub/Sub), and stream/batch data processing frameworks (Spark, Flink, Beam).
  • Experience implementing monitoring, alerting, and logging for mission-critical services.
  • Familiarity with common ML lifecycle tools (MLflow, Kubeflow, SageMaker, Vertex AI, Feature Store, etc.).
  • Working knowledge of ML concepts such as feature engineering, model evaluation, A/B testing, and drift detection.

Responsibilities

  • Productionization & Packaging: Convert notebooks and research code into production-ready Python and Go microservices, libraries, or Kubeflow pipelines; design reproducible build pipelines (Docker, Conda, Poetry) and manage artifacts in centralized registries.
  • Scalable Deployment: Orchestrate real-time and batch inference workloads on Kubernetes, AWS/GCP managed services, or similar platforms, ensuring low latency and high throughput; implement blue-green/canary rollouts, automatic rollback, and model versioning strategies (SageMaker, Vertex AI, KServe, MLflow, BentoML, etc.).
  • MLOps & CI/CD: Build and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Argo) covering unit, integration, data-quality, and performance tests; automate feature-store updates, model retraining triggers, and scheduled batch jobs using Airflow, Dagster, or similar orchestration tools.
  • Observability & Reliability: Define and monitor SLIs/SLOs for model latency, throughput, accuracy, drift, and cost; integrate logging, tracing, and metrics (Datadog, etc.) and establish alerting and on-call practices.
  • Data & Feature Engineering: Collaborate with data engineers to create scalable pipelines that ingest clickstream logs, catalog metadata, images, and user signals; implement real-time and offline feature extraction, validation, and lineage tracking.
  • Performance & Cost Optimization: Profile models and services; leverage hardware acceleration (GPU, TPU), libraries (ONNX, OpenVINO), and caching strategies (Redis, Faiss) to meet aggressive latency targets; right-size clusters and workloads to balance performance with cloud spend.
  • Governance & Compliance: Embed security, privacy, and responsible-AI checks in pipelines; manage secrets, IAM roles, and data-access controls via Terraform or CloudFormation; ensure auditability and reproducibility through comprehensive documentation and artifact tracking.
  • Collaboration & Mentorship: Partner closely with Data Scientists, Product Owners, and Site Reliability Engineers to align technical solutions with business goals; coach junior engineers on MLOps best practices and contribute to internal knowledge-sharing sessions.


What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 501-1,000 employees
