Senior Machine Learning Engineer

Docusign
San Francisco, CA
Hybrid

About The Position

We are looking for a Senior Machine Learning Engineer to redefine how we operate our global services. You won't just be building dashboards; you will be building the "brain" of our infrastructure. We are moving beyond simple anomaly detection to build a self-healing ecosystem in which Multi-Agent Systems and Reinforcement Learning (RL) loops work in tandem with Large Language Models (LLMs) not only to detect incidents in real time but also to troubleshoot and resolve them autonomously. If you are passionate about applying complex AI architectures to massive datasets (billions of telemetry points) to solve real-world reliability challenges, this is the role for you. This is an individual contributor role reporting to the Sr. Director, Software Engineering.

Requirements

  • 8+ years of professional experience in Machine Learning Engineering or Data Science
  • Experience with PyTorch or TensorFlow, specifically for time series analysis (forecasting/anomaly detection) and NLP
  • Experience building applications using LLMs (RAG pipelines, LangChain, vector databases) specifically for technical domains (code analysis, log parsing)
  • Experience with RL concepts (policies, rewards, agents) and with applying them to optimization or control problems
  • Experience with distributed data processing and streaming technologies (Apache Spark, Kafka, Flink)
  • Experience with software engineering fundamentals (Python, C++, or Go), CI/CD for ML, and deploying models via APIs (FastAPI, Triton Inference Server)

Nice To Haves

  • Familiarity with the "three pillars" (Logs, Metrics, Traces) and tools like Prometheus, Grafana, OpenTelemetry, or Jaeger
  • Experience with frameworks like AutoGen, CrewAI, or Ray RLlib
  • Deep experience with AWS/GCP/Azure and Kubernetes (K8s) orchestration
  • A background in control theory or causal inference

Responsibilities

  • Design and implement autonomous multi-agent systems using Reinforcement Learning (RL) loops that can interact with our infrastructure to perform safe, automated remediation actions
  • Build GenAI agents capable of digesting logs, traces, and metrics to provide "Human-in-the-loop" root cause analysis and conversational debugging for our SREs
  • Develop and deploy deep learning models (Transformers, LSTMs, etc.) for forecasting and anomaly detection on high-cardinality, high-volume time series data
  • Optimize inference pipelines to run with low latency on streaming telemetry data (Kafka/Flink), ensuring we catch issues the moment they happen
  • Own the lifecycle of your models—from feature engineering on petabyte-scale datasets to training, deployment, and monitoring in production Kubernetes environments
  • Collaborate with Applied Scientists to translate bleeding-edge research (e.g., causal inference, decision transformers) into production-hardened AIOps tools

Benefits

  • Bonus: Sales personnel are eligible for variable incentive pay dependent on their achievement of pre-established sales goals. Non-Sales roles are eligible for a company bonus plan, which is calculated as a percentage of eligible wages and dependent on company performance.
  • Stock: This role is eligible to receive Restricted Stock Units (RSUs).
  • Paid Time Off: earned time off, as well as paid company holidays based on region
  • Paid Parental Leave: take up to six months off with your child after birth, adoption or foster care placement
  • Full Health Benefits Plans: options for 100% employer paid and minimum employee contribution health plans from day one of employment
  • Retirement Plans: select retirement and pension programs with potential for employer contributions
  • Learning and Development: options for coaching, online courses and education reimbursements
  • Compassionate Care Leave: paid time off following the loss of a loved one and other life-changing events