FluidStack - posted 3 months ago
Full-time • Mid Level
11-50 employees

We're looking for a Product Manager to lead Lighthouse, our MLOps and observability platform. You'll own the complete product lifecycle, from strategy and roadmap to execution and customer success. You will work directly with our engineering and infrastructure teams and collaborate closely with customers to ensure that we give ML developers the metrics that matter. You will also partner with top-tier AI labs to increase their utilization and performance, and help scale our infrastructure to hundreds of thousands of GPUs.

  • Build and execute the roadmap for Lighthouse.
  • Partner with engineering to translate customer requirements into technical specifications and guide implementation.
  • Create alerting rules for GPU cluster health, job failures, and resource bottlenecks.
  • Design dashboards for ML-specific KPIs (training loss curves, inference latency, batch processing metrics).
  • Collaborate with sales and customer success teams to drive adoption, gather feedback, and ensure customer satisfaction.
  • Engage directly with AI labs and enterprises to understand their observability challenges and shape the product roadmap accordingly.
  • 3-5+ years of experience building developer tools or cloud infrastructure, ideally in the observability space.
  • Deep experience with the LGTM stack (Loki, Grafana, Tempo, Mimir), Alertmanager, or proprietary observability tools such as Datadog.
  • Understanding of the metrics that matter to an AI/ML customer, including infrastructure availability, performance, and utilization, as well as application-level metrics such as MFU (Model FLOPs Utilization).
  • Understanding of GPU monitoring tools (DCGM, nvidia-smi, GPU exporters for Prometheus).
  • Knowledge of Infrastructure-as-Code (IaC) tools (e.g. Terraform, Pulumi) to standardize and simplify the deployment of the observability stack.
  • Comfortable writing SQL queries.
  • Understanding of SLA and SLO frameworks and error budget management.
  • Experience with ML-specific monitoring tools (Weights & Biases, ClearML, etc.).
  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.