CoreWeave • Posted 15 days ago
$139,000 - $204,000/Yr
Full-time • Mid Level
Hybrid • Livingston, NJ
501-1,000 employees
Professional, Scientific, and Technical Services

The Production Engineering team sits at the heart of CoreWeave's reliability efforts. In this role, you'll partner closely with our Support/CX teams to build, operate, and evolve internal tooling that enables a "Direct‑to‑Expert" support model at scale. You'll define and ship AI‑assisted workflows, self‑service diagnostics, and platform integrations that reduce time‑to‑resolution and improve customer experience across our cloud.

  • Design, build, and own support-facing tools for case triage, intelligent routing, and expert engagement, integrating with incident and change management workflows.
  • Develop AI‑powered assistants and automations that accelerate root‑cause discovery, knowledge retrieval, and resolution quality.
  • Create and maintain dashboards, alerts, and signals that surface tooling issues early; integrate observability into new tooling to reduce MTTR.
  • Build self-service and guided diagnostics that empower Support/CX to resolve common issues and collect high‑quality context for escalations.
  • Codify reliability and support practices into services, APIs, and Kubernetes-native controllers/operators where appropriate.
  • Partner with engineering leadership and internal stakeholders to prioritize roadmap initiatives, land adoption, and measure business impact.
  • Participate in an on‑call rotation for the tooling you own.
  • 4+ years of software or infrastructure engineering experience building and operating production services.
  • Proficiency in Go or Python (or equivalent experience).
  • Strong fundamentals in Linux, containers, and Kubernetes; comfortable debugging in distributed systems.
  • Experience with observability (metrics/logs/traces) and using data to improve reliability and support outcomes.
  • Demonstrated experience with incident management and steady‑state operational excellence (e.g., progressive delivery, testing strategies, error budgets, fault‑tolerant design).
  • Comfort collaborating with multiple stakeholders (Support/CX, Product, SRE, and service owners).
  • Experience integrating or building support/operations tooling (e.g., ticketing/incident systems, status page, knowledge management, chat/alerting integrations).
  • Experience automating manual workflows and stitching together productivity platforms.
  • Familiarity with AI/ML tooling for retrieval, summarization, or copilot‑style assistance.
  • Experience codifying operational practices into Kubernetes controllers, operators, or platform services.
  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short- and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to participate in the Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption