Staff Software Engineer, SDK

Weights & Biases · Livingston, NJ

Hybrid

About The Position

CoreWeave, the AI Hyperscaler™, acquired Weights & Biases to create the most powerful end-to-end platform to develop, deploy, and iterate AI faster. Since 2017, CoreWeave has operated a growing footprint of data centers covering every region of the US and across Europe, and was ranked as one of the TIME100 most influential companies of 2024.

By bringing together CoreWeave’s industry-leading cloud infrastructure with the best-in-class tools AI practitioners know and love from Weights & Biases, we’re setting a new standard for how AI is built, trained, and scaled. The integration of our teams and technologies is accelerating our shared mission: to empower developers with the tools and infrastructure they need to push the boundaries of what AI can do. From experiment tracking and model optimization to high-performance training clusters, agent building, and inference at scale, we’re combining forces to serve the full AI lifecycle, all in one seamless platform.

Weights & Biases has long been trusted by over 1,500 organizations (including AstraZeneca, Canva, Cohere, OpenAI, Meta, Snowflake, Square, Toyota, and Wayve) to build better models, AI agents, and applications. Now, as part of CoreWeave, that impact is amplified across a broader ecosystem of AI innovators, researchers, and enterprises.

As we unite under one vision, we’re looking for bold thinkers and agile builders who are excited to shape the future of AI alongside us. If you're passionate about solving complex problems at the intersection of software, hardware, and AI, there's never been a more exciting time to join our team.

Requirements

7+ years of professional software engineering experience (Staff candidates: typically 10+), with meaningful time spent building and operating production systems in Python and/or Go.
A strong track record of designing and evolving non-trivial systems with real performance, reliability, and user impact.
Comfort working across boundaries between developer tools, runtimes, networked services, and data systems.
Experience building reliable systems or pipelines, with good instincts around correctness, resilience, observability, and operational simplicity.
An end-to-end ownership mindset: you can scope problems, make sound tradeoffs, ship, measure, and iterate.
Comfort working in an open, user-facing environment where feedback is direct and quality matters.

Nice To Haves

Hands-on experience with ML workflows, model training, or the tooling researchers use day to day.
Experience with distributed systems, performance-sensitive infrastructure, or large-scale training environments.
Experience with systems instrumentation, observability, or infrastructure that has to work well outside the happy path.
Experience building CLIs, local-first developer tools, or terminal-based workflows.
Prior work on open-source libraries, SDKs, or developer platforms with broad external adoption.
User-focused: You care deeply about users and want to talk to real ML researchers and platform engineers, hear about their pain, and go fix it.
Autonomous: You work well in a self-directed environment, proactively find ways to improve things, and can drive a project across team boundaries without needing to be told.
Curious and driven: You're interested in how ML practitioners actually work — the frameworks, the hardware, the failure modes — and you want to keep learning.
Pragmatic: You know when to build the right thing and when to ship the thing that unblocks a customer this week. You can hold both in your head at once.

Responsibilities

Own core parts of the open-source W&B SDK and the systems that carry experiment data from a user's training script into the W&B platform.
Develop the SDK, CLI, and surrounding developer workflows that make W&B intuitive, dependable, and fast in real-world ML development.
Build performance-sensitive runtime systems that capture, buffer, and move data reliably across local, cloud, and large-scale training environments.
Manage the experiment data path end-to-end, with an emphasis on correctness, resilience, and operability.
Create tooling and observability that help the team understand product usage, diagnose issues quickly, and improve the user experience with confidence.
Set technical direction, partner closely with backend, platform, and product teams, talk directly with users, and mentor other engineers.