Software Engineer, Infrastructure - Analytics Platform

OpenAI, San Francisco, CA
Hybrid

About The Position

About the Team

The Scaling team designs, builds, and operates critical infrastructure that enables research at OpenAI. Our mission is simple: accelerate the progress of research towards AGI. We do this by building core systems that researchers rely on - ranging from low-level infrastructure components to research-facing custom applications. These systems must scale with the increasing complexity and size of our workloads, while remaining reliable and easy to use.

About the Role

We’re looking for a staff-level software engineer to own production-critical infrastructure end to end. This role is centered on backend / systems engineering, with emphasis on low-level performance, distributed systems, and hands-on operation of critical services at scale. You’ll take ambiguous problems, turn them into concrete plans, ship pragmatic solutions quickly, and improve them through production feedback and iteration.

This is not a general Python backend role. We’re specifically looking for strong systems experience in Rust or C++, especially in performance-sensitive infrastructure.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

Requirements

  • A track record of owning operationally critical systems end to end and delivering outcomes in ambiguous environments.
  • Strong hands-on experience building performance-sensitive backend systems in Rust or C++.
  • Comfort working below typical service abstractions, including concurrency, async execution, memory behavior, serialization, I/O, networking, profiling, and failure analysis.
  • Experience designing, building, or operating distributed systems or distributed databases at meaningful scale.
  • Hands-on experience operating production-critical systems, including incidents, observability, rollout safety, and recurrence prevention.
  • Strong judgment in balancing engineering quality, speed, risk, and business impact.
  • A habit of shipping practical first versions and improving them through production feedback.

Nice To Haves

  • Experience with ClickHouse-like systems or infrastructure for analytics, telemetry, logging, search, ingestion, storage, or query execution.

Responsibilities

  • Own critical infrastructure across design, implementation, rollout, operation, and iteration.
  • Build and operate performant backend systems in Rust or C++ that support core research workflows.
  • Design and improve distributed data and serving systems, including tradeoffs around partitioning, replication, consistency, retries, backpressure, and failure isolation.
  • Debug real production bottlenecks across latency, throughput, contention, hot spots, and overload behavior.
  • Operate business-critical services through on-call, incidents, postmortems, observability, rollout safety, and zero-downtime migrations.
  • Improve reliability of services running on Kubernetes, including resource tuning and failure handling.
  • Partner closely with engineers and researchers to deliver fast, reliable, useful systems.
  • Raise the bar through strong technical judgment, ownership, and follow-through.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Education Level: No Education Listed
  • Number of Employees: 1-10 employees
