The Scaling team designs, builds, and operates critical infrastructure that enables research at OpenAI. Our mission is simple: accelerate the progress of research towards AGI. We do this by building core systems that researchers rely on - ranging from low-level infrastructure components to research-facing custom applications. These systems must scale with the increasing complexity and size of our workloads, while remaining reliable and easy to use. We're looking for an experienced Site Reliability Engineer to own production-critical infrastructure end to end. This role is centered on data-heavy, low-latency workloads, with emphasis on operating large-scale ClickHouse clusters, high-throughput Kafka pipelines, and reliable integrations with Snowflake. You'll turn ambiguous operational problems into clear plans, ship pragmatic solutions quickly, and improve them through production feedback and iteration. We are specifically looking for someone who can independently define and raise operational standards across teams while remaining deeply hands-on in production systems.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Senior
Education Level
No Education Listed
Number of Employees
1-10 employees