Roblox is reimagining the way people come together to connect, create, and express themselves. To support our massive scale, we are powered by microservices. The Application Networking team connects and secures these services, building ingress gateways that route traffic from the edge to core data centers, and managing the service mesh.

We are building a specialized high-performance compute cluster for Machine Learning inference and training. You will join our Kubernetes networking squad to support critical production workloads, and help us scale, harden, and operationalize the infrastructure that powers the next generation of AI on Roblox.

You will:
Take ownership of critical networking components, transitioning them from proof-of-concept to production-ready systems capable of handling massive ML throughput.
Build the observability, alerting, and tooling required to maintain 99.99% reliability for our ML clusters.
Deep dive into Cilium and eBPF to squeeze maximum performance out of the network, reducing latency for training jobs.
Collaborate with our technical leads to implement complex networking features, ensuring high code quality and robust design.
Serve as a primary escalation point for complex network issues in the Kubernetes layer, troubleshooting across the stack (kernel, CNI, service mesh).
Act as a senior voice on the team, mentoring junior engineers and promoting best practices in testing, deployment, and reliability engineering.

You have:
5+ years of professional experience, with a strong focus on running Kubernetes in production environments. You know what breaks at scale and how to prevent it.
Deep knowledge of the Kubernetes networking model (CNI, Services, Ingress, kube-proxy) and experience managing clusters with significant traffic.
Familiarity with Cilium and eBPF, and comfort working with Linux networking fundamentals (TCP/IP, routing tables, iptables/nftables).
The ability to take a high-level architecture strategy and turn it into working, tested, and deployable code.
Fluency in C/C++, Go, or Rust.
Experience with on-call rotations and a "reliability-first" approach to infrastructure.
Job Type
Full-time
Career Level
Mid Level
Education Level
No Education Listed
Number of Employees
1,001-5,000 employees