Distributed Systems Engineer

Krea · San Francisco, CA

About The Position

At Krea, we are building next-generation AI creative tools. We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it. We believe AI is a new medium that allows us to express ourselves through various formats: text, images, video, sound, and even 3D. We're building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real-time user experiences, and large-scale model deployments. As a Distributed Systems Engineer, you will take on the responsibilities listed below.

Requirements

  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS/GCP/Azure)
  • High-performance and fault-tolerant networking
  • Low-level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages

Nice To Haves

  • Infrastructure as Code (e.g. Terraform)

Responsibilities

  • Design, build, and maintain large-scale distributed infrastructure to reliably support AI research and real-time model serving.
  • Own and scale our multi-thousand-node Kubernetes GPU clusters, ensuring efficient and fault-tolerant operations.
  • Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
  • Improve network architecture, optimize load balancing, and streamline operational practices across multi-zone cloud deployments.