FAR.AI · Posted 3 months ago
$100,000 - $175,000/Yr
Full-time • Entry Level
Berkeley, CA
11-50 employees

We’re seeking an Infrastructure Engineer to develop and manage scalable infrastructure that supports our research workloads. You will own our existing Kubernetes cluster, deployed on top of bare-metal H100 cloud instances, and oversee and enhance it to (1) support new workloads, such as multi-node LoRA training; (2) onboard new users, as we double the size of our research team over the next twelve to eighteen months; and (3) add new features, such as fine-grained tracking of per-experiment compute usage. You will be the point person for cluster-related work.

You will work on the Foundations team alongside experienced engineers, including those who designed and built the cluster, who can provide guidance and backup. However, as our first dedicated infrastructure hire, you will need to work autonomously, design solutions to varied and complex problems, and communicate with researchers who are technically skilled but less familiar with our cluster and infrastructure.

This is an opportunity to build the technical foundations of the largest independent AI safety research institute, with one of the most varied research agendas. You will work directly with both the Foundations team and researchers across the organization to enable bleeding-edge research workloads across our research portfolio.
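For a concrete sense of what running research workloads on a cluster like this involves, here is a minimal, hypothetical sketch of submitting a single-node GPU fine-tuning Job via the Kubernetes Python client. The namespace, image, script, labels, and resource counts are illustrative assumptions, not details of FAR.AI's actual setup; a true multi-node training run would additionally need cross-pod coordination (for example an indexed Job or a training operator).

```python
# Illustrative only: submit a GPU fine-tuning Job to a Kubernetes cluster.
# All names (namespace, image, script, labels) are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig with access to the cluster

container = client.V1Container(
    name="lora-train",
    image="ghcr.io/example/lora-trainer:latest",      # hypothetical image
    command=["python", "train_lora.py", "--config", "configs/demo.yaml"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8", "cpu": "32", "memory": "256Gi"}
    ),
)

template = client.V1PodTemplateSpec(
    # Labels like these are what fine-grained per-experiment usage tracking
    # could key off.
    metadata=client.V1ObjectMeta(labels={"team": "research", "experiment": "lora-demo"}),
    spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="lora-demo"),
    spec=client.V1JobSpec(template=template, backoff_limit=0),
)

client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```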

You will:
  • Deliver a scalable and easy-to-use compute cluster to support impactful research.
  • Empower the research team to solve their own day-to-day compute problems, such as debugging simple issues and streamlining recurring tasks.
  • Maintain and develop in-cluster services, such as backups, experiment tracking, and our in-house LLM-based cluster support bot.
  • Maintain adequate cluster stability to avoid interfering with research workloads.
  • Maintain situational awareness of the cloud GPU market and assist leadership with vendor comparisons.
  • Implement measures to secure the cluster against insider and external threats.
  • Make secure workflows the default.
  • Champion security across the FAR.AI team, including maintaining and extending our mobile device management (MDM) system.
  • Work with the Foundations team and specific research teams to support novel ML workloads.
  • Architect our Kubernetes cluster to flexibly support novel workloads.
  • Improve observability over cluster resources and GPU utilization (see the illustrative sketch after this list).
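As a loose illustration of the observability item above (not FAR.AI's actual tooling), a per-node exporter might poll NVML for GPU utilization roughly like this; the pynvml bindings, the 15-second interval, and the print-based output are assumptions for the sketch, and in practice the readings would feed a metrics system rather than stdout.

```python
# Rough sketch: poll per-GPU utilization and memory via NVML and print it.
import time
import pynvml  # NVIDIA's NVML Python bindings (nvidia-ml-py)

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print(
                f"gpu{i} util={util.gpu}% "
                f"mem={mem.used / 2**30:.1f}GiB/{mem.total / 2**30:.1f}GiB"
            )
        time.sleep(15)
finally:
    pynvml.nvmlShutdown()
```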
You might be a good fit if you:
  • Have Kubernetes or other system administration experience.
  • Have the curiosity and willingness to rapidly learn the needs of a new space.
  • Are self-directed and comfortable with ambiguous or rapidly evolving requirements.
  • Are willing to be on-call during waking hours for cluster issues ahead of major deadlines.
  • Are interested in improving our security posture by identifying, implementing, and administering security policies.
It's a plus if you:
  • Have experience supporting ML/AI workloads.
  • Have previously worked in research environments or startups.
  • Are experienced in administering compute or GPU clusters.
  • Are able to adopt a security mindset.
  • Are willing to be part of an eventual on-call rotation, if required.
Compensation and benefits:
  • Compensation: $100,000-$175,000/year depending on experience and location.
  • We will pay for work-related travel and equipment expenses.
  • Catered lunch and dinner at our offices in Berkeley.