Compute Infrastructure Engineer, AI and Advanced Computing Institute

Schmidt Sciences
Washington D.C., WA
$150,000 - $170,000

About The Position

Schmidt Sciences is a nonprofit organization founded in 2024 by Eric and Wendy Schmidt that works to accelerate scientific knowledge and breakthroughs with the most promising, advanced tools to support a thriving planet. The organization prioritizes research in areas poised for impact, including AI and advanced computing, astrophysics, biosciences, climate, and space, and supports researchers in a variety of disciplines through its science systems program.

About the AI & Advanced Computing Institute (“AI Institute”)

The AI Institute at Schmidt Sciences is a grantmaking and research group that views AI as a transformative force for scientific discovery and societal progress. Over the next decade, we aim to support key researchers who are working to make AI systems competent, trustworthy, reliable, and able to effectively partner with human scientists on the next generation of discovery. We will also make distinctive investments in beneficial AI areas where philanthropy has a unique advantage. By supporting enabling infrastructure, foundational research, and targeted programs in science disciplines, the AI and Advanced Computing Institute will create the conditions for AI-enabled discovery to achieve its promise.

The AI Institute currently has three focus areas:

  • AI for Science: using AI to improve how scientists generate hypotheses, conduct experiments, analyze data, and produce new knowledge, in a way that specifically accelerates the discovery process. Exploratory programs include post-transistor hardware for AI and AI for scientific simulation.
  • Science of AI: understanding and controlling AI systems to mitigate and manage potential risks from advanced AI, and improving AI reliability and performance in areas of limited commercial interest. Existing programs include AI2050 and Science of Trustworthy AI. Exploratory programs include AI interpretability and evolution of multi-agent communication.
  • Beneficial AI: providing scientific foundations and datasets for understanding the larger human impacts of AI. This includes selected high-impact grantmaking, such as our programs to use AI to accelerate humanities research (Humanities and AI Virtual Institute) and to quantify the impact of AI on the labor market (AI@Work).

These programs support and draw on related Schmidt Sciences engagements, such as the Virtual Institute for Scientific Software (VISS), which seeks to accelerate the pace of scientific discovery through the support and development of high-quality, community-oriented scientific software. These philanthropic efforts support experienced engineers who are tasked with building open-source infrastructure and applications for multi-disciplinary, high-performance, at-scale scientific discovery.

The Role

Reporting to the Compute Program Scientist in the AI Institute, the Infrastructure Engineer will provide technical leadership and apply computational infrastructure management expertise across AI Institute and Schmidt Sciences multidisciplinary efforts. The initial set of projects will focus on cutting-edge developments in AI, and a successful candidate should have a work portfolio that reflects specific contributions to the deployment and management of appropriately sized heterogeneous compute clusters in research or commercial environments. Success in this role is defined by adherence to the industry’s best DevSecOps practices at scale and the ability to quickly address computing needs from multiple research teams. This role requires up to 50% domestic travel.

Example activities for this role:

  • Implementing multiple authentication and authorization schemes to accommodate a diverse set of cluster users and applications.
  • Managing a mixed-type network storage system with different access models and hardware performance characteristics.
  • Designing, implementing, deploying, and maintaining a Continuous Integration/Continuous Delivery (CI/CD) system to address the needs of geographically distributed developers and applications.
  • Integrating multiple types of workflow orchestration systems in combination with software dependency package and module management frameworks.

Requirements

  • A Bachelor’s degree from an accredited institution, with a focus on Computer Science, Information Technology, or a related field.
  • 5+ years of professional experience managing production-grade compute clusters.
  • Proficiency with code-management and infrastructure-provisioning tools and best practices.
  • Hands-on experience with workload management using Slurm and Kubernetes.
  • Proficiency with modern machine-learning hosting software frameworks, such as NVIDIA Dynamo, TensorFlow Serving, and Ray.
  • Proficiency in building, deploying, and troubleshooting containerized Linux workloads, including GPU-accelerated configurations.
  • In-depth knowledge of data center networking technologies and solutions.
  • Understanding of the tech stack needed to design, train, deploy, and maintain state-of-the-art AI models at a production scale.
  • Experience producing technical writing for expert and general audiences.
  • Good track record of collaborative impact in high-intensity, team-based environments.
  • Sense of controlled urgency in driving work to completion.
  • The highest integrity and ability to maintain confidentiality.
  • Ability to travel within the U.S. and internationally on a regular basis as needed.

Nice To Haves

  • Expert-level experience and industry credentials in the software and hardware frameworks that drive modern AI, and competence in at least one, and preferably multiple, fields of science impacted by modern AI.
  • Prior leadership of data center infrastructure initiatives and projects, such as evaluating hardware scalability, securing data, or executing large-scale upgrades.
  • Expertise in relevant technical focus areas, e.g., AI model performance monitoring or network and storage optimization.
  • Ability to work with and effectively translate technical concepts across multiple scientific disciplines.
  • Ability to critically evaluate scientific and technical publications and emerging methods in related disciplines.
  • Experience working with science-focused institutions such as philanthropic organizations or academic/government research institutions.

Responsibilities

  • Continually identify, evaluate, and deploy open source and proprietary technologies that meet the combined infrastructure requirements from an evolving list of projects.
  • Implement performant solutions that meet industry compliance and security standards while enabling rapid development workflows.
  • Collaborate with hardware and software vendors on methods and configurations that maximize system resource utilization.
  • Assist existing programs by providing infrastructure management advice, while working closely and collaboratively with multiple subject matter expert teams.
  • Work with other members of the Schmidt Sciences technical team to implement new deployment strategies and application hosting capabilities in support of diverse applications and user audiences.
  • Track industry trends in hardware and software tooling that simplifies infrastructure management while lowering the cost of deploying and supporting multi-tenant research applications.
  • Participate in relevant industry events and forums, representing Schmidt Sciences’ presence on AI and advanced computing issues.