Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything. About the Role We're hiring an engineer to ensure the reliability of our GPU supercomputing fleet, owning the seam between hardware, firmware, and operating system. You will track the long tail of hardware issues: We are conducting frontier research in AI and a single bad NIC, HBM or a kernel driver edge case can compromise an experiment. Your job is to diagnose these issues, track their root cause down to the hardware, and resolve them internally or directly with vendors so that our researchers can run at scale and with confidence. Note: This is an "evergreen role" that we keep open on an on-going basis to express interest. We receive many applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Still, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities open. You are welcome to reapply if you get more experience, but please avoid applying more than once every 6 months. You may also find that we put up postings for singular roles for separate, project or team specific needs. In those cases, you're welcome to apply directly in addition to an evergreen role.
Stand Out From the Crowd
Upload your resume and get instant feedback on how well it matches this job.
Job Type
Full-time
Career Level
Senior
Education Level
Associate degree