The role involves taking ownership of problems and driving them to resolution in a fast-paced environment. Key responsibilities include:

- Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions that support AI/ML workflows.
- Collaborate with scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
- Implement automation for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and improve productivity (a monitoring sketch follows this list).
- Optimize infrastructure performance by tuning system parameters, improving resource utilization, and applying caching and data pre-processing techniques (see the caching sketch below).
- Ensure security and compliance standards are met across the AI/ML infrastructure stack, including data encryption, access control, and vulnerability management.
- Troubleshoot infrastructure performance, scalability, and reliability issues, and implement solutions that mitigate risk and minimize downtime.
- Stay current with emerging technologies and best practices in AI/ML infrastructure, and evaluate their potential impact on existing systems and workflows.
- Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.
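To make the automation and monitoring responsibility concrete, here is a minimal sketch of the kind of script such work might involve: it polls GPU utilization via nvidia-smi and logs a warning when GPUs sit idle. The poll interval, threshold, and logging setup are illustrative assumptions, not requirements of the role.

```python
"""Minimal GPU-utilization monitor: polls nvidia-smi and warns when GPUs are idle.

Assumptions (illustrative only): an NVIDIA host with nvidia-smi on PATH;
the poll interval and utilization threshold are placeholder values.
"""
import logging
import subprocess
import time

POLL_SECONDS = 60          # hypothetical poll interval
UTIL_THRESHOLD_PCT = 20    # hypothetical "underutilized" threshold

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")


def gpu_utilization() -> list[int]:
    """Return per-GPU utilization percentages reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line.strip()) for line in out.splitlines() if line.strip()]


def main() -> None:
    while True:
        utils = gpu_utilization()
        avg = sum(utils) / len(utils) if utils else 0
        if avg < UTIL_THRESHOLD_PCT:
            logging.warning("GPUs underutilized: avg %.0f%% (per-GPU: %s)", avg, utils)
        else:
            logging.info("GPU utilization healthy: avg %.0f%%", avg)
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

In practice, a check like this would feed an existing monitoring or alerting stack rather than run standalone; the sketch only shows the shape of the automation.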
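Similarly, for the performance-optimization responsibility, the sketch below shows one common caching technique: keying pre-processed training data on a content hash so repeated runs skip redundant pre-processing. The cache location, pickle format, and the notion that pre-processing depends only on the raw file's bytes are assumptions made for illustration.

```python
"""Sketch of a disk cache for pre-processed training data.

Assumptions (illustrative only): pre-processing is a pure function of the raw
file's contents; the cache directory and pickle serialization are placeholders.
"""
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("/tmp/preprocess_cache")  # hypothetical cache location


def cached_preprocess(raw_path: Path, preprocess):
    """Return preprocess(raw_path), reusing a cached result keyed on file contents."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(raw_path.read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.pkl"
    if cache_file.exists():                       # cache hit: skip re-processing
        return pickle.loads(cache_file.read_bytes())
    result = preprocess(raw_path)                 # cache miss: compute and store
    cache_file.write_bytes(pickle.dumps(result))
    return result


if __name__ == "__main__":
    # Example usage with a trivial "pre-processing" step (counting rows).
    sample = Path("data.csv")                     # hypothetical input file
    if sample.exists():
        rows = cached_preprocess(sample, lambda p: len(p.read_text().splitlines()))
        print(f"{sample} has {rows} rows (served from cache on repeat runs)")
```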