This role involves designing, implementing, and maintaining scalable and robust infrastructure for AI/ML model training and inference. The DevOps System Administrator will develop and manage CI/CD pipelines for the automated building, testing, and deployment of AI applications and machine learning models. Responsibilities include administering and optimizing Linux-based systems and virtualized environments; managing containerization and orchestration platforms (e.g., Docker, Kubernetes) to deploy and scale ML services; and automating infrastructure provisioning, configuration management, and deployment processes using Infrastructure as Code (IaC) tools such as Ansible or Terraform. The role also involves allocating and managing GPU resources efficiently for model training and other high-performance computing tasks; implementing and maintaining monitoring, logging, and alerting systems; and collaborating with development teams to support their infrastructure needs and troubleshoot issues.
Job Type
Full-time
Career Level
Senior