Oracle, posted 3 months ago
$96,800 - $223,400/Yr
Mid Level
Nashville, TN
Professional, Scientific, and Technical Services

The role involves taking ownership of problems and working to identify solutions in a fast-paced environment. You will design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows. Collaboration with scientists and software/infrastructure engineers is essential to understand infrastructure requirements for training, testing, and deploying machine learning models. You will implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity. Additionally, you will optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques. Ensuring security and compliance standards are met throughout the AI/ML infrastructure stack, including data encryption, access control, and vulnerability management, is also a key responsibility. You will troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime. Staying updated on emerging technologies and best practices in AI/ML infrastructure and evaluating their potential impact on our systems and workflows is crucial. Documentation of infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability is expected.

  • Take ownership of problems and work to identify solutions.
  • Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
  • Collaborate with scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
  • Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
  • Optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques.
  • Ensure security and compliance standards are met throughout the AI/ML infrastructure stack, including data encryption, access control, and vulnerability management.
  • Troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime.
  • Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
  • Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.
  • Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes.
  • Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
  • Solid understanding of networking concepts, security principles, and best practices.
  • Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
  • Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
  • Strong documentation skills with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps.
  • Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization.
  • Strong proficiency in at least one programming language such as Python, Rust, Go, Java, or Scala.
  • Proven experience designing, implementing, and managing infrastructure for AI/ML or HPC workloads.
  • Understanding of machine learning frameworks and libraries such as TensorFlow, PyTorch, or scikit-learn and their deployment in production environments.
  • Familiarity with DevOps practices and tools for continuous integration, deployment, and monitoring (e.g., Jenkins, GitLab CI/CD, Prometheus).
  • Strong experience with High-Performance Computing systems.
  • Medical, dental, and vision insurance, including expert medical opinion.
  • Short-term and long-term disability.
  • Life insurance and AD&D.
  • Supplemental life insurance (Employee/Spouse/Child).
  • Health care and dependent care Flexible Spending Accounts.
  • Pre-tax commuter and parking benefits.
  • 401(k) Savings and Investment Plan with company match.
  • Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position.
  • 11 paid holidays.
  • Paid sick leave: 72 hours of paid sick leave upon date of hire.
  • Paid parental leave.
  • Adoption assistance.
  • Employee Stock Purchase Plan.
  • Financial planning and group legal.
  • Voluntary benefits including auto, homeowner, and pet insurance.