Senior AI Engineer

Zoom
San Jose, CA
$209,000 - $275,400
Hybrid

About The Position

Immigration sponsorship is not available for this position.

Requirements

  • Requires a Bachelor's degree in Communications Engineering, Artificial Intelligence, Software Engineering, a related field, or a foreign degree equivalent.
  • Must have 2 years of experience in the job offered or a related occupation.
  • Must have 2 years of experience in: designing, implementing, or optimizing large-scale distributed training systems using technologies like Horovod, DeepSpeed, PyTorch Distributed, or Ray;
  • Tensor/model parallelism and pipeline parallelism;
  • Utilizing cloud-native or on-prem infrastructure (Kubernetes, Docker, Slurm) to support scalable, fault-tolerant, and resource-efficient AI workloads across multi-node GPU clusters;
  • Using performance profiling and optimization to diagnose and improve end-to-end training performance by optimizing data pipelines (e.g., DALI, tf.data), minimizing communication overhead (e.g., NCCL, gRPC), and tuning hardware-specific kernels (e.g., CUDA, Triton);
  • Systems-level programming and automation with Python, Bash, and C++ or Go;
  • Automating deployment and orchestration of AI workloads and monitoring using Prometheus, Grafana, Weights & Biases.

Responsibilities

  • Develop the Machine Learning Platform management system.
  • Design and implement intuitive user interfaces and APIs for seamless interaction with the platform.
  • Ensure robust access control and security measures for the Machine Learning Platform.
  • Regularly evaluate and enhance platform performance, scalability, and reliability.
  • Integrate tools for data versioning, experiment tracking, and workflow orchestration.
  • Build the toolchains, services, and pipelines for the model development workflow and the model serving architecture.
  • Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.
  • Develop CI/CD pipelines for deploying models into production environments with minimal downtime.
  • Enable support for distributed model training and hyperparameter optimization.
  • Incorporate A/B testing frameworks for evaluating multiple model deployments.
  • Collaborate with data scientists and engineers to streamline the model development lifecycle.
  • Prioritize metrics for monitoring model training and inference.
  • Implement logging and monitoring tools to track model performance, resource utilization, and throughput.
  • Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in real time.
  • Establish alerting mechanisms to detect and respond to anomalies or performance degradation.
  • Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.
  • Develop and maintain the high-performance GPU infrastructure and cluster for LLM training.
  • Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.
  • Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).
  • Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.
  • Regularly update cluster configurations to support new frameworks and model architectures.
  • Manage scheduling and resource allocation for multi-tenant GPU clusters.
  • Understand autoscaling for inference services and dynamic loading of multiple models.
  • Design systems that dynamically allocate resources based on real-time demand for inference services.
  • Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.
  • Implement strategies for caching frequently used models to improve inference performance.
  • Experiment with serverless architectures to further enhance scalability and cost efficiency.
  • Ensure compatibility with edge devices and deploy lightweight models for edge inference.
  • Support, troubleshoot, and resolve any issues that arise during training and inference.
  • Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.
  • Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.
  • Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.
  • Develop self-healing systems that can automatically recover from common training or inference issues.
  • Provide technical support and guidance to data scientists and engineers working on the platform.

Benefits

  • As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.