Senior AI Engineer

Zoom•San Jose, CA

50d•$209,000 - $275,400•Hybrid

About The Position

Immigration sponsorship is not available for this position.

Requirements

Requires a Bachelor's degree in Communications Engineering, Artificial Intelligence, Software Engineering, a related field, or a foreign degree equivalent.
Must have 2 years of experience in the job offered or a related occupation.
Must have 2 years of experience in:
Designing, implementing, or optimizing large-scale distributed training systems using technologies like Horovod, DeepSpeed, PyTorch Distributed, or Ray;
Tensor/model parallelism and pipeline parallelism;
Utilizing cloud-native or on-prem infrastructure (Kubernetes, Docker, Slurm) to support scalable, fault-tolerant, and resource-efficient AI workloads across multi-node GPU clusters;
Using performance profiling and optimization to diagnose and improve end-to-end training performance by optimizing data pipelines (e.g., DALI, tf.data), minimizing communication overhead (e.g., NCCL, gRPC), and tuning hardware-specific kernels (e.g., CUDA, Triton);
Systems-level programming and automation with Python, Bash, and C++ or Go;
Automating the deployment and orchestration of AI workloads, and monitoring with Prometheus, Grafana, and Weights & Biases.

Responsibilities

Develop the Machine Learning Platform management system.
Design and implement intuitive user interfaces and APIs for seamless interaction with the platform.
Ensure robust access control and security measures for the Machine Learning Platform.
Regularly evaluate and enhance platform performance, scalability, and reliability.
Integrate tools for data versioning, experiment tracking, and workflow orchestration.
Build the toolchains, services, and pipelines for the model development workflow and the model serving architecture.
Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.
Develop CI/CD pipelines for deploying models into production environments with minimal downtime.
Enable support for distributed model training and hyperparameter optimization.
Incorporate A/B testing frameworks for evaluating multiple model deployments.
Collaborate with data scientists and engineers to streamline the model development lifecycle.
Prioritize the metrics used to monitor model training and inference.
Implement logging and monitoring tools to track model performance, resource utilization, and throughput.
Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in real time.
Establish alerting mechanisms to detect and respond to anomalies or performance degradation.
Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.
Develop and maintain the high-performance GPU infrastructure and cluster for LLM training.
Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.
Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).
Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.
Regularly update cluster configurations to support new frameworks and model architectures.
Manage scheduling and resource allocation for multi-tenant GPU clusters.
Understand autoscaling for inference services and dynamic loading of multiple models.
Design systems that dynamically allocate resources based on real-time demand for inference services.
Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.
Implement strategies for caching frequently used models to improve inference performance.
Experiment with serverless architectures to further enhance scalability and cost efficiency.
Ensure compatibility with edge devices and deploy lightweight models for edge inference.
Support, troubleshoot, and resolve any issues that arise during training and inference.
Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.
Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.
Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.
Develop self-healing systems that can automatically recover from common training or inference issues.
Provide technical support and guidance to data scientists and engineers working on the platform.

Benefits

As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.
