Immigration sponsorship is not available for this position.
Responsibilities:
• Develop the Machine Learning Platform management system.
• Design and implement intuitive user interfaces and APIs for seamless interaction with the platform.
• Ensure robust access control and security measures for the Machine Learning Platform.
• Regularly evaluate and enhance platform performance, scalability, and reliability.
• Integrate tools for data versioning, experiment tracking, and workflow orchestration.
• Build the toolchains, services, and pipelines for the model development workflow and the model serving architecture.
• Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.
• Develop CI/CD pipelines for deploying models into production environments with minimal downtime.
• Enable support for distributed model training and hyperparameter optimization.
• Incorporate A/B testing frameworks for evaluating multiple model deployments.
• Collaborate with data scientists and engineers to streamline the model development lifecycle.
• Prioritize metrics for monitoring model training and inference.
• Implement logging and monitoring tools to track model performance, resource utilization, and throughput.
• Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in real time.
• Establish alerting mechanisms to detect and respond to anomalies or performance degradation.
• Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.
• Develop and maintain the high-performance GPU infrastructure and clusters used for LLM training.
• Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.
• Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).
• Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.
• Regularly update cluster configurations to support new frameworks and model architectures.
• Manage scheduling and resource allocation for multi-tenant GPU clusters.
• Understand autoscaling for inference services and dynamic loading of multiple models.
• Design systems that dynamically allocate resources based on real-time demand for inference services.
• Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.
• Implement strategies for caching frequently used models to improve inference performance.
• Experiment with serverless architectures to further enhance scalability and cost efficiency.
• Ensure compatibility with edge devices and deploy lightweight models for edge inference.
• Support, troubleshoot, and resolve issues during training and inference.
• Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.
• Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.
• Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.
• Develop self-healing systems that can automatically recover from common training or inference issues.
• Provide technical support and guidance to data scientists and engineers working on the platform.
Job Type
Full-time
Career Level
Senior