AI Optimization Engineer - ONSITE

Simple Solutions
Jersey City, NJ
Onsite

About The Position

The AI Optimization Engineer supports large-scale AI/ML and Generative AI workloads in an enterprise setting. The role centers on optimizing, deploying, and managing machine learning models and large language models (LLMs) on GPU-accelerated High-Performance Computing (HPC) infrastructure.

The engineer will collaborate closely with AI, infrastructure, and DevOps teams to build efficient pipelines for model training and inference, implement SLURM-based workload orchestration, and deploy containerized ML solutions to production. Key responsibilities include improving model performance through pruning, quantization, and knowledge distillation; running inference workflows on Triton Inference Server; and monitoring system performance with Prometheus and Grafana.

The successful candidate will have significant experience with Python-based machine learning, deep learning frameworks, model optimization techniques, and building scalable AI infrastructure, along with hands-on experience in HPC environments, GPU clusters, containerization technologies, and Linux system administration, and a solid grounding in machine learning algorithms, deep learning architectures, and contemporary AI development tools. Prior experience with cloud platforms, vector embeddings, and enterprise-scale AI deployments is a strong advantage.

Requirements

  • Strong experience in Python-based machine learning and deep learning, including NumPy, scikit-learn, TensorFlow, PyTorch, and Keras.
  • Hands-on knowledge of supervised and unsupervised learning, neural networks, transformer-based models, NLP, CNNs, and Generative AI concepts.
  • Expertise in AI infrastructure and optimization, including HPC environments, GPU clusters, SLURM workload management, Triton Inference Server, TensorRT-LLM, and model optimization techniques such as pruning, quantization, and distillation for scalable LLM deployment.
  • Experience with DevOps and deployment tools such as Docker, Kubernetes, MLflow, Terraform, Jenkins, GitHub, and Hugging Face.
  • Strong skills in performance monitoring using Prometheus and Grafana.
  • Experience developing APIs with Flask.
  • Experience with Linux system administration (RHEL/CentOS).
  • Experience with container runtimes like Enroot, Pyxis, and Podman.
  • Experience with data analysis and visualization tools such as Plotly, Seaborn, and Matplotlib.

Nice To Haves

  • Experience with cloud platforms
  • Experience with vector embeddings
  • Experience with enterprise-scale AI deployments

Responsibilities

  • Design and optimize AI/ML workloads on GPU-based HPC clusters.
  • Deploy and manage large language models (LLMs) in scalable production environments.
  • Implement model optimization techniques including pruning, quantization, and knowledge distillation.
  • Develop and manage automated job scheduling using SLURM with REST and Flask APIs.
  • Deploy ML models using containerized microservices architectures.
  • Monitor system performance using Prometheus and Grafana.
  • Optimize inference pipelines using Triton Inference Server and TensorRT-LLM.
  • Conduct exploratory data analysis and model performance evaluation.
  • Collaborate with infrastructure and ML teams to improve scalability and efficiency.