Nvidia · Posted about 1 year ago
$180,000 - $339,250/Yr
Full-time • Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

As a System Software Engineer specializing in LLM Inference and Performance Optimization, you will play a crucial role in advancing AI technologies. This position focuses on optimizing large language models for real-time performance across various hardware platforms, contributing to innovative solutions that shape the future of technology.

Responsibilities

  • Design, implement, and optimize inference logic for fine-tuned LLMs, collaborating closely with Machine Learning Engineers.
  • Develop efficient, low-latency glue logic and inference pipelines that are scalable across various hardware platforms, ensuring outstanding performance and minimal resource usage.
  • Leverage hardware accelerators such as GPUs and other specialized silicon to increase inference speed and deploy models effectively in real-world applications.
  • Collaborate with cross-functional teams to integrate models seamlessly into diverse environments, adhering to strict functional and performance requirements.
  • Conduct detailed performance analysis and optimization for specific hardware platforms, focusing on efficiency, latency, and power consumption.

Requirements

  • Expert proficiency in C++ (8+ years), with a deep understanding of memory management, concurrency, and low-level optimizations.
  • M.S. or higher degree (or equivalent experience) in Computer Science/Engineering or a related field.
  • Strong experience in system-level software engineering, including multi-threading, data parallelism, and performance tuning.
  • Proven expertise in LLM inference, with experience in model serving frameworks such as ONNX Runtime and TensorRT.
  • Familiarity with real-time systems and performance-tuning techniques, especially for machine learning inference pipelines.
  • Ability to work collaboratively with Machine Learning Engineers and cross-functional teams to align system-level optimizations with model goals.
  • Extensive understanding of hardware architectures and the ability to leverage specialized hardware for optimized ML model inference.
  • Experience with deep learning hardware accelerators, such as Nvidia GPUs.
  • Familiarity with ONNX, TensorRT, or cuDNN for LLM inference on GPU.
  • Experience with low-latency optimizations and real-time system constraints for ML inference.

Benefits

  • Equity options
  • Comprehensive health insurance
  • Retirement savings plan
  • Paid time off
  • Flexible work hours
  • Professional development opportunities