AI Infrastructure & Experience Engineer

FocusKPI Inc.•Mountain View, CA

2d•$70 - $79•Onsite

About The Position

FocusKPI is seeking an AI Infrastructure & Experience Engineer to join one of our clients, a high-tech SaaS company. This is a 4-month contract role based in Mountain View, CA, onsite 5 days per week. The role involves deploying and optimizing LLMs and generative multimodal models, applying deep knowledge of CUDA to custom kernel development, and integrating inference backends with orchestration layers and frontends. The engineer will also be responsible for rapidly prototyping AI-driven features and implementing communication protocols that connect local AI compute with peripheral devices.

Requirements

Recent experience in model optimization is required.
Proven experience with NVIDIA ecosystems and ARM64 architecture.
Advanced proficiency in C++, Python, and Rust.
Deep familiarity with CUDA and the ability to author/debug custom CUDA kernels for compute-intensive tasks.
Extensive experience with modern inference engines (llama.cpp, TensorRT-LLM, Ollama) and orchestration frameworks (LiteLLM).
Robust understanding of asynchronous programming (FastAPI), containerization (Docker/Kubernetes), sandbox environments, and API design for low-latency communication.
Ability to quickly spin up modern frontend UIs (React, Next.js, or similar) to present AI-driven intelligence to end users.
Familiarity with WebSockets, gRPC, and REST for device-to-device communication in a local network environment.
Recent model optimization experience.
Inference optimization.
NVIDIA ecosystems.
Custom CUDA Kernel Development.
ARM64 architecture.
Python.

Nice To Haves

A minimum of 3 years of relevant industry experience is required.
The "Builder" Mindset: Energized by building proofs-of-concept in days rather than months; thrives in environments where speed and creativity are paramount.
Problem Solver: Approaches unsolved, messy engineering challenges with enthusiasm.
Architectural Vision: Sees the "big picture" of how AI becomes part of consumers' daily lives.
Agile & Adaptable: Comfortable working in a fast-paced environment where priorities shift based on rapid experimentation.
A degree in Computer Science, Machine Learning, or an Artificial Intelligence specialization is preferred but not required.

Responsibilities

Deploy and tune multiple LLMs and generative multimodal models on local inference hardware.
Optimize performance metrics (TTFT, tokens/sec) via model quantization, caching strategies, and architecture-specific adjustments.
Apply deep knowledge of the CUDA environment to build custom kernels that maximize utilization of low-cost GPU compute.
Integrate inference backends with orchestration layers (LiteLLM, Ollama, etc.) and frontends such as Open WebUI.
Build functional, high-fidelity demos showcasing model memory capabilities, agentic workflows, and context-aware web search.
Implement communication protocols to bridge local AI compute with peripheral devices, including smart TVs, household appliances, and XR hardware.
