Inference Optimization Engineer (local / edge runtime)

Intel
Hillsboro, CA
$170,500 - $315,490
Hybrid

About The Position

Our Mission

At Intel, our journey is to transform AI into something safer, more trustworthy, and respectful of human privacy by design. We believe transformative AI should have a positive impact on people: powerful in capability, yet honest about its limits and protective of the data and resources it touches. To get there, we build agentic AI that combines the best of local and cloud intelligence, making it private, affordable, and sustainable by design. Small, efficient models run directly on the user's machine (AI PC, edge, on-prem, and beyond), keeping data private and token costs low, while powerful cloud models handle the hardest work: planning, reasoning, and complex problem-solving. Today, neither approach can deliver this alone. Together, they give people real capability without compromise: data stays private, spend stays predictable, and energy use stays in check. We're building intelligence that scales without sacrificing trust, cost, or the planet, because the future of AI should belong to the people it serves.

Role Summary

Make models fast on the hardware people actually own. You optimize inference engines (llama.cpp, vLLM) for constrained local and edge environments (GPUs and iGPUs, Vulkan backends): mostly PC and edge hardware, not datacenter H100 environments. KV cache, batching, quantization, scheduling, and CPU-overhead reduction are your daily tools. This is the rare skill that makes a hybrid, low-cost agent product viable.

Requirements

  • BS/MS in CS, EE, Math or related STEM field
  • 5+ years of software development experience
  • Strong in C++ and/or Python; comfortable reading systems-level code
  • Understands how LLM inference works (attention, KV cache, decoding)
  • Has profiled and optimized real performance problems (CPU or GPU) and can prove the speedup
  • Linux, build systems, and low-level debugging expertise

Nice To Haves

  • Hands-on with llama.cpp, vLLM, ggml, or similar engines
  • Experience with GPU / accelerator programming (Vulkan, CUDA, SYCL, Metal) or SIMD / CPU kernels
  • Familiarity with quantization formats and their quality trade-offs
  • Open-source contributions to inference engines

Responsibilities

  • Profile and optimize local inference (llama.cpp-vulkan and vLLM) for latency, throughput, and memory on edge hardware
  • Tune KV cache, continuous batching, and scheduling for interactive agent workloads
  • Drive quantization strategy (GGUF / AWQ / GPTQ) and validate quality impact with the Post-Training team
  • Cut CPU overhead and improve engine startup, model load, and lifecycle (start / stop / health)
  • Benchmark across hardware tiers and publish honest performance comparisons
  • Upstream fixes and patches to open-source engines where it helps us

Benefits

  • competitive pay
  • stock bonuses
  • health
  • retirement
  • vacation