About The Position

We are looking for engineers who go beyond “training bigger models.” You will focus on understanding what happens inside models and on improving their efficiency, reliability, and interpretability, often without relying on massive compute.

Requirements

  • Strong foundation in deep learning and neural network architectures.
  • Hands-on experience with model efficiency optimization (quantization, pruning, distillation, etc.).
  • Experience working under resource constraints (edge devices, real-time systems, or low-latency services).
  • Demonstrated ability to analyze model internals, not just train models.
  • Experience with weight/activation distribution analysis, debugging model behavior beyond metrics, and understanding why a model works or fails.

Nice To Haves

  • Experience with model compression or deployment frameworks (TensorRT, ONNX, TVM, etc.).
  • Experience with numerical stability or low-precision training.
  • Experience with interpretability or mechanistic analysis of neural networks.
  • Prior work showing deep investigation into model behavior, not just scaling experiments.

Responsibilities

  • Design and optimize lightweight neural networks (e.g., ShuffleNet, EfficientNet) for high parameter efficiency and real-time performance.
  • Improve latency, memory footprint, and throughput under real-world constraints (on-device / real-time systems).
  • Apply and extend techniques such as quantization, pruning, distillation, and operator-level optimization.
  • Analyze model weights, activations, and internal representations to understand decision mechanisms.
  • Investigate failure cases and error patterns, especially under distribution shift or long-tail scenarios.
  • Develop tools or methods to attribute model behavior (e.g., neuron-level analysis, feature attribution, representation probing).
  • Study and improve robustness of models under transformations such as quantization or compression.
  • Diagnose and mitigate performance degradation caused by quantization or reduced precision.
  • Analyze weight/activation distributions and sensitivity to precision changes.
  • Design improved quantization strategies to maintain accuracy under strict compute constraints.
  • Dive deep into model execution to identify bottlenecks at the kernel, operator, and graph levels.
  • Build experiments to validate hypotheses about model behavior, rather than relying on brute-force scaling.
  • Maintain a strong focus on measurable improvements (latency, memory, stability, error rates).