Staff Machine Learning Engineer

TalTeam • Posted about 2 months ago

Full-time • Mid Level

Onsite • San Jose, CA

101-250 employees

What you'll do (Responsibilities)

Own the technical roadmap for Verilog/RTL‐focused LLM capabilities—from model selection and adaptation to evaluation, deployment, and continuous improvement.
Lead a hands‐on team of applied scientists/engineers: set direction, unblock technically, review designs/code, and raise the bar on experimentation velocity and reliability.
Fine‐tune and customize models using state‐of‐the‐art techniques (LoRA/QLoRA, PEFT, instruction tuning, preference optimization/RLAIF) with robust HDL-specific evals: compile‐/lint‐/simulate‐based pass rates, pass@k for code generation, constrained decoding to enforce syntax, and “does‐it‐synthesize” checks.
Design privacy‐first ML pipelines on AWS: Training/customization and hosting using Amazon Bedrock (including Anthropic models) where appropriate; SageMaker (or EKS + KServe/Triton/DJL) for bespoke training needs. Artifacts in S3 with KMS CMKs; isolated VPC subnets & PrivateLink (including Bedrock VPC endpoints), IAM least‐privilege, CloudTrail auditing, and Secrets Manager for credentials. Enforce encryption in transit/at rest, data minimization, no public egress for customer/RTL corpora.
Stand up dependable model serving: Bedrock model invocation where it fits, and/or low‐latency self-hosted inference (vLLM/TensorRT‐LLM), autoscaling, and canary/blue-green rollouts.
Build an evaluation culture: automatic regression suites that run HDL compilers/simulators, measure behavioral fidelity, and detect hallucinations/constraint violations; model cards and experiment tracking (MLflow/Weights & Biases).
Partner deeply with hardware design, CAD/EDA, Security, and Legal to source/prepare datasets (anonymization, redaction, licensing), define acceptance gates, and meet compliance requirements.
Drive productization: integrate LLMs with internal developer tools (IDEs/plug‐ins, code review bots, CI), retrieval (RAG) over internal HDL repos/specs, and safe tool‐use/function-calling.
Mentor & uplevel: coach ICs on LLM best practices, reproducible training, critical paper reading, and building secure‐by‐default systems.

Qualifications

10+ years total engineering experience with 5+ years in ML/AI or large‐scale distributed systems; 3+ years working directly with transformers/LLMs.
Proven track record shipping LLM‐powered features in production and leading ambiguous, cross‐functional initiatives at Staff level.
Deep hands‐on skill with PyTorch, Hugging Face Transformers/PEFT/TRL, distributed training (DeepSpeed/FSDP), quantization‐aware fine‐tuning (LoRA/QLoRA), and constrained/grammar‐guided decoding.
AWS expertise to design and defend secure enterprise deployments, including Amazon Bedrock (model selection, Anthropic model usage, model customization, Guardrails, Knowledge Bases, Bedrock runtime APIs, VPC endpoints); SageMaker (Training, Inference, Pipelines); and S3, EC2/EKS/ECR, VPC/Subnets/Security Groups, IAM, KMS, PrivateLink, CloudWatch/CloudTrail, Step Functions, Batch, and Secrets Manager.
Strong software engineering fundamentals: testing, CI/CD, observability, performance tuning; Python a must (bonus for Go/Java/C++).
Demonstrated ability to set technical vision and influence across teams; excellent written and verbal communication for execs and engineers.

Preferred qualifications

Familiarity with Verilog/SystemVerilog/RTL workflows: lint, synthesis, timing closure, simulation, formal, test benches, and EDA tools (Synopsys/Cadence/Mentor).
Experience integrating static analysis/AST‐aware tokenization for code models or grammar-constrained decoding.
RAG at scale over code/specs (vector stores, chunking strategies), tool‐use/function-calling for code transformation.
Inference optimization: TensorRT‐LLM, KV‐cache optimization, speculative decoding; throughput/latency trade‐offs at batch and token levels.
Model governance/safety in the enterprise: model cards, red‐teaming, secure eval data handling; exposure to SOC 2/ISO 27001/NIST frameworks.
Data anonymization, DLP scanning, and code de‐identification to protect IP.
