About The Position

The Opportunity

Photoshop ART is seeking a Senior Machine Learning (ML) Systems & Efficiency Engineer to join our R&D team, which is focused on delivering practical, production-ready improvements in inference performance, latency, and cost efficiency across image editing applications. This role sits at the intersection of model architecture, systems, inference runtimes, and services, with a clear mandate: deliver high-quality ML systems at substantially lower cost and higher efficiency.

Individuals in this role are expected to have deep expertise in areas such as Artificial Intelligence (AI), ML systems, and computer vision. Strong preference will be given to candidates with experience in distributed inference, multimodal model profiling, and performance optimization.

You will work closely with research, product, and infrastructure teams to influence model design decisions, improve GPU utilization, and build scalable, cost-aware ML systems deployed in production. This is a hands-on, high-leverage role where a single engineer can drive outsized impact, potentially saving millions of dollars in compute costs. The ideal candidate will have a strong interest in developing practical innovations that advance Adobe products.

Requirements

  • Master’s or PhD in Computer Science, Electrical Engineering, or a related field, with a focus on machine learning systems, distributed systems, or high-performance computing.
  • Hands-on experience implementing and scaling large-scale inference or serving workloads using distributed frameworks and runtime systems (e.g., Triton, vLLM, SGLang, xDiT, or similar).
  • Experience applying inference compilation and optimization tools (e.g., TensorRT, ONNX Runtime, AOTI), including techniques such as operator fusion and graph-level optimization, with a strong understanding of system-level performance tradeoffs.
  • Strong understanding of GPU architecture (e.g., memory hierarchy, compute throughput, communication bandwidth) and practical experience diagnosing performance bottlenecks across compute, memory, and I/O subsystems.
  • Proficiency in Python and C++, with experience building high-performance or distributed systems.
  • Familiarity with CUDA or Triton for performance-critical workloads is highly desirable.
  • Demonstrated ability to make engineering decisions based on rigorous measurement and benchmarking, with a focus on improving system efficiency, scalability, and reliability in production environments.

Nice To Haves

  • Experience contributing to or maintaining performance- or efficiency-focused libraries or systems.
  • Hands-on experience with open-source serving frameworks (e.g., vLLM, SGLang, xDiT, or similar).
  • Hands-on experience with inference compilation tools (e.g., TensorRT, Triton, AOTI, or equivalent) and techniques such as operator fusion or graph-level optimization.
  • Hands-on experience with GPU profiling and performance analysis tools (e.g., PyTorch Profiler, NVIDIA Nsight, CUDA tooling).
  • Exposure to low-level communication libraries such as NCCL and a practical understanding of collective operations (e.g., AllReduce, AllGather) in large-scale distributed serving environments.
  • Familiarity with containerized workflows (Docker, Kubernetes) and job scheduling in headless Linux environments, including experience operating production ML workloads on shared GPU clusters.
  • Working knowledge of model architectures such as Transformers, multimodal models, Mixture-of-Experts (MoE), or Diffusion Transformers (DiT).

Responsibilities

  • Design and optimize high-throughput, low-latency inference systems.
  • Optimize model architectures to improve deployment and runtime efficiency using techniques such as distillation, pruning, quantization, and Mixture-of-Experts (MoE).
  • Implement advanced serving techniques including batching, caching (KV, semantic, embedding), and quantization (FP8/INT8), as well as distributed inference strategies (data, tensor, pipeline, expert, and hybrid parallelism), with a focus on balancing computation and communication efficiency.
  • Explore training or fine-tuning approaches when they directly lead to more efficient inference, simpler deployment, or improved runtime performance.
  • Write and maintain high-performance GPU kernels using Triton or CUDA to accelerate custom model layers and critical workloads.
  • Improve GPU utilization through kernel fusion, asynchronous pipelines, and optimized scheduling strategies.
  • Conduct deep performance analysis using tools such as PyTorch Profiler and NVIDIA Nsight to identify bottlenecks in compute, memory, and communication.
  • Optimize end-to-end system performance across inference workloads.
  • Partner with infrastructure teams to design scalable and reliable distributed serving systems across heterogeneous hardware environments (e.g., A100, H100, B200, CPU).
  • Contribute to resource scheduling, GPU pooling, and elastic workload management.
  • Establish and track efficiency metrics such as cost per million inferences.
  • Build benchmarking frameworks and dashboards to guide tradeoffs among quality, latency, and compute cost, enabling data-driven system and product decisions.
  • Serve as a trusted technical advisor to research and product teams on efficiency tradeoffs.
  • Define best practices for scalable and cost-efficient ML development and mentor engineers on performance-oriented systems design.

Benefits

  • Comprehensive benefits programs