Senior Staff Engineer, AI Infrastructure

Samsung•San Jose, CA

Onsite

About The Position

Samsung, a world leader in advanced semiconductor technology, is founded on a simple philosophy: the endless pursuit of excellence will create a better world for all. At Samsung Austin Research and Development Center (SARC) and Advanced Computing Lab (ACL), we are building a center of excellence for Intellectual Property (IP) that is applied to high-performance computing devices (mobile, automotive, and other custom market segments) consumed by millions of people around the world. Come build with us!

As a Senior Staff AI Infrastructure Engineer, you will architect and scale the foundational platforms that enable AI-driven silicon development across GPU and system-level design teams. Your work will directly impact how efficiently large-scale simulation, design workflows, and machine learning workloads are executed, accelerating the development, validation, and optimization of Samsung’s next-generation IPs.

In this high-impact individual contributor role, you will lead the development of distributed, data-intensive infrastructure and data platforms that integrate directly with engineering workflows. You will drive cross-functional collaboration with hardware, software, and system teams to integrate AI capabilities, accelerating scalable model training, data processing, and deployment within the silicon development lifecycle.

Leveraging your expertise in one or more technical areas, you will architect and deploy scalable AI/HPC infrastructure (on-prem, cloud, or hybrid) optimized for EDA workflows, large-scale simulation, and machine learning workloads. You thrive on developing and operationalizing MLOps pipelines for model training, validation, and deployment integrated with semiconductor design and verification flows. You elevate technical excellence by building and managing large-scale data pipelines to ingest, process, and serve engineering datasets (e.g., simulation logs, waveform data, layout, silicon telemetry).

You drive adoption of AI-enabled workflows across hardware, software, and system teams to optimize GPU cluster and accelerator utilization through advanced scheduling, orchestration, and system-level tuning. You inspire high performance by mentoring junior engineers, fostering a culture of ownership and innovation, and staying ahead of emerging AI/ML technologies for high-performance computing.

Requirements

11+ years of experience with a Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or 9+ years with a Master’s degree, or 7+ years with a Ph.D.
7+ years of expertise in distributed systems, high-performance computing, and infrastructure design at scale.
Strong experience with containerization and orchestration technologies (e.g., Kubernetes, Docker) and workload schedulers (e.g., Slurm, LSF).
Strong proficiency in a systems programming language (e.g., Python, C++), with experience building production-grade software and automation.
Hands-on experience deploying and scaling machine learning frameworks (e.g., PyTorch, TensorFlow) and implementing MLOps practices.
Experience with large-scale data processing, storage architectures, and pipeline orchestration.
Excellent analytical and problem-solving skills, with the ability to propose and execute data-driven solutions.
Excellent collaboration and communication skills, with the ability to navigate ambiguity and influence in a fast-paced, global team environment.
Ability to access information subject to U.S. export control restrictions or be eligible to receive a government authorization to access export-controlled information.

Nice To Haves

Working knowledge of semiconductor design workflows and EDA tools.
Experience with high-performance interconnects (e.g., PCIe, NVLink, InfiniBand) or accelerator-based systems.
Working knowledge of GPU physical design, design verification, or CAD environments.

Responsibilities

Architect and scale foundational platforms for AI-driven silicon development across GPU and system-level design teams.
Lead the development of distributed, data-intensive infrastructure and data platforms that integrate directly with engineering workflows.
Drive cross-functional collaboration with hardware, software, and system teams to integrate AI capabilities.
Accelerate scalable model training, data processing, and deployment within the silicon development lifecycle.
Architect and deploy scalable AI/HPC infrastructure (on-prem, cloud, or hybrid) optimized for EDA workflows, large-scale simulation, and machine learning workloads.
Develop and operationalize MLOps pipelines for model training, validation, and deployment integrated with semiconductor design and verification flows.
Build and manage large-scale data pipelines to ingest, process, and serve engineering datasets (e.g., simulation logs, waveform data, layout, silicon telemetry).
Drive adoption of AI-enabled workflows across hardware, software, and system teams to optimize GPU cluster and accelerator utilization through advanced scheduling, orchestration, and system-level tuning.
Mentor junior engineers, foster a culture of ownership and innovation, and stay ahead of emerging AI/ML technologies for high-performance computing.