Staff Engineer, High Performance Data & Algorithm Infrastructure

Foresite Labs (Stealth Co), San Diego, CA
Onsite

About The Position

We are looking for a Senior Staff Software Engineer with deep expertise in high-performance computing (HPC), Linux systems, and GPU-accelerated data pipelines. This is a highly technical, hands-on role focused on extracting maximum performance from modern CPUs, GPUs, memory subsystems, and high-speed networks. You will work close to the hardware and operating system, tuning kernels, BIOS settings, and drivers, while also designing and implementing low-latency data processing pipelines that include real-time signal processing. If you enjoy profiling, tuning, and eliminating bottlenecks across the full stack, from BIOS to CUDA kernels to network offload, this role is for you.

Requirements

  • 7+ years of professional software engineering experience (or equivalent depth)
  • Strong background in high-performance computing or performance-critical systems
  • Expert-level Linux experience, including kernel and system tuning
  • Deep experience with GPU computing and CUDA (required)
  • Strong systems programming skills in C/C++ (and/or Rust)
  • Solid understanding of computer architecture:
      • CPU caches, NUMA, and memory hierarchies
      • PCIe and DMA
      • GPU architectures
  • Extensive experience profiling and tuning complex systems
  • Comfortable using tools such as perf, ftrace, eBPF, Valgrind, Nsight, and similar
  • Ability to reason quantitatively about latency, bandwidth, and throughput
  • Practical experience implementing DSP algorithms in production systems
  • Strong understanding of FFTs, convolution/deconvolution, filtering, and thresholding
  • Ability to optimize numerical algorithms for real-time or near-real-time constraints
  • BS/MS in Computer Science or Engineering
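As a small illustration of the quantitative latency/bandwidth reasoning called for above, here is a back-of-envelope sketch; the bandwidth figures are illustrative assumptions, not measurements of any particular system:

```python
# Back-of-envelope estimate of data-movement time across common links.
# All bandwidth numbers below are rough, assumed figures for illustration;
# real tuning work starts from measured numbers on the actual hardware.

GIB = 1024 ** 3

def transfer_time_ms(bytes_moved: int, bandwidth_gib_s: float) -> float:
    """Milliseconds to move `bytes_moved` at a sustained bandwidth (GiB/s)."""
    return bytes_moved / (bandwidth_gib_s * GIB) * 1e3

payload = 4 * GIB  # e.g. one batch of captured samples

# Assumed sustained (not peak) bandwidths in GiB/s.
links = {
    "DRAM (one channel)": 30.0,
    "PCIe Gen4 x16 (effective)": 24.0,
    "100 GbE (effective)": 11.0,
}

for name, bw in links.items():
    print(f"{name}: {transfer_time_ms(payload, bw):.1f} ms")
```

Sketches like this bound where a pipeline's time must be going before any profiler is attached.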

Nice To Haves

  • Experience with RDMA, GPUDirect RDMA, or other hardware offload technologies
  • Experience with custom kernel builds or kernel module development
  • Familiarity with real-time or low-latency Linux variants
  • Experience deploying HPC workloads at scale
  • Background in scientific computing, signal processing, or computational physics

Responsibilities

  • Design, build, and optimize high-throughput, low-latency compute pipelines
  • Profile and tune performance across CPUs, GPUs, memory, storage, and networking
  • Identify and eliminate bottlenecks in data movement and computation
  • Work directly with hardware and OS configuration to achieve deterministic, repeatable performance
  • Configure and tune Linux systems for high-performance workloads
  • Customize and tune Linux kernel parameters (scheduler, NUMA, IRQs, huge pages, IOMMU, etc.)
  • Tune CPU and BIOS parameters (power states, frequency scaling, SMT, NUMA, memory timing)
  • Manage and optimize DMA paths between devices and system memory
  • Minimize context switches, cache misses, and system jitter
  • Develop and optimize GPU-accelerated compute pipelines using CUDA
  • Optimize memory transfers between host and GPU (pinned memory, zero-copy, GPUDirect where applicable)
  • Tune kernel launches, memory access patterns, and occupancy
  • Configure and manage GPU drivers, runtime, and system-level settings for maximum throughput
  • Profile GPU workloads using tools such as Nsight Systems and Nsight Compute
  • Optimize high-speed data ingestion and offload to HPC systems
  • Work with low-latency and high-bandwidth networking technologies (e.g., RDMA, InfiniBand, high-speed Ethernet)
  • Minimize data transfer latencies across network, PCIe, and memory boundaries
  • Design zero-copy or near-zero-copy data paths where possible
  • Implement and optimize digital signal processing algorithms, including:
      • FFTs
      • Deconvolution
      • Thresholding and detection algorithms
  • Optimize DSP workloads for CPU vectorization and GPU acceleration
  • Balance numerical accuracy, latency, and throughput constraints
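The FFT and thresholding/detection responsibilities above can be sketched in miniature. The following is a pure-Python radix-2 FFT feeding a simple magnitude-threshold detector, purely illustrative; production code would use vectorized CPU or CUDA FFT libraries, and the signal and threshold here are made up for the example:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # even-indexed samples
    odd = fft(x[1::2])   # odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def detect_tones(samples, threshold):
    """Return FFT bin indices whose magnitude exceeds `threshold`."""
    spectrum = fft(samples)
    return [k for k, v in enumerate(spectrum) if abs(v) > threshold]

# A pure 2-cycle sine over 16 samples concentrates energy in bins 2 and 14.
n = 16
signal = [cmath.sin(2 * cmath.pi * 2 * t / n).real for t in range(n)]
print(detect_tones(signal, threshold=4.0))
```

The same structure, with the transform and the threshold scan swapped for library or CUDA kernels, is the skeleton of an FFT-based detection pipeline.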

Benefits

  • Competitive compensation and equity package, comprehensive benefits, and flexibility to support work-life integration.