Senior Staff Back-end Engineer

Coupang

About The Position

Coupang Intelligence Cloud (CIC) builds and operates the compute platform powering Coupang's AI/ML and large-scale workloads. We are now extending the platform from serverless, container-based workload support to virtualization. We are building a VM offering on top of KVM with a tuned Linux kernel, using hardware- and software-assisted virtualization to get bare-metal-like performance within our VMs, and an OVN/OVS-based SDN that can be offloaded to the DPU. We are looking for a senior technical leader to drive this effort.

As a Senior Staff Engineer, you will own the hypervisor, host kernel, and DPU layers of our multi-tenant VM platform. You will architect the QEMU-KVM stack, use NVIDIA DOCA for virtualization, design the SDN data plane using OVS and OVN, and lead the passthrough strategy for GPUs and InfiniBand, including NVSwitch and Shared NVLink topologies. You will partner closely with engineering leadership in the US, Korea, and China. This is a hands-on individual contributor role with significant technical scope and the opportunity to shape the architecture from the ground up.

Requirements

12+ years of systems software engineering experience, with at least 6 years focused on virtualization, hypervisors, and/or the Linux kernel
Deep, hands-on expertise with QEMU-KVM internals — virtio, vhost-user, machine types, CPU topology, NUMA pinning, hugepages, live migration
Strong Linux kernel proficiency — KVM, vfio, vhost, namespaces, cgroups, netfilter, eBPF, scheduler, memory management; comfortable reading and patching kernel code
Experience building, customizing, or maintaining a production Linux kernel — config tuning, patch management, backports, and (ideally) upstream contributions
Production experience with Open vSwitch — OpenFlow pipeline design, datapath performance tuning (DPDK or kernel datapath), conntrack, debugging at scale
Strong working knowledge of OVN — logical switches/routers, ACLs, distributed gateway routers, NB/SB databases
Solid networking fundamentals: VXLAN, GENEVE, BGP, EVPN, L2/L3 routing, multicast, MTU/MSS handling
Strong systems programming skills in C and/or Go; experience contributing to large open-source codebases
Track record of leading complex, cross-team technical initiatives end-to-end

Nice To Haves

Upstream contributions to the Linux kernel (KVM, vfio, vhost, networking, scheduler, or mm subsystems)
Experience with KubeVirt or other Kubernetes-native virtualization platforms
GPU virtualization experience — SR-IOV, vGPU, PCIe passthrough, IOMMU groups, NVSwitch/NVLink topology on NVIDIA H100/H200/B200
Production experience operating BGP EVPN fabrics (Arista EOS, Cumulus, or SONiC)
Upstream contributions to OVS, OVN, QEMU, libvirt, or DPDK
Experience with cloud-init, Cloud Hypervisor, or Firecracker
Experience designing for hyperscale environments — thousands of hypervisors, tens of thousands of VMs, multi-region

Responsibilities

Own the hypervisor stack end-to-end (QEMU-KVM, libvirt, and host kernel), and design it so that the virtualization logic can be offloaded entirely to the DPU
Drive Linux kernel strategy for hypervisor hosts — kernel version selection, custom patches, KVM/vfio/vhost subsystem tuning, scheduler and memory tuning for VM workloads, backporting fixes, and contributing patches upstream
Debug and resolve issues across the full virtualization stack — guest, QEMU, KVM, host kernel, and hardware — including performance regressions, livelock, and corner cases that surface only at fleet scale
Architect and own the multi-tenant VM and bare-metal platform, including the VM and bare-metal lifecycle
Design and implement the SDN data plane using OVS and OVN: OpenFlow pipeline design, VXLAN/Geneve tunneling, distributed routing, and per-tenant network isolation
Lead the GPU virtualization strategy: SR-IOV, PCIe passthrough, IOMMU/NUMA topology, and Shared NVLink via NVIDIA Fabric Manager on B200/GB300/RV200.
Drive technical decisions across squads: write design docs, lead design review sessions, and partner with networking, storage, and platform teams in the US, Korea, China, and India
Set the technical bar for code review, design, and operational excellence; mentor senior and staff engineers
Own production reliability for the virtualization platform — define SLOs, drive incident response, and meet a 99%+ availability target as the platform scales