Infrastructure Engineer - Virtualization

TensorWave • Las Vegas, NV


About The Position

TensorWave is seeking a Virtualization Operations Engineer to focus on the day-to-day operation, stability, and performance of its virtualization platforms. This role is responsible for ensuring that hypervisor environments are reliable, performant, and scalable. It is a hands-on operations role working across hypervisors, virtual machines, and underlying infrastructure systems, supporting high-performance AI workloads across multiple data centers. The environment includes GPU-intensive systems, high-throughput networking, and rapidly scaling compute clusters.

Requirements

4–7+ years of experience in infrastructure, systems, or platform operations
Hands-on experience operating Linux-based virtualization platforms, such as KVM/QEMU, Proxmox, or VMware (with strong Linux fundamentals)
Strong Linux systems knowledge, including process management, networking, and disk and filesystem management
Experience troubleshooting CPU and memory contention, disk I/O bottlenecks, and network performance issues
Familiarity with virtualization concepts: VM lifecycle, resource allocation, and live migration
Experience with infrastructure automation tools (e.g., Ansible or similar)
Ability to work effectively during incidents and production issues

Nice To Haves

Experience operating infrastructure at scale (100+ hosts)
Familiarity with GPU-based systems or high-performance workloads, and with NUMA awareness and performance tuning
Exposure to high-throughput networking (bonding, VLANs, SR-IOV) and to distributed or high-performance storage systems
Experience working alongside Kubernetes or container platforms
Experience in cloud or CSP environments

Responsibilities

Operate and maintain large-scale virtualization environments (Proxmox and/or KVM-based systems)
Manage the full lifecycle of virtual machines: provisioning, configuration, migration, and decommissioning
Monitor and respond to platform health issues, including host failures, VM performance degradation, and resource contention (CPU, memory, disk, network)
Troubleshoot and resolve issues across hypervisors, guest operating systems, and storage and networking layers
Execute infrastructure changes safely, including cluster expansions, host maintenance and upgrades, and configuration updates
Work with automation tools to standardize deployments, reduce manual intervention, and improve operational consistency
Collaborate with DevOps (automation and platform tooling), Network Engineering (connectivity and performance), and Storage Engineering (I/O performance and reliability)
Participate in incident response and root cause analysis
Contribute to runbooks, documentation, and operational best practices

Benefits

Stock Options
100% paid Medical, Dental, and Vision insurance for Employees
Company Health Savings Account Contributions
100% paid Short Term and Long Term Disability Insurance for Employees
Life and Voluntary Supplemental Insurance Options
Other Insurance Options, such as Pet & Legal Insurance
Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
Flexible Spending Account
401(k)
Employee Assistance Program
Flexible PTO
Paid Holidays
Parental Leave
Other In-Office Perks