Senior Infrastructure Engineer (KVM Compute / Distributed Storage)

Teraswitch•Pittsburgh, PA

Onsite

About The Position

Engineered to outperform, Teraswitch is on a mission to provide high-performance infrastructure services for critical workloads. With 20+ datacenter locations around the world interconnected by our low-latency global backbone network, we are the class leader in performance bare metal hosting and are rapidly expanding into additional infrastructure services.

The Job

The Infrastructure Engineering team at Teraswitch is responsible for the compute, storage, and platform infrastructure that powers our products and internal operations. This senior/staff-level role is focused on building provider-grade hosted compute and storage services, specifically a KVM-based VM product and a distributed object (S3) and block (NVMe/TCP) storage product. Qualified candidates will have depth in at least one of these areas.

You will help architect and build cloud-scale, globally distributed products for a high-performance infrastructure provider, with an emphasis on automation, scalability, and security by design. While this role has a compute and storage services focus, as a senior member of the Infrastructure Engineering team, you’ll also be expected to cross-train and contribute broadly across infrastructure domains as we grow the team.

Requirements

Strong Linux systems and networking expertise, with production operations experience
Depth in at least one of the following:
Compute / virtualization: KVM/QEMU, libvirt and/or platforms such as Proxmox/OpenStack; image pipelines; fleet operations; multi-tenant considerations
Distributed storage services: experience with distributed storage platforms (Ceph, VAST, Weka, or similar) and/or managing block/object storage offerings; public/multi-tenant deployment experience is a plus
Automation: experience with scripting (Python, Bash, etc.) and/or configuration management (Ansible or similar)
Experience with observability/monitoring systems (metrics, logs, traces, alerting) and using them to enhance production service reliability
Comfortable working in a fast-paced, results-oriented environment
Committed to operational best practices and security by design

Nice To Haves

Service / hosting provider experience (multi-tenant systems, automation-first operations, scalable and secure design)
Experience with VPS/KVM hosting at scale, including networking and security
Experience with distributed storage systems such as Ceph, Weka, or VAST, particularly in a service provider environment
Expertise in object storage / S3 services: gateway/front-door patterns (F5/Nginx/HAProxy), networking, durability, security
Strong networking fundamentals relevant to provider environments (routing/segmentation, IPAM/DHCP/DNS integration)
Cloud-native observability/monitoring (e.g., Prometheus, Grafana, OpenTelemetry)
Kubernetes and cloud-native (CNCF) ecosystem experience
Demonstrated ability to design and operate automation-first infrastructure at scale
Experience in other Infrastructure team domains, e.g., self-hosted Kubernetes deployment/management and/or bare metal automation and fleet management

Responsibilities

Design and implement provider-scale, globally distributed hosted services, with a focus on compute (KVM-based cloud), storage (distributed object and block services), or both
Compute track: Evaluate/design, implement, and manage a KVM-based cloud compute platform
Storage track: Evaluate, implement, and manage a distributed storage platform (Ceph, Weka, VAST, etc.) that supports object (S3) and block (NVMe/TCP) protocols
Define provisioning workflows, node/fleet management, and scalable operations
Integrate service networking primitives (IPAM, DHCP, DNS) and customer interfaces to the product
Design multi-tenant provisioning and controls: isolation boundaries, quotas/limits, metering, and security
Build automation and tooling for global deployments of these products: upgrades, capacity expansion, failure handling, rebalancing
Implement robust observability for these products to enhance production service reliability (metrics, logs, traces; dashboards; actionable alerting)
Collaborate with the Software team to integrate these products with our customer control plane (portal, API) and billing systems, ensuring robust customer-driven lifecycle management
Cross-train with the rest of the Infrastructure Engineering team and contribute broadly to the compute, storage, and platform infrastructure that powers Teraswitch products and internal operations
Participate in an on-call system supporting critical production systems