Senior IaaS / Kubernetes Platform Engineer (worldwide remote, work anywhere)

CloudLinux

About The Position

CloudLinux is a global remote-first company driven by principles of doing the right thing, prioritizing employees, and delivering high-volume, low-cost Linux infrastructure and security products. We are seeking a Senior IaaS / Kubernetes Platform Engineer to join our Infrastructure Department.

This role is crucial for the design, implementation, and operation of our private cloud and multi-tenant Kubernetes platform. Our current infrastructure supports over 500 VMs across multiple datacenters for more than 20 engineering teams. We are transitioning from an OpenNebula-based virtualization platform to a Kubernetes-native multi-tenant cloud utilizing KubeVirt for VM orchestration, while ensuring continued reliability and operational excellence.

The successful candidate will collaborate with the IaaS Tech Lead and Network Engineer, and must possess the ability to independently manage the full IaaS stack (compute, storage, networking, bare metal). This is a comprehensive infrastructure role requiring deep generalist skills alongside Kubernetes platform expertise.

Requirements

5+ years in infrastructure/platform engineering roles.
At least 3 years operating production Kubernetes clusters (building and managing the platform itself).
Production experience with at least 3 of the following: KubeVirt or similar VM-on-K8s technology, Cluster API (CAPI), Cilium or Calico, Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs), ArgoCD or Flux for GitOps-driven infrastructure management.
Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting.
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states.
Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale.
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations.
Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing.
Strong written and verbal English (B2+ minimum).
Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked.

Nice To Haves

Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation).
Crossplane or similar Kubernetes-native infrastructure abstraction.
Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden.
Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor).
SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks.
FinOps: OpenCost, Kubecost, cloud cost optimization.
Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar.
OpenNebula experience.
Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage.
SR-IOV and DPDK experience for hardware-accelerated networking.
Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt.
Grafana LGTM stack (Mimir, Loki, Tempo) for observability.
Compliance environment experience (SOC2, ISO 27001, NIS2).
Go or Python programming for infrastructure tooling.
Experience with Juniper JunOS switch configuration.

Responsibilities

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.
Operate and optimize Ceph distributed storage clusters.
Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.
Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
Maintain IPsec site-to-site connectivity between datacenters.
Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting.
Design and execute chaos engineering experiments to validate system resilience.
Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
Write and maintain runbooks, DRP documentation, and postmortem analyses.
Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil, then propose and implement solutions without waiting for incidents.
Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
Write Ansible playbooks for bare-metal server configuration and fleet management.
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management.
Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost.

Benefits

Focus on professional development
Interesting and challenging projects
Fully remote work with flexible working hours
Work from any location worldwide
24 days of paid vacation per year
10 days of national holidays
Unlimited sick leave
Compensation for private medical insurance
Co-working and gym/sports reimbursement
Budget for education
Opportunity to receive a reward for the most innovative idea that the company can patent