About The Position

We’re looking for an engineer who can take early, sometimes messy, pre-production hardware and make it “real”: bootstrapped, stable, imaged, joined to the right Kubernetes control plane, registered correctly, scheduled, and observable. You’ll sit at the intersection of early HW bring-up, provisioning automation, fleet/cluster management systems, and lab or cloud provider integration—turning new SKUs into capacity that is usable by internal customers.

Requirements

  • BS in CS/EE (or equivalent practical experience).
  • 5+ years of experience in systems SW development and building/operating Linux-based infrastructure in production or pre-production environments.
  • Strong, hands-on experience with:
      • Kubernetes cluster operations (node lifecycle, bootstrap/join, debugging control-plane connectivity)
      • Infrastructure-as-Code / config management (Terraform, Chef/Ansible, etc.)
      • Provisioning and imaging (PXE/iPXE, golden images, cloud-init/user-data)
      • Networking fundamentals (L2/L3, routing, DNS, firewalling; comfort debugging reachability)
  • Proven ability to write automation in Python/Go/Bash and ship operational tooling/runbooks.

Nice To Haves

  • Experience bringing up new hardware platforms (early silicon/servers/NICs) in a lab setting and turning them into stable fleet capacity.
  • Multi-cloud operational experience (Azure/GCP/AWS/OCI), especially with compute pools (e.g., VMSS / instance pools).
  • Experience building telemetry/health pipelines (agent-based metrics/logging, health rollups, readiness criteria).
  • Familiarity with WAN, peering, and multi-site network concepts for cluster deployments.

Responsibilities

  • Own the end-to-end bring-up and bootstrap path that takes new systems and compute nodes, whether bare metal or early-access hardware in lab, production, or cloud environments, to schedulable fleet capacity: image build, user-data/config, cluster join, and readiness gates.
  • Build and maintain “first-class” golden image + provisioning workflows across lab and production environments, including working with partner-provided base images and reconciling OS/version requirements.
  • Work with partner teams to integrate nodes into our fleet infrastructure and IaC pipelines (Terraform, Chef, etc.), ensuring cloud resources map cleanly onto our internal lifecycle expectations (e.g., VMSS/instance pools, image references).
  • Partner with scheduling and platform owners to ensure new hardware is reachable and scheduled (pool definitions, network/WAN connectivity/routing, admission controls, platform-specific quirks), including cases where new SKUs require changes for scheduling integration.
  • Drive registration and inventory correctness (e.g., systems that track nodes and their metadata), including hands-on support to get nodes registered and visible end-to-end.
  • Collaborate with partner teams to implement baseline health + telemetry bring-up: minimum viable health signals, pass/fail checks, and automated reporting suitable for early ramp decisions.
  • Debug issues across layers: PXE/boot-loader, UEFI/BIOS, BMC, OS bring-up, NIC/network reachability, kubelet/control-plane connectivity, storage constraints, and early rack/lab realities.
© 2024 Teal Labs, Inc