About The Position

We are building and operating large-scale infrastructure platforms to support high-performance AI and machine learning workloads across multiple data centers. Our environment includes GPU-intensive systems, high-throughput networking, and distributed storage platforms that must deliver consistent performance at scale. We are looking for a Staff Infrastructure Engineer – Storage Platform to own the design, operation, and evolution of our storage systems. This role combines architecture and hands-on operational ownership, ensuring that storage platforms are both well-designed and reliably executed in production. You will be responsible for defining how storage works across the organization while remaining deeply involved in real-world system behavior, performance tuning, and incident response.

Requirements

  • 7+ years of experience in infrastructure, storage, or distributed systems
  • Deep hands-on experience with distributed storage systems in production
  • Strong experience with Ceph (RBD, CephFS, and/or RGW)
  • Experience with high-performance storage platforms such as Weka, VAST Data, or similar
  • Strong understanding of storage performance characteristics
  • Strong understanding of data replication and failure domains
  • Strong understanding of distributed system design principles
  • Strong Linux systems expertise
  • Ability to troubleshoot across storage, network, and compute layers

Nice To Haves

  • Experience supporting AI/ML or HPC workloads
  • Familiarity with NVMe-based architectures
  • Familiarity with RDMA or high-throughput Ethernet
  • Experience integrating storage with Kubernetes at scale
  • Experience operating across multiple data centers
  • Exposure to object storage and S3-compatible APIs

Responsibilities

  • Design and evolve storage architectures supporting Kubernetes (block, file, and object storage), AI/ML, and high-performance compute workloads
  • Evaluate and select storage technologies based on performance (IOPS, throughput, latency); scalability and fault tolerance; and operational complexity and maintainability
  • Define storage standards, best practices, and reference architectures
  • Design for resilience over traditional HA, including failure-domain awareness
  • Own production storage platforms, including Ceph (RBD, CephFS, RGW) and high-performance NAS (Weka, VAST, or similar)
  • Lead lifecycle operations: cluster design and deployment, expansion and scaling, and upgrades and migrations
  • Perform and guide capacity planning, performance tuning, and failure analysis
  • Analyze storage performance across IOPS, throughput, latency, and tail latency
  • Identify and resolve bottlenecks across disk subsystems, network paths (including RDMA), and client access patterns
  • Lead root cause analysis for storage-related incidents
  • Ensure storage platforms meet the demands of GPU and Kubernetes workloads
  • Define and implement Kubernetes storage patterns: CSI drivers, StorageClasses, and persistent storage design
  • Troubleshoot complex Kubernetes storage issues involving stateful workloads, provisioning failures, and performance anomalies
  • Partner with platform teams to align storage with workload requirements
  • Design and implement automation for storage deployment and configuration and for cluster lifecycle management
  • Use tools such as Ansible, Terraform, and Kubernetes manifests/Helm
  • Integrate storage platforms into observability stacks (Prometheus, Grafana, etc.)
  • Serve as the technical authority for storage across the organization
  • Mentor engineers on storage systems, performance, and troubleshooting
  • Establish operational standards and best practices
  • Drive continuous improvement of storage reliability and performance

Benefits

  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance for Employees
  • Company Health Savings Account Contributions
  • 100% paid Short Term and Long Term Disability Insurance for Employees
  • Life and Voluntary Supplemental Insurance Options
  • Other Insurance Options, such as Pet & Legal Insurance
  • Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
  • Flexible Spending Account
  • 401(k)
  • Employee Assistance Program
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Other In-Office Perks
© 2026 Teal Labs, Inc