About The Position

We are seeking an experienced Tech Lead to own and drive our on-prem storage platform and associated data infrastructure. This role requires deep hands-on expertise across on-prem hardware, Linux systems, storage, networking, automation, and observability, along with the ability to provide technical leadership, mentor engineers, and partner with platform and data teams. You will act as the technical owner for the on-prem storage and data ecosystem, ensuring reliability, scalability, performance, and operational excellence.

Requirements

  • Strong experience as a Tech Lead / Senior Platform Engineer / Storage Engineer
  • Deep expertise in Linux systems administration
  • Hands-on experience with on-prem storage platforms, especially MinIO
  • Strong understanding of physical servers, NVMe storage, GPUs, and networking concepts (TCP/IP, DNS, routing, proxies, load balancers, switches)
  • Proficiency in automation and scripting: Python, Ansible, and shell scripting
  • Experience with data platforms: Alluxio, Spark / Spark SQL, MySQL, and Hive Metastore
  • Strong experience with observability and logging stacks
  • Experience managing infrastructure using Terraform
  • Familiarity with HashiCorp Vault for secrets management
  • Strong troubleshooting, debugging, and performance optimization skills
  • Excellent communication skills and ability to lead technical discussions

Nice To Haves

  • Prior experience as a technical lead or senior SRE/Platform Engineer owning a critical storage or data platform.
  • Experience in high‑availability, large‑scale, or low‑latency environments (e.g., analytics platforms, data lakes, or AI/ML infrastructure).
  • Familiarity with security and compliance requirements for enterprise environments (access controls, auditing, encryption, backup/restore policies).

Responsibilities

  • Serve as the technical lead and owner for the on-prem MinIO storage and Alluxio (caching) platform.
  • Lead architecture, implementation, and lifecycle management of the on-prem storage and data infrastructure (capacity, performance, resiliency, DR, security).
  • Manage and optimize on-prem hardware, including physical servers, NVMe storage, GPUs, network components, and the underlying OS configuration (RHEL, CentOS, Rocky Linux).
  • Partner with the networking team to design and manage networking components for the platform, including VLANs, routing, DNS, network switches, proxies, and load-balancing architectures.
  • Lead and support automation and infrastructure tooling, including Python, Ansible, shell scripting, and Terraform.
  • Support and integrate data and analytics platforms, including Apache Spark, Spark SQL, MySQL, Hive Metastore, and Alluxio as a data orchestration/caching layer integrated with MinIO and compute engines (e.g., Spark).
  • Implement and maintain monitoring, logging, and observability solutions: Grafana, Prometheus, Datadog, Kibana, Elasticsearch, and Logstash.
  • Define SLOs/SLIs/error budgets for storage and data services; lead incident response, root‑cause analysis, and long‑term remediation.
  • Collaborate with security, networking, DBA, and data engineering teams to ensure compliant, performant, and reliable services.
  • Own high availability, performance tuning, capacity planning, and disaster recovery for the platform.
  • Drive best practices for GitHub workflows, code reviews, and automation pipelines.
  • Troubleshoot complex production issues across storage, compute, networking, and applications.
  • Mentor engineers, review designs, and set technical standards and best practices for the platform