Infrastructure Engineer (Observability)

Voltage Park
San Francisco, CA
Remote

About The Position

Voltage Park is seeking an Infrastructure Engineer with a focus on Observability to join our Infrastructure Engineering team. Our engineers design and operate the systems that manage thousands of bare-metal servers, GPUs, and high-performance networks across multiple data centers. This role combines the breadth of a core infrastructure engineer with a specialty in observability and telemetry. You’ll design and operate metrics, logs, traces, and alerting pipelines that provide actionable insights for both internal teams and external customers, helping ensure reliability and transparency at scale. This is a fully remote position, but candidates must be based in the continental United States. Unfortunately, we are unable to provide sponsorship for this role.

Requirements

  • 8+ years in infrastructure engineering, SRE, or observability roles.
  • Strong experience with monitoring systems (Prometheus, Grafana, ELK, VictoriaMetrics, or similar).
  • Proficiency in Python, Go, or Bash for automation and data integration.
  • Familiarity with container/Kubernetes observability.
  • Understanding of streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent).
  • Strong written and verbal communication skills.

Nice To Haves

  • Experience with GPU observability, particularly NVIDIA DCGM.
  • Experience designing multi-tenant observability solutions with RBAC and scoped queries.
  • Prior work with correlation engines for RCA, forecasting, or predictive alerting.
  • Broader exposure to infrastructure domains (networking, storage, provisioning).

Responsibilities

  • Design, build, and maintain observability platforms spanning metrics, logs, traces, and events.
  • Create dashboards and alerting for internal stakeholders (InfraOps, Engineering, Customer Success) and scoped visibility for external customers.
  • Ingest and correlate telemetry from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish.
  • Implement noise-resistant alerting pipelines that improve detection and reduce operational load.
  • Collaborate with infrastructure, platform, and customer-facing teams to embed observability into workflows.
  • Contribute to broader infrastructure engineering projects beyond observability.