About The Position

We are looking for a strong Senior Software Engineer to help design and build scalable backend systems for AI and GPU cluster observability. In this role, you will work on high-performance distributed systems that power telemetry ingestion, data processing, and APIs for monitoring large-scale GPU clusters and AI workloads.

Requirements

  • 7+ years of industry experience building and operating production software systems.
  • Strong foundation in data structures, algorithms, and software design.
  • Fluency in one or more programming languages: C, C++, Go, Java, or Python.
  • Experience designing, building, and scaling large distributed systems.
  • Hands-on experience with service-oriented architectures and cloud platforms (AWS, GCP, Azure).
  • Solid understanding of operating systems fundamentals (threads, scheduling, synchronization; kernel programming is a plus).
  • Experience with databases, including design, development, or scaling.
  • Excellent debugging, problem-solving, and communication skills.

Nice To Haves

  • Knowledge of networking protocols; familiarity with NIC architecture and operation.
  • Understanding of GPU or AI infrastructure (e.g., DCGM, PyTorch).
  • Familiarity with observability systems (metrics, logs, traces); experience with OpenTelemetry, Prometheus, or distributed tracing.
  • Enthusiasm for challenging projects.

Responsibilities

  • Design and build scalable backend systems for metric collection, processing, and analysis.
  • Develop robust methods to detect complex infrastructure issues that impact AI workloads.
  • Build large distributed systems that run in production environments.
  • Collaborate across teams to deliver reliable, performant, and maintainable systems.

Benefits

  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.
© 2024 Teal Labs, Inc