Senior Software Engineer - AI Infra Visibility

Clockwork • Palo Alto, CA

About The Position

We are looking for a strong Senior Software Engineer to help design and build scalable backend systems for AI and GPU cluster observability. In this role, you will work on high-performance distributed systems that power telemetry ingestion, data processing, and APIs for monitoring large-scale GPU clusters and AI workloads.

Requirements

7+ years of industry experience building and operating production software systems.
Strong foundation in data structures, algorithms, and software design.
Fluency in one or more programming languages: C, C++, Go, Java, or Python.
Experience designing, building, and scaling large distributed systems.
Hands-on experience with service-oriented architectures and cloud platforms (AWS, GCP, Azure).
Solid understanding of operating systems fundamentals (threads, scheduling, synchronization; kernel programming is a plus).
Experience with databases, including design, development, or scaling.
Excellent debugging, problem-solving, and communication skills.

Nice To Haves

Knowledge of networking protocols; familiarity with NIC architecture and operation.
Understanding of GPU or AI infrastructure (e.g., DCGM, PyTorch).
Familiarity with observability systems (metrics, logs, traces); experience with OpenTelemetry, Prometheus, or distributed tracing.
Enjoyment of challenging projects.

Responsibilities

Design and build scalable backend systems for metric collection, processing, and analysis.
Develop robust methods to detect complex infrastructure issues that impact AI workloads.
Build large distributed systems running in production environments.
Collaborate across teams to deliver reliable, performant, and maintainable systems.