Sr. SRE Platform Software Engineer

Bitdeer Technologies Group•San Jose, CA

About The Position

Build and operate one or more bounded contexts of the NeoCloud SRE platform — the multi-region substrate that observes, protects, and operates a GPU rental fleet across self-built and OEM-rented data centers. You take an architect-approved design and turn it into production code that ships through GitOps + the CICD release pipeline, ride the Plugin Framework conventions, meet declared SLOs, and stay drift-free. This is the build + run role. You don't only ship code; you ship a service that other squads, cloud-service teams, and tenants depend on. You take the on-call pager for what you build.

Requirements

7+ years of production software engineering experience, including 2 or more years operating what you built (real on-call experience, not just shipping code).
Production-depth mastery of at least one systems-grade language—Go (preferred), Rust, or Java.
Proficiency in Python for tooling and SDK work.
Strong grasp of at-least-once vs. exactly-once trade-offs, idempotency, back-pressure, leader election, consistent hashing, gossip, and fan-out. Ability to evaluate CRDT vs. Raft vs. Paxos and select the right tool for the job.
Experience at production scale with Prometheus, VictoriaMetrics, Mimir, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry. Must have built or substantively contributed to the ingest, query, or storage paths of these systems.
Hands-on experience with Argo, Flux, Helm, Kustomize, Cosign signing, signed-bundle promotion, and blast-radius-aware rollouts.
Proven experience writing a controller or CRD handling real production traffic, with a deep understanding of watch-cache mechanics, leader election, and reconcile loops.
Experience executing end-to-end mTLS bootstrap with certificate rotation. Hands-on experience with HashiCorp Vault or cloud KMS (AWS KMS / GCP KMS).
Ability to read a Prometheus query plan, build a recording-rule strategy, and write SQL that joins per-tenant telemetry against analytics-lake tables.
Rigorous approach to unit, integration, contract, chaos, and soak testing. Experience writing and maintaining your own comprehensive tests.
Ability to author clear design docs that align with existing platform architecture, create runbooks optimized for 3 AM on-call responses, and write intent-driven PR descriptions.

Nice To Haves

Experience in at least one of the following areas is a strong plus: NVIDIA Internals: Deep understanding of DCGM and NVIDIA driver internals, including XID semantics and MIG / vGPU partitioning.
Networking & Fabrics: Experience with InfiniBand or RoCE fabrics, including subnet managers, partitioning, optical health, and NCCL collective tracing.
HPC Storage: Experience managing Lustre, NetApp, Pure, DDN, VAST, or NVMe-oF under multi-tenant loads.
Hardware Management: Hands-on experience with BMC, IPMI, and Redfish at OEM scale (Supermicro, Dell, HPE, Lenovo).
Cluster Platform Internals: Familiarity with Kubernetes GPU Operator, Slurm controller, or Ray GCS.
BS/MS in Computer Science or similar
Hyperscale or NeoCloud experience

Responsibilities

Collection & Storage: collection-agent, customer-sdk-gateway, metrics-store, logs-store, traces-store, profiles-store, analytics-lake, enrichment-service, collection-monitor.
Alert, Correlation & SLO: alert-engine-framework, alert-correlation, slo-framework, default M-series alert rules.
Topology, Cluster-Health & Cluster Platform Services: topology-service, cluster-health-rollup, OSS-SRE-tool collection plugins for K8s, Slurm, Ray, Volcano, Kueue, and KubeRay.
Fault-Prediction: prediction-engine-framework and built-in predictors (GPU, Link, Disk, XPA, Straggler, SDC, Stranded GPU).
Remediation, Workflow, Inspection & Jobs: remediation-actuator, orchestration-substrate (workflow engine), inspection-orchestrator, job-scheduler, NCCL-baseline inspection probe.
Hardware Lifecycle & DC Ops: hardware-lifecycle, dc-operations, boot-provisioning, rolling-upgrade, bare-metal-bmc-service, auto-discovery, ZTP D0–D5 pipeline, IPMI bare-metal management.
Identity, Secrets, Tenant-Config & CMDB: iam-service, secrets-service, tenant-sre-config, cmdb-cache, schema registry.
Customer-Bridge, Ticketing & SRE Platform Portal: customer-bridge, customer-ticketing, sre-operation-system, Customer Console BFF, SRE Console BFF.
Backup, DR & Meta-Monitor: backup-orchestrator, meta-monitor, external-watcher integration (Datadog or equivalent).
CI/CD, GitOps, Plugin Framework & SRE Image Registry: cicd-pipeline, gitops-sync, plugin-registry, sre-image-registry.
Self-Improving Agent: agent-control-plane, agent-discovery, agent-codegen, agent-sandbox, per-Region LLM gateway.
Global SRE Management: maintenance-window-orchestrator, change-management, capacity-planner, cost-optimizer, gpu-efficiency-dashboard, network-stability-dashboard, patching-orchestrator, artifact-management, compat-matrix-service, security-platform.