Sr. SRE Platform Architect

Bitdeer Technologies Group, San Jose, CA

About The Position

Bitdeer is seeking a visionary and hands-on Cloud SRE Architect to lead the design, development, and evolution of our next-generation public cloud platform. This role will oversee the end-to-end architecture across CPU, GPU, RDS, storage, networking, serverless, and AI services, ensuring global scalability, reliability, and performance. The ideal candidate is a strategic thinker with deep technical expertise in cloud infrastructure, platform engineering and AI systems, capable of bridging architecture vision with real-world engineering execution. You will collaborate closely with cross-functional teams and global partners to define our cloud technology roadmap, optimize multi-region deployments, and deliver world-class infrastructure and platform solutions that power large-scale AI and enterprise workloads.

Requirements

  • 10+ years of production SRE / platform-engineering / infra-architecture experience, including ≥ 3 years at architect level.
  • Hands-on with GPU / AI-compute infrastructure — NVIDIA GPU ops (DCGM, MIG, vGPU, NVLink/NVSwitch, XID semantics, NCCL), InfiniBand or RoCE fabrics (subnet manager, fabric partitioning, optical health), HPC storage (Lustre, NetApp/Pure/DDN/VAST, NVMe-oF).
  • Multi-region observability at scale — metrics / logs / traces / profiles / analytics-lake substrate; recording rules, multi-window, multi-burn-rate (MWMBR) alerting, SLI/SLO discipline.
  • Cluster platforms — first-hand experience with Kubernetes (control plane + GPU Operator + topology-aware scheduling) AND at least one of Slurm / Volcano / Kueue / Ray / KubeRay.
  • Data-center operations — ZTP, BMC/IPMI/Redfish, BIOS/firmware lifecycle, RMA, multi-vendor OEM management (self-built + leased DC mix).
  • Strong DDD instincts — bounded contexts, public contracts, no shared databases, one-context-one-repo discipline.
  • Plugin framework design — you have built (or substantively contributed to) a real extension framework with a uniform manifest + lifecycle.
  • Writing fluency — you can author and maintain a multi-thousand-line architecture document under review without it drifting; you can also write a one-pager an executive will read.
  • Cross-team operating tempo — design reviews, runbook authorship, on-call shadowing, post-mortem facilitation.
  • BS/MS in Computer Science or similar.

Nice To Haves

  • Hyperscale or NeoCloud experience.

Responsibilities

  • Own the end-to-end architecture of the NeoCloud SRE platform — the substrate that observes, protects, and operates a multi-region GPU rental fleet across self-built and OEM-rented data centers.
  • Write and maintain the platform architecture document — keep the design coherent across all sections, frameworks, and tiers.
  • Review every framework-level change — new bounded context, new plugin kind, tier-deployment shift, schema change, naming change, cross-context contract change.
  • Set design invariants — residency rules (raw data stays in Region), Tier 2 self-sufficiency budget (≥ 24 h), survival-uplink contracts, naming conventions, SLO catalogues, redaction-at-boundary rules.
  • Run the plugin framework — every extension uses one uniform contract (Common + Domain manifest, lifecycle, observability). Author and evolve this contract.
  • Decide tier placement — what runs at Edge DC vs Regional Controller vs Global Hub, with data-residency / compliance / availability tradeoffs explicit.
  • Coordinate with cloud-service teams and tenants — they author plugins, SDKs, dashboards, agent recipes that ride the platform. Set the contracts they consume.
  • Coordinate with Security — joint ownership of vulnerability management, exposure management, joint operations. Security owns policy and risk acceptance; you own the operational mechanisms they ride.
  • Pre-flight roadmap items — for any new capability, produce a one-page design that fits the existing layered model (L1–L6), tier topology, naming conventions, and extension contracts before implementation starts.
  • Defend the design under review — say no to scope creep, special-case workarounds, and one-off integrations that don't fit the framework model. Say yes when a new plugin kind is genuinely needed.