SRE

Luminary Cloud • San Mateo, CA

56d

About The Position

Luminary Cloud helps engineering companies be more competitive by getting to market faster, creating new and better products, and reducing development risk. We do this with our Physics AI platform, the fastest and easiest way to build and deploy models to understand and instantly predict physical reality with precision. Customers span industries from automotive and aerospace to leading sporting equipment providers, including Otto Aviation, Joby Aviation, Piper Aircraft, and Trek Bikes. Luminary is a Series B company headquartered in San Mateo, California.

The Luminary Physics AI platform is a SaaS offering that runs on GCP. It uses GPUs for data generation, model training, and model inference, and it supports accelerated engineering design workflows. The product generates and consumes large volumes of data for Physics AI models and is used by some of the most demanding customers in the automotive, aerospace, and defense industries.

An elevated security and compliance posture, the ability to maintain five-nines SLAs, automation of most tasks, and the management of large data volumes make this an exciting opportunity for a production Site Reliability Engineer. The right candidate will apply software engineering principles to operations, focusing on system reliability, performance, and scalability. You will collaborate closely with engineering and product teams to design, deliver, and scale the core systems that power our platform. You will be responsible for suggesting product changes that allow us to support 10k simultaneous users on the platform with effective resource management.

Requirements

Proven experience designing and implementing scalable SaaS backend systems.
Strong understanding of cloud infrastructure (GCP preferred), CI/CD pipelines, and core SRE/DevOps concepts.
5+ years of experience building performant, scalable, distributed systems (or equivalent experience).
10+ years of experience required for Senior/Lead candidates.
Proficiency in Golang and Python is highly desirable.
Familiarity with Kubernetes and container orchestration.
Experience with Infrastructure as Code (Terraform) and cloud automation.
Strong understanding of operational practices and willingness to participate in on-call rotations.
Knowledge of modern security principles and IAM fundamentals.

Nice To Haves

Demonstrated success scaling infrastructure in a startup environment, including multicloud, hybrid, or on-prem deployments.
Proven experience mentoring and guiding engineers, supporting technical growth and career development.
Ability to act as a technical architect, making high-impact design decisions for reliable, scalable, and secure platform systems.

Responsibilities

Participate in on-call rotations and incident response, implementing effective remediation strategies and leading post-incident reviews to prevent recurrence
Define, monitor, and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to meet internal and external reliability targets
Apply software engineering practices to eliminate toil by automating operational tasks, improving overall efficiency, and contributing to the operational reliability of the platform
Develop and enhance our cloud infrastructure (GCP preferred) through automation and Infrastructure as Code (Terraform)
Develop, oversee, and maintain operational systems (from deployment pipelines to orchestration layers) ensuring application health, reliability, and scalability using containerized solutions like Kubernetes
Execute scalability and performance optimization strategies to ensure systems efficiently handle increasing workloads and future growth
Contribute to the design and implementation of highly available and fault-tolerant systems, leveraging Service-Oriented Architecture (SOA) or microservices principles
Participate in architectural discussions that influence the platform’s long-term reliability, performance, and scalability
Collaborate with security experts to integrate IAM, authentication, authorization, encryption, and related best practices into the infrastructure
Create and maintain comprehensive documentation on system architecture, infrastructure, and security practices