Principal SRE

Gradial • Seattle, WA

$180,000 - $240,000

About The Position

Gradial is a Seattle-based startup enabling digital experiences at the speed of thought. We empower marketers and creatives to implement their ideas directly, with software that adapts over time. Our platform automates website and design system updates, large-scale migrations to new design systems, and continuous content optimization while adhering to company and product brands. Backed by world-class investors, we’re looking to scale our platform and expand our team.

At Gradial, we operate with extreme ownership, a bias toward action, and critical path planning. We tackle problems from first principles, question assumptions, and find creative solutions. If you want to take risks, work on groundbreaking technology, and see the direct impact of your work, Gradial is where you belong.

The Role

As a Principal Site Reliability Engineer at Gradial, you will shape the foundation our platform runs on as we scale. You will work closely with the CTO and engineering team to make our systems faster, more resilient, and easier to operate in a high-growth environment. This is a hands-on IC leadership role for someone who wants real ownership, high leverage, and the chance to define what reliability looks like at an AI-native company.

Requirements

5+ years of experience in SRE, DevOps, platform engineering, or infrastructure roles with direct ownership of production systems.
Proven success designing and operating production-grade infrastructure in fast-moving, high-growth environments.
Deep expertise in Kubernetes, cloud-native architecture, and container orchestration.
Strong experience with infrastructure as code, GitOps, CI/CD workflows, and modern deployment practices.
Strong command of observability and reliability fundamentals across metrics, logging, tracing, alerting, and incident response.
A track record of leading through influence, making sound technical decisions, and raising the bar across engineering teams.

Nice To Haves

Familiarity with AI or ML infrastructure, including GPU provisioning, model deployment, or compute-intensive workloads.
Experience supporting single-cloud or multi-cloud environments with a focus on resilience and scale.
Comfort with TypeScript or Python for internal tooling and operational automation.

Responsibilities

Own the reliability, scalability, and operational health of Gradial’s production platform.
Lead the evolution of Kubernetes, CI/CD, observability, and infrastructure as code across the stack.
Set the standard for how we design, ship, and operate reliable systems.
Build the tooling and automation that help engineers move faster with more confidence.
Drive improvements in monitoring, alerting, incident response, and service readiness.
Partner with engineering to spot scaling risks early and solve them before they slow us down.
Influence the long-term direction of our platform across reliability, security, performance, and cost.