Principal DevOps Engineer

AppZen, Inc. • San Jose, CA

About The Position

As Principal DevOps Engineer, you are the most senior individual contributor on the team. You set the technical direction, own the hardest infrastructure and reliability problems end-to-end, and lift the entire org through architecture, code, design reviews, and mentorship. You partner closely with the DevOps Manager and engineering leadership on roadmap and standards, but your scorecard is technical outcomes, not headcount. Expect roughly 70-80% deep hands-on engineering (Terraform, Kubernetes, Postgres, Elasticsearch, pipelines, incident command) and 20-30% technical leadership: design reviews, mentorship, cross-team alignment, and writing the standards others build against.

Requirements

10+ years of experience in DevOps, SRE, infrastructure, or platform engineering, with at least 3 years operating at a Staff or Principal level (or equivalent technical leadership scope).
Deep, hands-on AWS expertise across compute, networking, IAM, data, and observability services; demonstrated ownership of multi-account, multi-region SaaS architectures.
Strong production experience with Kubernetes (preferably EKS), including upgrades, autoscaling, and securing multi-tenant clusters.
Demonstrated hands-on operations experience with PostgreSQL at scale — query and index tuning, replication, HA/failover, backups, and version upgrades — and with Elasticsearch / OpenSearch (cluster sizing, shard strategy, ingest tuning, and incident response).
Working knowledge of additional datastores commonly used in SaaS: Redis, Kafka or other message brokers, and object storage; comfortable evaluating trade-offs between managed services (RDS, Aurora, ElastiCache, MSK, OpenSearch Service) and self-managed options.
Expert with Terraform and modern IaC patterns; clear opinions on module design, state management, and PR-driven workflows.
Strong scripting and automation skills in at least one of Python, Go, or Bash; comfortable contributing real code, not just reviewing.
Track record of designing and operating CI/CD pipelines at scale (GitHub Actions, Jenkins, ArgoCD, or similar).
Experience running production workloads under SOC 2 or comparable compliance frameworks; comfortable partnering with Security on audits and remediation.
Demonstrated technical leadership without formal authority: writing decision-grade design docs, mentoring engineers, and influencing across teams. You enjoy lifting others through your work.

Nice To Haves

Experience supporting AI/ML or data-heavy SaaS workloads (GPU fleets, vector stores, large async pipelines).
Familiarity with service mesh (Istio, Linkerd) and progressive delivery (Argo Rollouts, feature flags).
Background scaling FinOps practices and managing cloud spend at $5M+ annual run-rate.
Experience operating multi-tenant SaaS with strict data isolation requirements for enterprise finance customers.
Exposure to multi-cloud or hybrid-cloud environments (Azure, GCP).
Open-source contributions, conference talks, or internal tech-leadership artifacts (eng wikis, RFCs, paved-road frameworks).

Responsibilities

Set technical direction
Own the architecture for AppZen's cloud platform (AWS topology, Kubernetes design, datastore strategy, CI/CD, and observability); make the long-horizon calls and write the design docs the rest of engineering builds against.
Lead deep design reviews; set bar-raising standards for reliability, security, performance, and cost across infrastructure code and production systems.
Identify the highest-leverage platform investments (toil reduction, reliability, developer velocity) and drive them from idea to rollout.
Drive AWS architecture and operations across multiple regions and accounts; own the multi-account landing zone, IAM, and network patterns.
Set the Terraform module and IaC patterns the team uses; lead the hardest migrations and cleanups personally.
Partner with Security on SOC 2, ISO 27001, GDPR, and customer audit requirements; design controls for IAM, network, and secrets management.
Drive cloud cost engineering: visibility, forecasting, and optimization (Savings Plans, rightsizing, multi-tenant efficiency).
Be the team's go-to expert on PostgreSQL in production: schema and index strategy, query tuning, vacuum/bloat, replication, failover, point-in-time recovery, and major-version upgrades on RDS / Aurora.
Own scaling and reliability of Elasticsearch / OpenSearch: shard and index design, JVM/heap tuning, snapshot strategy, hot-warm tiers, and incident response under heavy ingest or query load.
Set patterns for supporting datastores: Redis (caching, queues), Kafka or SQS/SNS (streaming and async), and S3-backed data lakes — including HA, durability, and disaster recovery.
Lead capacity planning, performance benchmarking, data-tier cost optimization, backup/restore drills, and customer data isolation for multi-tenant workloads.
Own the architecture of our EKS-based Kubernetes platform: cluster lifecycle, autoscaling, multi-tenancy, and workload isolation.
Define the golden paths service teams use — Helm, Kustomize, and GitOps tooling such as ArgoCD or Flux — and personally build the trickiest pieces.
Set patterns for service mesh, ingress, and zero-downtime deployments.
Architect internal developer platform capabilities so product teams ship safely and quickly without infra friction.
Drive the design of build, test, and deploy pipelines (e.g., GitHub Actions, Jenkins, ArgoCD); enforce supply-chain security and artifact provenance.
Set the bar for DORA metrics (lead time, deploy frequency, change failure rate, and MTTR) and own the highest-impact improvements.
Architect the observability stack (e.g., Datadog, Prometheus, Grafana, OpenTelemetry); define metrics, logs, and tracing standards across services.
Define and operationalize SLOs and error budgets in partnership with service owners.
Act as incident commander for high-severity events; lead blameless post-mortems and convert learnings into durable systemic fixes.
Mentor senior and staff engineers; raise the bar through code and design reviews, pairing, and writing the reference docs and runbooks others learn from.
Represent Cloud Engineering in cross-team forums; influence Product Engineering, Security, and Data on architecture and reliability decisions without formal authority.
Help the DevOps Manager hire: calibrate the technical bar, design interview loops, and close senior candidates.