Principal Platform Engineer — Kubernetes & Cloud Infrastructure

Ombud•Quinte West, ON

Hybrid

About The Position

Ombud's platform runs production AI workloads for enterprise customers, and we're scaling toward a self-service model where customers onboard, ingest content, and operate the product without manual implementation. This requires an infrastructure foundation that can handle multi-tenant scale, high reliability, and the unique demands of generative AI workloads, all without ballooning the AWS bill.

We're hiring a Principal Platform Engineer to own that foundation. This is a senior individual contributor role with broad architectural authority and no direct reports. You will set the technical direction for our cloud infrastructure, partner with engineering on production scaling decisions, and operate the platform with the discipline a SOC 2 / ISO 27001 customer base requires.

Requirements

8+ years of platform, infrastructure, SRE, or DevOps experience, with at least 3 years operating production Kubernetes at scale.
Deep AWS expertise across compute, storage, networking, data services, and IAM.
Production fluency with Terraform, Docker, Linux, and CI/CD systems.
Track record of architectural decisions that materially improved reliability, cost, or developer velocity — with specific, measurable outcomes you can point to.
Comfort operating as a senior IC who sets technical direction across teams without formal authority.
Strong written communication — runbooks, architecture decision records, post-incident reviews.
Willingness to be in-office Tuesday through Thursday in Denver.

Nice To Haves

Production experience supporting generative AI or ML workloads (GPU node groups, vector databases, model serving).
Experience with Qdrant, Pinecone, Weaviate, or other vector stores in production.
PostgreSQL operational depth — replication, performance tuning, backup/restore.
Experience scaling a multi-tenant SaaS platform from ~100 customers to ~1,000.
SOC 2 Type II and ISO 27001 audit experience.
Familiarity with event-driven architectures (Kafka, Kinesis, or equivalent).

Responsibilities

Production Kubernetes (EKS) clusters: capacity planning, node group strategy, gen-AI workload isolation, blast-radius containment.
AWS infrastructure end-to-end: RDS, DMS, Kafka (MSK), ECR, networking, IAM, multi-region deployments (including Ireland for EU data residency).
Infrastructure-as-code in Terraform — modules, environments, drift management, peer review.
CI/CD pipelines (Jenkins, GitHub Actions, or your recommended replacement) — fast, reliable, secure builds for backend and frontend services.
Observability: Grafana dashboards, Prometheus metrics, log pipelines, on-call alerting, SLO definition.
Cost optimization. AWS spend is one of our top three variable costs. Reducing it by 20% is a tangible objective for this seat.
Security posture: secrets management (Consul/Vault), IAM hygiene, vulnerability patching, support for SOC 2 and ISO 27001 audit cycles.
Architecture leadership on the self-service infrastructure roadmap: how we onboard a customer without human intervention and scale to 10x our current tenant count.
Documentation and runbooks that let the rest of the engineering team operate the platform when you're unavailable.

Benefits

The infrastructure decisions you make will directly enable our 2026 strategy of moving from response management to autonomous revenue execution.
You'll work with a small, senior engineering team that ships fast and trusts each other.
