Founding Infrastructure / Platform Engineer

Silkline • Chicago, IL

About The Position

At Silkline, we have a bold mission: bring transparency and reliability to global supply chains. We are starting by building the modern operating system for procurement, helping businesses understand their supply networks and adapt to disruption. Silkline is a seed-stage tech startup backed by top venture investors, and we are looking to grow rapidly. Join us on the ground floor as we build the platform that the rest of the company will stand on.

Our customer base has grown 5x in the last year. Customers like Astranis, Castelion, Machina Labs, and many others now save dozens of hours per week on Silkline. Their expectations have grown with us, and part-time infrastructure ownership won't get us through the next chapter. We're hiring our first dedicated infrastructure engineer to own how Silkline runs in production.

Our footprint is Next.js, Hasura, Postgres, and Trigger.dev, all running in an EKS cluster managed 100% via Pulumi IaC across AWS GovCloud, and it's growing fast. Today, every engineer touches infra a little; nobody owns it. You will. You'll decide what we build, buy, and rip out, and what production discipline looks like here.

Our customers operate in regulated manufacturing, and they care deeply about our data handling, compliance, and security. Silkline is already SOC 2 Type I, ITAR-compliant, and on the uncommon path to becoming FedRAMP Moderate compliant. Compliance is a core product surface: it ships, it scales, and it unlocks deals. Your work is the next chapter: full FedRAMP authorization, and whatever our customers ask for next.

You will report to the CTO and partner with the whole engineering team. There is no infra manager above you; you are the function.

Requirements

5+ years in infrastructure, platform, or SRE roles at companies that care about reliability, including some early-stage experience where you wore many hats
Deep Pulumi expertise — or deep Terraform / CDK and willingness to go all-in on Pulumi. You think in IaC, ship reusable components, and can read a state file
AWS depth — not "used EKS once" depth, but "debugged VPC routing at 2am and have strong opinions on IAM" depth
Postgres operations chops — you've tuned queries, run replication, migrated tens of millions of rows without downtime, and know what `pg_stat_statements` is for
Production discipline — you've owned an on-call rotation, written useful postmortems, and know the difference between a real SLO and a vanity SLO
Builder mindset — you'd rather build the right abstraction than wire up another state-management script. You ship code, not just configs.
Comfort with ambiguity and early-stage chaos — priorities shift, scope shifts, and you own the outcome anyway

Nice To Haves

Took a company through FedRAMP (Moderate or High), CMMC, NIST 800-171, or maintained SOC 2 Type II at scale
AWS GovCloud (US) production experience
Familiarity with handling export-controlled (ITAR / EAR) or CUI data in cloud environments
Multi-tenant SaaS with per-customer SSO, data isolation, or compliance scoping (we run per-organization Auth0 tenants today)
Trigger.dev, Inngest, Temporal, or other durable-execution background-job platforms
Next.js production deployment experience (edge, ISR, preview environments)

Responsibilities

Own the architecture and operation of Silkline's cloud platform on AWS, codified in Pulumi
Build the data infrastructure practice: replication, backups, performance, and disaster recovery for Postgres
Build the developer platform that lets a small team ship like a larger one: CI/CD, preview environments, deploy safety, codegen pipelines, local dev parity
Mature the observability and on-call practice on Datadog, OpenTelemetry, PostHog, and Sentry: real SLOs, useful dashboards, postmortems people actually read
Maintain SOC 2 and ITAR, expand our AWS GovCloud footprint, and lead the company through FedRAMP authorization and the CMMC / NIST 800-171 controls our customers ask for
Make Trigger.dev a dependable backbone for syncs and integrations, with clear recovery paths when things break
Make our AI infrastructure (Bedrock, Cerebras, Braintrust evals) reliable and observable as it becomes a bigger part of the product