VP of Site Reliability

Titan AI
Hybrid

About The Position

Titan builds AI software for banks: purpose-built small language models, a banking ontology, and AI bankers that financial institutions can trust. Our models outperform general-purpose LLMs by 30 to 80 percent on banking tasks. Customers include community banks, credit unions, and large regional and super-regional institutions. We are backed by leading fintech investors and operate under the compliance, audit, and model-risk standards that banking requires.

Titan is scaling from a handful of live banking customers to thirty, then to hundreds. Each bank deploys differently: Azure, private cloud, or the bank's existing infrastructure. The core problem this role solves is making the platform work consistently and reliably across all of them, managing the last-mile deployment complexity that grows with every new customer.

This is a hands-on, principal-level role. You are not coming in to build an org chart. You are coming in to do the work: write the runbooks, stand up the on-call rotation, own incident command when a bank has an outage, and build the deployment playbook that takes us from client 10 to client 350. The practices get built before the teams do.

Requirements

  • Ten or more years in engineering, with at least five years personally building SRE or platform operations functions at a software company selling into enterprise or regulated markets.
  • Experience managing multi-tenant and multi-deployment-model infrastructure, including the last-mile complexity that comes with it.
  • Has written SLOs that people actually use.
  • Has stood up an on-call rotation from nothing.
  • Has been the technical owner during a production incident and knows what it costs to not have a process.
  • Earns trust from senior engineers without leaning on title.
  • Sees process as leverage, not overhead.
  • Is here to build, not to manage.

Nice To Haves

  • Has not spent their career in internal bank IT.
  • Comes from companies that ship software to customers and operate it at scale: ServiceNow, MongoDB, AWS, GCP, or comparable.

Responsibilities

  • Build the SRE practice and operate it yourself first: SLO framework, on-call rotation, and incident command process.
  • Write the SLOs, run the rotation, lead incident response at live bank customers, and produce the postmortems.
  • Define severity tiers, SLA commitments per customer tier, and escalation paths, and route alerts into a real queue.
  • Act as the accountable technical owner when a bank has a production incident.
  • Set the operating system across all four engineering lanes: sprint discipline, release rituals, code review standards, change management evidence, and the metrics the CEO and board read monthly.
  • Own the SOC 2 artifacts, model risk review documentation, and the change traceability that bank examiners scrutinize.

Benefits

  • Competitive base and meaningful equity.