VP of Site Reliability

Titan AI

Hybrid

About The Position

Titan builds AI software for banks: purpose-built small language models, a banking ontology, and AI bankers that financial institutions can trust. Our models outperform general-purpose LLMs by 30 to 80 percent on banking tasks. Customers include community banks, credit unions, and large regional and super-regional institutions. We are backed by leading fintech investors and operate under the compliance, audit, and model-risk standards that banking requires.

Titan is scaling from a handful of live banking customers to thirty, then to hundreds. Each bank deploys differently: Azure, private cloud, or the bank's existing infrastructure. The core problem this role solves is making the platform work consistently and reliably across all of them, managing the last-mile deployment complexity that grows with every new customer.

This is a hands-on, principal-level role. You are not coming in to build an org chart. You are coming in to do the work: write the runbooks, stand up the on-call rotation, own incident command when a bank has an outage, and build the deployment playbook that takes us from client 10 to client 350. The practices get built before the teams do.

Requirements

Ten or more years in engineering, with at least five years personally building SRE or platform operations functions at a software company selling into enterprise or regulated markets.
Experience managing multi-tenant and multi-deployment-model infrastructure, including the last-mile complexity that comes with it.
Have written SLOs that people actually use.
Have stood up an on-call rotation from nothing.
Have been the technical owner during a production incident and know what it costs to not have a process.
Earn trust from senior engineers without leaning on title.
See process as leverage, not overhead.
Not here to manage, but here to build.

Nice To Haves

Has not spent their career in internal bank IT.
Comes from companies that ship software to customers and operate it at scale: ServiceNow, MongoDB, AWS, GCP, or comparable.

Responsibilities

Build the SRE practice and operate it yourself first: SLO framework, on-call rotation, and incident command process.
Write the SLOs, run the rotation, lead incident response at live bank customers, and produce the postmortems.
Define severity tiers, SLA commitments per customer tier, and escalation paths, and route alerts into a real queue.
Act as the technical accountable owner when a bank has a production incident.
Set the operating system across all four engineering lanes: sprint discipline, release rituals, code review standards, change management evidence, and the metrics the CEO and board read monthly.
Own the SOC 2 artifacts, model risk review documentation, and the change traceability that bank examiners scrutinize.