Senior Software Engineer, Site Reliability Tooling

Recruiting From Scratch
San Francisco, CA (Remote)

About The Position

Our client is a leading AI-driven lending marketplace transforming how banks and credit unions evaluate and approve borrowers. Their platform delivers higher approval rates, lower loss rates, and a seamless digital-first experience, enabling more than 80% of applicants to be automatically approved without document uploads. They operate as a digital-first company with hubs across the U.S., and employees join because they’re motivated by the mission: increasing access to fair, effortless credit by leveraging modern AI and real-time data.

As a Senior Software Engineer focused on Site Reliability Tooling, you will play a key role in the reliability, resilience, and observability of large-scale production systems. You’ll design and build tools that empower engineering teams to maintain uptime, deploy safely, and understand system performance across complex microservice architectures.

Requirements

  • 6+ years combined experience in Software Engineering, Site Reliability Engineering, and/or DevOps.
  • Strong proficiency in Python, Go, and/or JavaScript/TypeScript.
  • Hands-on experience with Infrastructure-as-Code (Terraform, CDK, CloudFormation).
  • Proven background building internal tooling and applying strong software engineering fundamentals (architecture, testing, TDD).
  • Strong grounding in data structures and algorithms.
  • Experience with on-call, incident response, and incident management workflows.
  • Experience with modern observability tools such as Datadog, Prometheus, Grafana, and CloudWatch.
  • Experience supporting high-scale SaaS systems in microservice cloud environments.
  • Ability to work cross-functionally to drive large engineering initiatives.
  • Data-driven mindset focused on metrics, reliability, and continuous improvement.

Nice To Haves

  • Experience with service mesh technologies.
  • Full-stack engineering capabilities.
  • Background building tooling for observability or monitoring platforms.
  • Experience leveraging LLMs / GenAI to improve SRE workflows (chatops, auto-remediation, alert summarization, etc.).

Responsibilities

  • Champion SRE principles across engineering and promote a strong culture of service ownership and reliability.
  • Build internal tooling from scratch to improve observability, monitoring, alerting, and operational workflows.
  • Implement standards to monitor microservices, web apps, mobile apps, machine learning systems, databases, and Kubernetes clusters.
  • Improve incident response processes, including on-call workflows, retrospectives, and reliability reporting.
  • Automate toil through infrastructure tooling, scripts, and scalable platform services.
  • Help define the long-term strategy for reliability, disaster preparedness, and operational risk mitigation.
  • Collaborate across multiple engineering groups to deliver enterprise-wide reliability initiatives.

Benefits

  • Comprehensive medical, dental, and vision coverage with HSA contributions
  • 401(k) with 100% match up to $4,500 (immediate vesting)
  • Employee Stock Purchase Plan
  • Life and disability insurance
  • Flexible vacation, holidays, sick leave, and safety leave
  • Parental, family care, and military leave
  • Annual wellness, technology, and ergonomic reimbursements
  • Team events, ERGs, volunteer groups
  • When onsite: catered lunches, snacks, and drinks
  • Quarterly team onsite sessions (travel covered)