Senior Site Reliability Engineer

Prove

125d•$165,000 - $180,000

About The Position

We are seeking an experienced Senior Site Reliability Engineer to join our Platform Engineering team. In this role, you will be instrumental in designing, implementing, deploying, and maintaining complex, highly available, scalable, and reliable systems using automation, effective monitoring, and infrastructure-as-code. You will work closely with our application engineering teams to ensure our services meet the highest standards of reliability, performance, and security.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Expert knowledge of observability platforms and practices (OpenTelemetry, Prometheus, Grafana, Jaeger, ELK stack / Splunk, etc.)
Experience with Kubernetes and container orchestration
Strong experience with infrastructure-as-code tools (Terraform, CloudFormation, Pulumi)
Proficiency in at least one programming language (Go, Python, Java)
Deep understanding of cloud platforms (AWS, GCP, or Azure)
Experience implementing and managing CI/CD pipelines
Knowledge of network architecture and security principles
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Nice To Haves

Experience with distributed systems and microservice architectures
Knowledge of security best practices and compliance requirements
Experience with database administration (PostgreSQL, MySQL)
Familiarity with service mesh technologies (Istio, Linkerd)
Contributions to open-source projects
AWS/GCP certifications
Experience in the identity verification or financial technology industry
Hands-on experience with OpenTelemetry standards and technology
Application development experience

Responsibilities

Design and implement comprehensive observability solutions across our infrastructure and within applications
Lead the initiative to establish a company-wide instrumentation standard based on OpenTelemetry wide events
Build advanced monitoring dashboards that provide real-time visibility into system health and performance
Establish metrics, logging, and tracing systems that enable quick identification and resolution of issues
Create alerting thresholds and automated responses based on service level objectives (SLOs)
Drive a culture of observability throughout the engineering organization
Lead Kubernetes cluster management, optimization, and scaling initiatives
Design and implement infrastructure-as-code deployments for container-based applications
Optimize container resource allocation and utilization
Build automated deployment pipelines that ensure consistent, reliable releases
Establish best practices for containerization and orchestration across teams
Design, build, and maintain scalable cloud infrastructure on AWS
Implement infrastructure-as-code using tools such as Terraform
Automate routine operational tasks to reduce toil and improve efficiency
Ensure infrastructure security compliance and implement least-privilege access controls
Optimize cloud resource utilization and costs
Integrate observability-driven alerts with our Incident Management systems
Lead incident response efforts during service disruptions
Conduct thorough post-incident reviews and implement preventative measures
Use observability data to perform root cause analysis and system improvements
Document incidents, responses, and lessons learned to build organizational knowledge
Identify and resolve performance bottlenecks across the technology stack
Conduct capacity planning and scaling exercises to meet future demands
Implement auto-scaling solutions based on performance metrics
Optimize database performance and query efficiency
Design and implement application stress testing methods and systems

Benefits

Competitive salaries & Bonus Plan (for eligible roles) and Equity Plan
Modern Health for financial, mental, and physical wellness
401(k) Retirement Plan & Match (US Offices) and Local Country Pension (International Offices)
Unlimited Vacation and Flexible hours
Comprehensive medical benefits for you and your family
Emotional & Physical Wellness – Access to wellness services (EAP & Prove Well-Being Reimbursement)
Bottomless snacks & beverages for certain office locations
Daily GrubHub stipend for lunch if coming into the office (US Offices)
A great place to work and connect with other talented Provers like yourself!

What This Job Offers

Job Type: Full-time
Career Level: Senior
Education Level: Bachelor's degree
Number of Employees: 251-500 employees
