Site Reliability Engineer

mthree•Jersey City, NJ

$140,000 - $170,000•Onsite

About The Position

Our client is seeking a highly motivated Site Reliability Engineer responsible for ensuring the reliability, scalability, and performance of large-scale systems and applications. The role blends software engineering, infrastructure engineering, and production support, with a strong focus on automation and observability.

mthree is a technology and business consultancy with a global workforce delivering significant business and IT projects in some of the largest financial services organizations worldwide. Our Expert program offers experienced professionals access to top roles in tech, finance, aviation, and insurance. Join us to work on groundbreaking technology projects, from international trading platforms to critical applications for leading airlines. We recruit professionals who are eager to fast-track their careers in technology or operations within prestigious global organizations.

Requirements

10–15+ years of experience in SRE, software engineering, or infrastructure engineering
Strong experience with cloud platforms (AWS/Azure)
Proven experience supporting large-scale distributed systems
Programming: Python, Java, or .NET
DevOps: CI/CD tools (Jenkins, Git), GitOps
Observability: Splunk, Prometheus, Grafana, Dynatrace
Systems: Linux/Unix, networking, load balancing, DNS
Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Error budgets and reliability engineering practices
Incident response and resiliency engineering
Strong collaboration and stakeholder management
Ability to lead initiatives and influence engineering culture
Problem-solving in high-pressure production environments
Currently authorized to work in the United States on a full-time basis

Responsibilities

Define and track service reliability goals (SLIs/SLOs) across applications
Ensure high availability, scalability, and performance of systems
Own production issues end-to-end and ensure problems do not recur
Design monitoring, logging, and tracing systems (dashboards, alerts)
Enhance operational visibility into platform performance
Evaluate and improve monitoring coverage for new releases
Automate manual operational tasks and workflows
Build tools/software to reduce “toil” and improve efficiency
Implement CI/CD pipelines and automation frameworks
Participate in major incident triage and troubleshooting
Identify and resolve root causes of complex outages
Collaborate with problem management teams to prevent recurrence
Work closely with software engineering, infrastructure, and architecture teams
Influence adoption of reliable design patterns and best practices
Drive early integration of non-functional requirements (reliability, scalability)
Identify bottlenecks, capacity constraints, and vulnerabilities
Optimize system performance and cost efficiency
Plan for growth and scaling needs