Site Reliability Engineer

mthree | Jersey City, NJ
$140,000 - $170,000 | Onsite

About The Position

Our client is seeking a highly motivated Site Reliability Engineer responsible for ensuring reliability, scalability, and performance of large-scale systems and applications. The role blends software engineering, infrastructure engineering, and production support, with a strong focus on automation and observability.

mthree is a technology and business consultancy with a global workforce delivering significant business and IT projects in some of the largest financial services organizations worldwide. Our Expert program offers experienced professionals access to top roles in tech, finance, aviation and insurance. Join us to work on groundbreaking technology projects, from international trading platforms to critical applications for leading airlines. We recruit professionals who are eager to fast-track their careers in technology or operations within prestigious global organizations.

Requirements

  • 10–15+ years of experience in SRE, software engineering, or infrastructure engineering
  • Strong experience with cloud platforms (AWS/Azure)
  • Proven experience supporting large-scale distributed systems
  • Programming: Python, Java, or .NET
  • DevOps: CI/CD tools (Jenkins, Git), GitOps
  • Observability: Splunk, Prometheus, Grafana, Dynatrace
  • Systems: Linux/Unix, networking, load balancing, DNS
  • Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
  • Error budgets and reliability engineering practices
  • Incident response and resiliency engineering
  • Strong collaboration and stakeholder management
  • Ability to lead initiatives and influence engineering culture
  • Problem-solving in high-pressure production environments
  • Currently authorized to work in the United States on a full-time basis

Responsibilities

  • Define and track service reliability goals (SLIs/SLOs) across applications
  • Ensure high availability, scalability, and performance of systems
  • Own production issues end-to-end and ensure problems do not recur
  • Design monitoring, logging, and tracing systems (dashboards, alerts)
  • Enhance operational visibility into platform performance
  • Evaluate and improve monitoring coverage for new releases
  • Automate manual operational tasks and workflows
  • Build tools/software to reduce “toil” and improve efficiency
  • Implement CI/CD pipelines and automation frameworks
  • Participate in major incident triage and troubleshooting
  • Identify and resolve root causes of complex outages
  • Collaborate with problem management teams to prevent recurrence
  • Work closely with software engineering, infrastructure, and architecture teams
  • Influence adoption of reliable design patterns and best practices
  • Drive early integration of non-functional requirements (reliability, scalability)
  • Identify bottlenecks, capacity constraints, and vulnerabilities
  • Optimize system performance and cost efficiency
  • Plan for growth and scaling needs

Benefits

  • Comprehensive benefits package