Lead SRE/DevOps Engineer

Synechron Inc, Pittsburgh, PA

About The Position

We are seeking a highly skilled Lead Site Reliability Engineer (SRE) / DevOps Engineer to drive the reliability, observability, and operational excellence of our platforms. This role will lead major initiatives around monitoring, automation, incident response, and performance optimization, leveraging enterprise tools such as Dynatrace, BigPanda, and LogScale/MonPro. The candidate will partner closely with engineering, operations, and product teams to build robust systems, improve service availability, and ensure a seamless user experience through proactive observability and best-in-class SRE practices.

Requirements

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
  • Hands-on expertise with observability/monitoring tools such as Dynatrace (APM, RUM, dashboards, alerting); BigPanda (event correlation, incident response); and LogScale, MonPro, LogicMonitor, or similar log and metrics platforms.
  • Solid experience with cloud platforms (AWS, Azure, or GCP).
  • Strong proficiency in automation & orchestration (Terraform, Ansible, Jenkins, GitHub Actions, etc.).
  • Proven track record in incident management, RCA, and implementing reliable SRE practices.
  • Experience with CI/CD pipelines, infrastructure as code, and configuration management.
  • Deep understanding of Linux systems, networking fundamentals, and distributed system design.
  • Strong scripting abilities (Python, Bash, PowerShell, or equivalent).
  • Excellent communication, leadership, and cross-team collaboration skills.

Nice To Haves

  • Experience leading SRE or DevOps teams.
  • Knowledge of chaos engineering, advanced anomaly detection, and proactive alerting strategies.
  • Experience implementing SLI/SLO frameworks and performance optimization programs.
  • Familiarity with containerization (Docker, Kubernetes) and service meshes.

Responsibilities

  • Implement and enhance proactive observability frameworks to anticipate and mitigate issues before they occur.
  • Optimize experience monitoring and user interaction metrics across applications and services.
  • Manage and improve the event catalog, ensuring all system events are structured and actionable.
  • Build and maintain dashboards, alerts, and health reporting using tools like Dynatrace, BigPanda, MonPro, and LogScale.
  • Perform service tuning to improve system performance based on real-time metrics and data analysis.
  • Establish and maintain observability standards and best practices across teams.
  • Conduct chaos testing and resilience validation to ensure high system availability.
  • Lead anomaly detection practices to quickly identify and respond to unusual system behavior.
  • Ensure platform stability, performance, and reliability through proven reliability engineering principles.
  • Drive SRE initiatives, including continuous improvement projects within the Site Reliability Center.
  • Develop, maintain, and scale automated orchestration pipelines to streamline operations and improve efficiency.
  • Create, maintain, and enforce SRE standards, including SLIs, SLOs, and operational playbooks.
  • Lead and conduct root cause analysis for critical incidents and drive long-term remediation improvements.
  • Own the problem management lifecycle: identifying, tracking, and resolving underlying issues to prevent recurring incidents.
  • Collaborate with cross-functional teams to address systemic issues and drive operational resilience.

Benefits

  • A highly competitive compensation and benefits package.
  • A multinational organization with 58 offices in 21 countries and the possibility to work abroad.
  • 10 days of paid annual leave (plus sick leave and national holidays).
  • Maternity & paternity leave plans.
  • A comprehensive insurance plan including medical, dental, vision, life insurance, and long-/short-term disability (plans vary by region).
  • Retirement savings plans.
  • A higher education certification policy.
  • Commuter benefits (varies by region).
  • Extensive training opportunities, focused on skills, substantive knowledge, and personal development.
  • On-demand Udemy for Business for all Synechron employees, with free access to more than 5,000 curated courses.
  • Coaching opportunities with experienced colleagues from our Financial Innovation Labs (FinLabs) and Centers of Excellence (CoE) groups.
  • Cutting-edge projects at the world's leading tier-one banks, financial institutions, and insurance firms.
  • A flat and approachable organization.
  • A truly diverse, fun-loving, and global work culture.

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Industry: Professional, Scientific, and Technical Services
Education Level: No Education Listed
Number of Employees: 5,001-10,000 employees
