Site Reliability Engineer

TEKsystems
Chandler, AZ
Hybrid

About The Position

Seeking an experienced Site Reliability Engineer (SRE) to provide production support and reliability engineering for multiple enterprise-wide solutions within the Infrastructure Automation Solutions (IAS) organization. This role is responsible for ensuring the availability, performance, resiliency, and scalability of critical platforms supporting enterprise security, automation, orchestration, CI/CD pipelines, and cloud-based services. The SRE will partner closely with Product Managers, Engineering Leads, platform teams, and other SREs to design, build, operate, and continuously improve highly reliable enterprise solutions.

Requirements

  • 5+ years of experience supporting enterprise-scale production systems with a focus on reliability, operations, and automation.
  • High-level experience (5–10 years) supporting one or more of the following: enterprise security platforms, automation and orchestration solutions, workflow automation, CI/CD pipelines, or cloud platforms (public or private).
  • Strong hands-on experience with Linux and Windows administration, Ansible / Ansible Tower, and Terraform.
  • Proficiency in at least one programming or scripting language, such as Python, .NET, or Java.
  • Experience supporting and troubleshooting enterprise logging and monitoring platforms, such as Tivoli ITM, SiteScope, and Splunk.
  • Experience with Dynatrace Application Performance Monitoring (APM).
  • Experience with ITSM tools, including ServiceNow and/or BMC Remedy.
  • Strong troubleshooting, debugging, and root-cause analysis skills in complex enterprise environments.
  • Proven ability to collaborate across teams and communicate effectively with both technical and non-technical stakeholders.

Nice To Haves

  • Experience supporting large-scale financial services or regulated enterprise environments.
  • Familiarity with cloud-native architectures and DevOps/SRE best practices.
  • Experience developing dashboards and operational insights using Tableau.
  • Exposure to SRE concepts including SLIs, SLOs, error budgets, and reliability metrics.
  • Experience improving operational maturity through automation, standardization, and observability.
  • Bachelor’s degree in Computer Science, Engineering, or a related discipline (or equivalent experience).

Responsibilities

  • Provide 24x7 production support (including on-call rotation) for multiple enterprise platforms and automation solutions.
  • Ensure stability, availability, performance, and resiliency of enterprise security, automation, orchestration, and CI/CD platforms.
  • Act as an escalation point for incident management, root cause analysis, and problem resolution.
  • Lead and contribute to post-incident reviews, documenting root causes and driving corrective and preventive actions.
  • Collaborate with Product, Engineering, and Architecture teams to influence design decisions that improve reliability, scalability, and operability.
  • Automate repetitive operational tasks and implement self-healing and auto-remediation solutions using infrastructure-as-code and workflow automation.
  • Support and operate enterprise tools including, but not limited to:
      • Security & Endpoint Platforms: Tanium, CrowdStrike
      • Automation & Configuration Management: Ansible, Ansible Tower, Terraform, BMC Bladelogic
      • Orchestration & Workflow: BMC TrueSight Orchestrator
      • CI/CD & Artifact Management: JFrog Artifactory and Xray
      • Endpoint Management: Microsoft SCCM
  • Monitor system health and performance using enterprise monitoring, logging, and APM tools.
  • Partner with service management teams to manage incidents, changes, and problem records using ServiceNow and BMC Remedy.
  • Create and maintain operational documentation, runbooks, dashboards, and standard operating procedures.
  • Contribute to capacity planning, performance tuning, and platform modernization efforts.
  • Support cloud-based platforms and hybrid environments with a reliability-first mindset.

Benefits

  • Medical, dental & vision
  • Critical Illness, Accident, and Hospital
  • 401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available
  • Life Insurance (Voluntary Life & AD&D for the employee and dependents)
  • Short and long-term disability
  • Health Savings Account (HSA)
  • Transportation benefits
  • Employee Assistance Program
  • Time Off/Leave (PTO, Vacation or Sick Leave)