Site Reliability Engineer

Enlyte | Chicago, IL
$91,000 - $110,000 | Hybrid

About The Position

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of critical technology services and platforms. This role emphasizes proactive incident response, service level management, and technical leadership in observability, with a particular focus on supporting .NET workloads running on Windows and Linux containers in AWS environments. The role centers on the applications and technology underpinning the PartsTrader customer-facing products. The SRE will collaborate closely with technology teams to identify and remediate risks, drive continuous improvement, and maintain operational excellence in an evolving microservices architecture that requires a high degree of availability.

Expected Hours: Monday - Friday (8am to 5pm), with flexibility to meet with New Zealand stakeholders as needed.

Environment: Onsite 4 days/week; Remote Fridays

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent professional experience.
  • 5 years of proven experience in site reliability engineering, incident response, and operational support for cloud-based applications.
  • Demonstrated expertise with observability and monitoring tools in microservice architectures.
  • Strong proficiency with AWS services, including EC2, ECS/EKS, CloudWatch, IAM, and networking.
  • Expert communicator in written, verbal, and diagrammatic mediums, able to effectively interact with and present to all levels of the organization.
  • Ability to get up to speed quickly in new technical or business domains.
  • Ability to work after hours or weekends as required.

Nice To Haves

  • Extensive experience in incident management, escalation procedures, and service level reporting.
  • Strong commitment to delivering exceptional service and operational excellence.
  • Ability to anticipate potential impacts, think strategically, and act proactively during high-priority incidents.
  • Exceptional interpersonal and “soft” skills, demonstrated by building strong relationships, influencing peers and senior stakeholders, and navigating conflict to achieve successful outcomes.
  • Advanced problem analysis and solving skills for complex technical issues.
  • Familiarity with CI/CD tools, infrastructure-as-code, and automation frameworks.
  • Knowledge of container orchestration platforms (e.g., Kubernetes) and related AWS services.
  • Familiarity with AI tooling that can assist in incident response and site reliability activities.

Responsibilities

  • Incident Response & Management: Lead and participate in the full incident lifecycle, including detection, triage, escalation, resolution, and post-incident reviews. Maintain readiness for high-priority incidents and ensure timely communication and documentation.
  • Observability & Monitoring: Implement, maintain, and optimize observability tools such as New Relic for distributed microservices. Develop and refine dashboards, alerts, and analytics to proactively detect issues and improve system reliability.
  • Service Level Management: Define, measure, and report on Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). Provide regular reporting on service health and performance to stakeholders.
  • AI-Driven Operations: Design and operate AI-enabled SRE workflows, including LLM-assisted incident triage, post-incident analysis, and runbook automation. Explore agentic approaches to reduce manual toil and improve the speed and consistency of operational responses.
  • Technical Support & Troubleshooting: Provide expert support for .NET workloads deployed on Windows and Linux containers, with a focus on AWS infrastructure. Troubleshoot complex issues across applications, platforms, and network layers.
  • Continuous Improvement: Collaborate with engineering and DevOps teams to identify opportunities for automation, reliability enhancements, and process improvements. Participate in root cause analysis and implement corrective actions.
  • Documentation & Knowledge Sharing: Create and maintain technical documentation, incident records, runbooks, and best practices for operational processes.
  • Collaboration: Work effectively with cross-functional teams, including developers, QA, product managers, and business stakeholders, to ensure alignment on reliability goals and incident action plans.
  • Maintain a high level of professionalism with regard to attitude, conduct, appearance, confidentiality, and service excellence.
  • Effectively engage with internal customers via email, telephone, and in person to provide guidance and support.
  • Demonstrate a sense of urgency, initiative, responsiveness, and attention to detail.
  • Support the technology teams in optimizing .NET applications deployed on Windows and Linux containers in AWS cloud environments to enhance reliability and supportability.
  • Configure, maintain, and enhance observability tooling frameworks for monitoring microservices, logging, and tracing.
  • Assist with deployment, scaling, and maintenance of containerized workloads using AWS ECS.
  • Serve as a technical escalation point for production issues, ensuring rapid resolution and minimal business impact.
  • Maintain and improve CI/CD pipelines and automation supporting reliable application delivery.

Benefits

  • We’re committed to supporting your ultimate well-being through our total compensation package offerings that support your health, wealth and self. These offerings include Medical, Dental, Vision, Health Savings Accounts / Flexible Spending Accounts, Life and AD&D Insurance, 401(k), Tuition Reimbursement, and an array of resources that encourage a lifetime of healthier living. Benefits eligibility may differ depending on full-time or part-time status.
  • Compensation depends on the applicable US geographic market. The expected base pay for this position ranges from $91,000 to $110,000 annually and will be based on a number of additional factors, including skills, experience, and education.
  • The Company is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, status as a protected veteran, or status as a qualified individual with a disability, among other protected characteristics.