Site Reliability Engineer II

Gartner, Irving, TX
$88,000 - $137,000, Hybrid

About The Position

Join a world-class team of skilled engineers who build creative digital solutions to support our colleagues and clients. We make a broad organizational impact by delivering cutting-edge technology solutions that power Gartner. Gartner IT values its culture of nonstop innovation, an outcome-driven approach to success, and the notion that great ideas can come from anyone on the team. Gartner is looking for a Site Reliability Engineer to join our collaborative, Agile team. This position will improve Gartner’s customer experience and increase the value of our products by increasing the reliability and performance of our client-facing application and service offerings.

Requirements

  • 5+ years of information technology experience, with 3+ years working on a DevOps/SRE team or similar
  • Experience with incident and response management.
  • Experience with AWS cloud, specifically services such as EC2, EKS, API GW, Lambda, etc. or similar cloud technologies & services
  • Experience with back-end technologies such as J2EE, JDBC, Tomcat, .NET Core/C#, Spring, Hibernate, etc.
  • Experience with building tools to automate production support activities that enable efficiency and productivity of Support teams
  • Prior experience working as a Cloud DevOps Engineer, Build & Release Engineer, or System Administrator is preferred.
  • Prior experience with Docker container orchestration using Kubernetes, including creating pods, ConfigMaps, and deployments using Jenkins
  • Working knowledge of client-side technologies such as Node.js, JavaScript, React, and jQuery
  • Experience with troubleshooting, root-cause analysis, application design, and implementing components.
  • Working experience with monitoring tools like Splunk and APM tools such as Dynatrace, DataDog, New Relic, AppDynamics, etc.
  • Working knowledge of production support processes such as incident/change/problem management, call triaging and escalation procedures.
  • Exposure to Akamai, Cloudflare, or CloudFront as a CDN
  • Strong Operating Systems (UNIX/Linux) background.

Nice To Haves

  • Exposure to Performance Engineering concepts
  • Exposure to chaos testing or chaos engineering
  • Experience collaborating with Dev, DBA, Architecture, or other relevant teams and performing root cause analysis, with good working knowledge of applications, processes, and operating systems
  • Advanced analytical, problem-solving, and oral and written communication skills
  • Highly adaptable to changing circumstances.
  • Interested in and capable of continuously learning new skills and technologies.

Responsibilities

  • Measure performance against SLOs in partnership with stakeholders, and ensure systems continue to meet SLOs over time.
  • Work to improve performance, scalability, and stability of applications.
  • Participate in operational support and on-call rotation shifts for supported systems and products.
  • Respond to production incidents, help triage application and system issues, and identify root causes or remediations to restore services quickly
  • Conduct blameless postmortems to troubleshoot priority incidents.
  • Use automation to reduce the probability and/or impact of problem recurrence.
  • Identify and evaluate alerting posture
  • Create dashboards and reports to communicate key metrics.
  • Implement and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation
  • Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams.
  • Participate in continuous improvement in software quality and infrastructure reliability and resilience.
  • Build and maintain documentation for all assigned projects.
  • Build and maintain performance testing frameworks, tools, and methodologies
  • Automate manual operational work (i.e., “toil”) using pipelines, new software, or other appropriate mechanisms
  • Conduct analytics on previous incidents to understand root causes and better predict and prevent future issues.
  • Keep a proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
  • Work with stakeholders such as Dev teams or product owners to define service level objectives (SLOs) for application and system operations.
  • Collaborate with development teams to promote the concept of reliability engineering during all phases of the SDLC to detect and correct performance issues and meet availability goals.

Benefits

  • Competitive compensation.
  • Limitless growth and learning opportunities.
  • A collaborative and positive culture - join a diverse team of professionals that are as smart and driven as you.
  • A chance to make an impact – your work will contribute directly to our strategy.
  • Hybrid Work Environment - enjoy the flexibility of working from home and the energy of collaborating with peers in our dynamic offices.
  • 20+ PTO days plus holidays and floating holidays in your first year.
  • Extensive medical, dental, and vision insurance plans.
  • 401K with corporate match, immediate vesting.
  • Health-and-wellness-related allowance programs.
  • Parental leave.
  • Tuition reimbursement.
  • Employee Stock Purchase Plan.
  • Employee Assistance Program.
  • Gartner Gives Charity Match.