Senior Site Reliability Engineer

Infosys, Hartford, CT
117d

About The Position

Infosys is seeking a Senior Site Reliability Engineer. The candidate will play a key role in implementing SRE practices focused on observability, event correlation, AIOps, chaos engineering, and automation. The candidate will work at the intersection of development and operations, ensuring high availability, scalability, and performance of the systems in scope.

Requirements

  • Candidate must be located within commuting distance of Richardson, TX, Raleigh, NC, Phoenix, AZ, Hartford, CT, or Indianapolis, IN, or be willing to relocate.
  • Bachelor's degree or foreign equivalent required from an accredited institution, or three years of progressive experience in the specialty in lieu of every year of education.
  • At least 11 years of Information Technology experience.
  • At least 6 years of Site Reliability Engineering (SRE) experience in large programs with a focus on architecting and implementing observability and automation.

Nice To Haves

  • Working knowledge of troubleshooting database failures and resolving them quickly.
  • Knowledge of service level indicators (SLIs), service level objectives (SLOs), and error budgets.
  • Experience with event correlation, AIOps, and ITSM tools.
  • Familiarity with microservices architecture and REST APIs.
  • Experience with CI/CD tooling and best practices.
  • Knowledge of cloud platforms such as AWS, Azure, and Google Cloud.
  • Experience with container orchestration tools and practices, including Kubernetes and Docker Swarm.
  • Familiarity with infrastructure automation tools like Terraform, CloudFormation, Ansible, and Puppet.
  • Proficiency in programming and scripting languages such as Python, Java, Node.js, PHP, PowerShell, Bash/shell, or Perl, and in data formats such as JSON.
  • Experience with ITSM tools such as ServiceNow.
  • Excellent communication and client interaction skills.
  • Strong planning, project management, coordination, and analytical skills.
  • Hands-on experience working in a Global Delivery Model with onsite/offshore resources.
  • Exceptional organizational skills and ability to manage and prioritize tasks efficiently.
  • Proactive attitude and solid attention to detail.
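
For readers unfamiliar with the SLI, SLO, and error budget terms listed above, the relationship can be sketched in a few lines of Python. This is a minimal illustration, assuming a request-based availability SLO; the function name `error_budget` and its parameters are hypothetical and are not taken from any tool named in this posting.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget use for a request-based availability SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    All names here are illustrative assumptions, not a vendor API.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget already spent (above 1.0 means the SLO is breached).
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


if __name__ == "__main__":
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

With a 99.9% target over 1,000,000 requests, the budget allows roughly 1,000 failures, so 250 failures consume about a quarter of it.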

Responsibilities

  • Implement SRE practices focused on observability, event correlation, AIOps, chaos engineering, and automation.
  • Ensure high availability, scalability, and performance of systems.
  • Implement logging, monitoring, and alerting using tools like Dynatrace, Datadog, Splunk, Nagios, Prometheus, Grafana, ELK stack, or New Relic.
  • Analyze monitoring data and golden signals (latency, traffic, errors, and saturation) to identify trends and patterns and proactively address potential problems.
  • Debug and optimize code, and automate routine operational tasks.
  • Improve automation and increase the system's self-healing capability.
  • Participate in production incidents, perform root cause analysis (RCA), and drive post-mortem improvements.
  • Develop and maintain dashboards and reports to visualize system health and performance.
  • Use technologies such as Ansible, Python, Terraform, PowerShell/shell, and JSON to build automation that reduces operational toil.
  • Develop automation solutions for recurring incidents and service tasks in infrastructure operations, including provisioning, scaling, backup, performance management, security, and capacity management.

What This Job Offers

Career Level

Senior

Industry

Professional, Scientific, and Technical Services

Education Level

Bachelor's degree

Number of Employees

5,001-10,000 employees
