Site Reliability Engineer

Infosys LTD • Lake Mary, FL

About The Position

Infosys is looking for a Site Reliability Engineer to work in a global production and infrastructure support environment for a global bank client, focusing on platform, system, and application stability and reliability as they affect user experience and customer care. Responsibilities also include troubleshooting, fault analysis, failure avoidance, and the diagnosis and resolution of impacting issues in a distributed environment. You will work closely with Business Operations, Application Support, Development, and Engineering teams to provide input on new functionality, system performance, observability, reliability, capacity management, monitoring, and the testing of infrastructure upgrades and platform releases.

Requirements

Bachelor's degree or foreign equivalent required from an accredited institution. Will also consider three years of progressive experience in the specialty in lieu of every year of education
At least 4 years of Information Technology experience
Candidate must be located within commuting distance of Jersey City, NJ or Lake Mary, FL, or be willing to relocate to the area. This position may require travel to project locations.
Candidates authorized to work for any employer in the United States without employer-based visa sponsorship are welcome to apply. Infosys is unable to provide immigration sponsorship for this role at this time.
Experience implementing and managing Kubernetes clusters with auto-scaling, self-healing, and release automation.
Experience with Infrastructure as Code (IaC): automating infrastructure using Terraform, Ansible, and Helm; managing state files and locking mechanisms.
Experience implementing logging, monitoring, and alerting using any one of Dynatrace, Datadog, Splunk, Nagios, Prometheus, Grafana, or OpenTelemetry for log aggregation, metrics, distributed tracing, and APM.
Experience defining and monitoring SLOs, SLIs, and SLAs; implementing DORA metrics, including MTTR; and conducting incident response and root cause analysis.

Nice To Haves

At least 5 years of experience developing Python scripts to automate SRE tasks and improve operational efficiency.
At least 5 years of experience in release management: participating in release planning, pre/post checks, and change request processes.
Experience with dashboards and alerts: creating Splunk dashboards and configuring alerts for business transactions and system health.
Experience in SRE enablement: leading SRE onboarding initiatives, establishing reliability frameworks, and driving cultural adoption across teams.
Strong understanding of SRE principles and golden signals (latency, traffic, errors, saturation).
Experience with SLA/SLO/SLI implementation and tracking.
Proficiency in Python or similar languages for automation.
Ability to monitor project deliverables across all phases to meet established quality benchmarks and stakeholder expectations.
Strong communication and analytical skills.
Experience and desire to work in a global delivery environment.