Senior Site Reliability Engineer

Fidelity Investments • Merrimack, NH

About The Position

This position is for a Sr. Site Reliability Engineer within the R4 Responsive OpsWorX Team, covering multiple products in the Brokerage Recordkeeping Technology organization. You will be responsible for responding to production incidents. You will work closely with our business partners to answer application-specific questions, and with the product teams to promote availability, resilience, and stability.

Requirements

Bachelor’s degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required.
Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE), with a focus on system stability, scalability, and performance.
3 years of hands-on experience with Amazon EKS and RDS.
Experience implementing and maintaining CI/CD pipelines.
Experience designing, implementing, and continuously improving observability solutions.
Experience instrumenting applications and infrastructure.
Experience leading incident response and conducting root cause analysis.
Experience collaborating with development, infrastructure, security, and business teams.
Experience analyzing and reverse‑engineering existing applications.
Demonstrated adaptability and a strong learning mindset.

Nice To Haves

Master’s degree is a plus.
Certification in public cloud (AWS) or Kubernetes is a plus.

Responsibilities

Respond to production incidents.
Work closely with business partners to answer application-specific questions.
Work with product teams to promote availability, resilience, and stability.
Build, manage, and optimize resilient, scalable cloud platforms using AWS-native services.
Lead and execute cloud migration initiatives, ensuring minimal downtime, performance optimization, and adherence to architectural best practices.
Implement and maintain CI/CD pipelines to enable reliable, automated, and secure application deployments.
Ensure platforms meet high availability, scalability, fault tolerance, and disaster recovery requirements.
Design, implement, and continuously improve observability solutions, including monitoring, logging, alerting, and distributed tracing, using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk.
Instrument applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability.
Proactively identify performance bottlenecks, capacity risks, and failure points; recommend and implement remediation strategies.
Lead incident response, providing rapid triage and resolution during production outages or performance degradation.
Conduct root cause analysis (RCA) for critical incidents and drive corrective and preventive actions.
Collaborate closely with development, infrastructure, security, and business teams to ensure alignment with operational and business objectives.
Analyze and reverse‑engineer existing applications to understand system behavior, integrations, and dependencies.
Continuously evaluate emerging technologies, tools, and industry trends to improve platform reliability and operational efficiency.
Demonstrate adaptability and a strong learning mindset in a fast-paced, evolving environment.
Apply Generative AI tools responsibly to improve productivity, including assisting with analysis, documentation, summarization, and ideation activities.
Utilize SQL and relational databases (Oracle or other RDBMS) to support application troubleshooting, reporting, and performance analysis.