Senior Site Reliability Engineer (Application Support)

DTCC • Jersey City, NJ

4d • Hybrid

About The Position

As a Senior Application Support Engineer (SRE), you will play a critical role in ensuring the stability, reliability, and performance of mission-critical applications at DTCC. This role goes beyond traditional support: it focuses on Site Reliability Engineering principles, proactive system improvement, and operational excellence. You will partner closely with development, infrastructure, and global operations teams to enhance system resilience, reduce operational toil, and drive continuous improvement across the platform.

Requirements

6+ years of experience in application support, SRE, or production engineering
Bachelor's degree preferred, or equivalent experience
Strong understanding of SRE principles, including reliability engineering, observability, and incident prevention
Experience working in Linux and Windows environments, with strong troubleshooting and log analysis skills
Hands-on experience with monitoring and observability tools (e.g., Splunk, Grafana)
Working knowledge of SQL for analysis and troubleshooting
Experience with ITSM tools (e.g., ServiceNow) for incident, problem, and change management
Familiarity with job scheduling and modern platforms (e.g., Autosys, OpenShift, containers)
Exposure to mainframe technologies, including job processing, scheduling, and legacy system interactions
Understanding of AI/ML concepts in production support (e.g., automation, AIOps, anomaly detection, incident reduction)
Understanding of security fundamentals (certificates, access, credentials)
Experience supporting AWS-based applications and services
Strong communication, ownership, and problem-solving skills in high-pressure environments
Experience working with global, distributed teams

Responsibilities

Act as a Lead Application Support Engineer with SRE responsibilities, partnering with engineering and infrastructure teams to improve system reliability, resilience, and observability
Lead the resolution of critical production incidents, providing clear impact analysis, root cause identification, and preventive actions
Own and drive incident, problem, and major incident management, including post-incident reviews and continuous improvement
Proactively identify reliability risks and implement solutions to prevent recurrence and reduce operational toil
Develop, maintain, and enhance runbooks, knowledge articles, and operational documentation
Execute and support release, change, and deployment activities, including production releases and vendor upgrades
Support and participate in Disaster Recovery (DR) testing, execution, and audit readiness
Drive automation and alert optimization initiatives to improve efficiency and reduce noise
Embed risk, control, and reliability best practices into day-to-day operations
Collaborate with global teams to ensure high availability and operational excellence across systems

Benefits

Competitive compensation, including base pay and annual incentive
Comprehensive health and life insurance and well-being benefits, based on location
Pension and retirement benefits
Paid time off, personal and family care leave, and other leaves of absence when needed to support your physical, financial, and emotional well-being
