About The Position

As a Senior Application Support Engineer (SRE), you will play a critical role in ensuring the stability, reliability, and performance of mission-critical applications at DTCC. This role goes beyond traditional support, focusing on Site Reliability Engineering principles, proactive system improvement, and operational excellence. You will partner closely with development, infrastructure, and global operations teams to enhance system resilience, reduce operational toil, and drive continuous improvement across the platform.

Requirements

  • 6+ years of experience in application support, SRE, or production engineering
  • Bachelor's degree preferred, or equivalent experience
  • Strong understanding of SRE principles, including reliability engineering, observability, and incident prevention
  • Experience working in Linux and Windows environments, with strong troubleshooting and log analysis skills
  • Hands-on experience with monitoring and observability tools (e.g., Splunk, Grafana)
  • Working knowledge of SQL for analysis and troubleshooting
  • Experience with ITSM tools (e.g., ServiceNow) for incident, problem, and change management
  • Familiarity with job scheduling and modern platforms (e.g., Autosys, OpenShift, containers)
  • Exposure to mainframe technologies, including job processing, scheduling, and legacy system interactions
  • Understanding of AI/ML concepts in production support (e.g., automation, AIOps, anomaly detection, incident reduction)
  • Understanding of security fundamentals (certificates, access, credentials)
  • Experience supporting AWS-based applications and services
  • Strong communication, ownership, and problem-solving skills in high-pressure environments
  • Experience working with global, distributed teams

Responsibilities

  • Act as a Lead Application Support Engineer with SRE responsibilities, partnering with engineering and infrastructure teams to improve system reliability, resilience, and observability
  • Lead the resolution of critical production incidents, providing clear impact analysis, root cause identification, and preventive actions
  • Own and drive incident, problem, and major incident management, including post-incident reviews and continuous improvement
  • Proactively identify reliability risks and implement solutions to prevent recurrence and reduce operational toil
  • Develop, maintain, and enhance runbooks, knowledge articles, and operational documentation
  • Execute and support release, change, and deployment activities, including production releases and vendor upgrades
  • Support and participate in Disaster Recovery (DR) testing, execution, and audit readiness
  • Drive automation and alert optimization initiatives to improve efficiency and reduce noise
  • Embed risk, control, and reliability best practices into day-to-day operations
  • Collaborate with global teams to ensure high availability and operational excellence across systems

Benefits

  • Competitive compensation, including base pay and annual incentive
  • Comprehensive health and life insurance and well-being benefits, based on location
  • Pension / Retirement benefits
  • Paid Time Off, Personal/Family Care, and other leaves of absence when needed to support your physical, financial, and emotional well-being