Software Engineering III - SRE

JPMorgan Chase & Co. • Jersey City, NJ

About The Position

As a Site Reliability Engineer on the AI/ML Data Platforms team, you will play a key role in building and supporting scalable, resilient data solutions. You will perform root cause analysis, implement production changes, and collaborate with cross-functional teams to drive improvements. You will also mentor team members and partner with colleagues across our global network. Your work will directly impact the reliability and performance of our AI/ML platforms.

Requirements

  • Proficient in site reliability culture and principles, with experience implementing them within applications or platforms
  • Skilled in running production incident calls and managing incident resolution
  • Experience with observability, including monitoring, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
  • Strong understanding of SLI/SLO/SLA and error budgets
  • Proficiency in Python or PySpark for AI/ML modeling
  • Ability to automate tasks and reduce toil through tool development
  • Hands-on experience in system design, resiliency, testing, operational stability, and disaster recovery
  • Awareness of risk controls and compliance with organizational standards
  • Ability to work collaboratively and build meaningful relationships

Nice To Haves

  • Experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies
  • AWS and Databricks certifications

Responsibilities

  • Develop and support AI/ML solutions for troubleshooting and incident resolution
  • Coordinate incident management coverage to ensure effective resolution of application issues
  • Collaborate with cross-functional teams to perform root cause analysis and implement production changes
  • Apply expertise in application development and support using technologies such as Databricks, Snowflake, AWS, and Kubernetes
  • Mentor and guide team members to drive strategic change
  • Build tools to automate repetitive tasks and reduce operational toil
  • Ensure compliance with risk controls and company standards
  • Contribute to system design, resiliency, testing, operational stability, and disaster recovery
  • Foster a collaborative team environment to achieve common goals

What This Job Offers

Job Type: Full-time

Career Level: Mid Level

Education Level: Not listed

Number of Employees: 5,001-10,000
