About The Position

As a Sr Lead Infrastructure Engineer-Infrastructure Monitoring at JPMorgan Chase within the Corporate Technology Enterprise Observability Platforms team, you will lead the modernization of infrastructure monitoring into a strategic, secure, scalable, and automation-enabled observability platform, strengthening firmwide resilience and delivering trusted operational insights. You will be a hands-on technical contributor who drives adoption and partners across infrastructure, application, and SRE teams to improve telemetry collection and signal quality, modernize event-to-incident workflows, and enable AIOps-driven reliability improvements aligned to business objectives.

Requirements

  • Formal training or certification in infrastructure engineering concepts and 5+ years of applied experience.
  • Demonstrated experience owning/operating enterprise-scale monitoring/observability platforms in production, and designing & delivering monitoring solutions across large Linux and Windows estates.
  • Strong expertise with enterprise-grade operating systems (Windows Server and/or Enterprise Linux), including secure configuration, patching, and vulnerability remediation in regulated environments.
  • Strong understanding of telemetry concepts (metrics, logs, traces, events) and practical OpenTelemetry collection and integration patterns.
  • Strong infrastructure knowledge across compute, networking, storage, databases, integration patterns, scaling, resiliency, and performance.
  • Advanced proficiency in automation and scripting (Python, Ansible, PowerShell, Bash) with strong use of CI/CD for controlled change and safe rollout.
  • Hands-on experience with infrastructure-as-code for repeatable, governed provisioning and deployments (e.g., Terraform).
  • Extensive experience operating in hybrid infrastructure environments spanning enterprise on-prem platforms and public/private cloud, including migration enablement and cloud operational patterns.
  • Hands-on experience with data stores such as MS SQL Server, Oracle, and Cassandra, and/or cloud-native databases.
  • Strong collaboration skills, with the ability to partner effectively across infrastructure, application, and SRE teams to align observability capabilities.

Nice To Haves

  • Experience operating large-scale enterprise monitoring platforms (e.g., Tivoli, SMARTS, IBM Instana, DX NetOps, ITNM, Netcool Suite) with deep operational ownership.
  • Experience with modern observability ecosystems including Splunk, Dynatrace, Grafana, Prometheus, and multi-tool interoperability patterns.
  • Experience with Kubernetes (e.g., EKS) for container orchestration and production operations.
  • Experience implementing AIOps workflows such as noise reduction, anomaly detection, probable root-cause analysis, and guided remediation with appropriate governance.
  • Experience with topology-driven monitoring and event correlation in large, distributed infrastructure environments.
  • Experience defining and operationalizing SLOs, error budgets, and reliability metrics across platform services.
  • Experience with network monitoring.

Responsibilities

  • Lead the modernization of the infrastructure monitoring platform, defining target-state architecture and roadmap while balancing near-term delivery with long-term resiliency, scalability, security, and usability goals
  • Engineer, operate, and continuously improve enterprise monitoring platforms to meet availability, performance, scale, and security requirements
  • Own platform design and architecture for telemetry collection and integration across metrics, logs, events, and traces, including OpenTelemetry patterns where applicable
  • Drive large-scale enterprise onboarding across Linux, Windows, and complex network estates, including lifecycle management, versioning/upgrade strategies, and governance controls
  • Standardize onboarding patterns (agents/collectors, configuration baselines, dashboards, alerting, metadata, and runbooks) to enable safe, repeatable adoption
  • Improve signal quality and actionability through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment to reduce MTTR and operational overhead
  • Develop and maintain production-grade automation, services, and configuration-as-code; establish engineering standards and conduct rigorous reviews for reliability, security, and maintainability
  • Reduce operational toil through automation and CI/CD-driven configuration management, including infrastructure-as-code patterns (e.g., Terraform)
  • Lead production health and operational excellence for the monitoring platform, including incident triage, root-cause analysis, and corrective/preventative actions
  • Partner with infrastructure, application, and SRE teams to align platform capabilities to SLIs/SLOs, operational readiness, and continuous improvement objectives
  • Advance AIOps capabilities (e.g., correlation, anomaly detection, guided remediation) through experimentation, proofs of concept, and governed rollouts, while mentoring junior engineers and fostering a strong engineering culture

Benefits

  • competitive total rewards package, including a base salary determined by role, experience, skill set, and location
  • commission-based pay and/or discretionary incentive compensation, paid in cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions
  • comprehensive health care coverage
  • on-site health and wellness centers
  • a retirement savings plan
  • backup childcare
  • tuition reimbursement
  • mental health support
  • financial coaching

What This Job Offers

Job Type

Full-time

Career Level

Senior

Education Level

No Education Listed

Number of Employees

5,001-10,000 employees
