Sr Lead Infrastructure Engineer — Infrastructure Monitoring

JPMorgan Chase & Co. • Wilmington, DE

About The Position

As a Sr Lead Infrastructure Engineer, Infrastructure Monitoring, at JPMorgan Chase within the Corporate Technology Enterprise Observability Platforms team, you will lead the modernization of infrastructure monitoring into a strategic, secure, scalable, and automation-enabled observability platform, strengthening firmwide resilience and delivering trusted operational insights. You will be a hands-on technical contributor who drives adoption and partners across infrastructure, application, and SRE teams to improve telemetry collection and signal quality, modernize event-to-incident workflows, and enable AIOps-driven reliability improvements aligned to business objectives.

Requirements

Formal training or certification in infrastructure engineering concepts and 5+ years of applied experience.
Demonstrated experience owning/operating enterprise-scale monitoring/observability platforms in production, and designing & delivering monitoring solutions across large Linux and Windows estates.
Strong expertise with enterprise-grade operating systems (Windows Server and/or Enterprise Linux), including secure configuration, patching, and vulnerability remediation in regulated environments.
Strong understanding of telemetry concepts (metrics, logs, traces, events) and practical OpenTelemetry collection and integration patterns.
Strong infrastructure knowledge across compute, networking, storage, databases, integration patterns, scaling, resiliency, and performance.
Advanced proficiency in automation and scripting (Python, Ansible, PowerShell, Bash) with strong use of CI/CD for controlled change and safe rollout.
Hands-on experience with infrastructure-as-code for repeatable, governed provisioning and deployments (e.g., Terraform).
Extensive experience operating in hybrid infrastructure environments spanning enterprise on-prem platforms and public/private cloud, including migration enablement and cloud operational patterns.
Hands-on experience with data stores such as MS SQL Server, Oracle, and Cassandra and/or cloud-native databases.
Strong collaboration skills, with the ability to partner effectively across infrastructure, application, and SRE teams to align observability capabilities.

Nice To Haves

Experience operating large-scale enterprise monitoring platforms (e.g., Tivoli, SMARTS, IBM Instana, DX NetOps, ITNM, Netcool Suite) with deep operational ownership.
Experience with modern observability ecosystems including Splunk, Dynatrace, Grafana, Prometheus, and multi-tool interoperability patterns.
Experience with Kubernetes (e.g., EKS) for container orchestration and production operations.
Experience implementing AIOps workflows such as noise reduction, anomaly detection, probable root-cause analysis, and guided remediation with appropriate governance.
Experience with topology-driven monitoring and event correlation in large, distributed infrastructure environments.
Experience defining and operationalizing SLOs, error budgets, and reliability metrics across platform services.
Experience with network monitoring.

Responsibilities

Lead the modernization of the infrastructure monitoring platform, defining target-state architecture and roadmap while balancing near-term delivery with long-term resiliency, scalability, security, and usability goals
Engineer, operate, and continuously improve enterprise monitoring platforms to meet availability, performance, scale, and security requirements
Own platform design and architecture for telemetry collection and integration across metrics, logs, events, and traces, including OpenTelemetry patterns where applicable
Drive large-scale enterprise onboarding across Linux, Windows, and complex network estates, including lifecycle management, versioning/upgrade strategies, and governance controls
Standardize onboarding patterns (agents/collectors, configuration baselines, dashboards, alerting, metadata, and runbooks) to enable safe, repeatable adoption
Improve signal quality and actionability through baselining, threshold strategy, noise reduction, enrichment, and topology/context alignment to reduce MTTR and operational overhead
Develop and maintain production-grade automation, services, and configuration-as-code; establish engineering standards and conduct rigorous reviews for reliability, security, and maintainability
Reduce operational toil through automation and CI/CD-driven configuration management, including infrastructure-as-code patterns (e.g., Terraform)
Lead production health and operational excellence for the monitoring platform, including incident triage, root-cause analysis, and corrective/preventative actions
Partner with infrastructure, application, and SRE teams to align platform capabilities to SLIs/SLOs, operational readiness, and continuous improvement objectives
Advance AIOps capabilities (e.g., correlation, anomaly detection, guided remediation) through experimentation, proofs of concept, and governed rollouts, while mentoring junior engineers and fostering a strong engineering culture

Benefits

Competitive total rewards package, including base salary determined by role, experience, skill set, and location
Commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions
Comprehensive health care coverage
On-site health and wellness centers
A retirement savings plan
Backup childcare
Tuition reimbursement
Mental health support
Financial coaching
