Sr. Monitoring & Observability Engineer, Los Angeles (On-Site)

Data Analysis Inc.•Los Angeles, CA

6d•$115,000 - $125,000•Onsite

About The Position

The Senior Monitoring & Observability Engineer (TOC) is a senior-level infrastructure and reliability engineering role responsible for designing, implementing, optimizing, and supporting enterprise monitoring and observability platforms across networks, systems, cloud environments, and critical business applications. The position combines observability engineering, cloud and infrastructure operations, automation, and incident management responsibilities, including ownership of monitoring tools such as Datadog, Splunk, SolarWinds, Dynatrace, AppDynamics, Nagios, PRTG, and Zabbix. Acting as a technical escalation point, the role partners closely with Infrastructure, Security, DevOps, and IT Operations teams to improve system reliability, alert quality, operational efficiency, and service availability while supporting SRE-aligned practices such as automation, root cause analysis, SLIs/SLOs, and continuous operational improvement.

Requirements

Bachelor’s degree in IT, Computer Science, Networking, or a related field (or equivalent work experience).
3+ years of experience in IT operations, network monitoring, or system administration, with hands-on experience implementing and tuning enterprise monitoring/observability platforms.
Demonstrated experience building or implementing (not just using) one or more of: Datadog, Dynatrace, AppDynamics, Splunk, SolarWinds Orion, Orion DPA, Nagios, PRTG, or Zabbix.
Advanced understanding of network protocols (TCP/IP, BGP, OSPF, VLANs, VPN, DNS, DHCP).
Proficiency in Windows/Linux environments and at least one major cloud platform (AWS, Azure, or GCP).
Familiarity with ITIL best practices for incident, problem, and change management.
Scripting and automation experience using Python, PowerShell, Bash, Ansible, or similar tools.
Working knowledge of cybersecurity best practices, firewall configurations, and SIEM tools.
Strong leadership, communication, and collaboration skills, including the ability to translate monitoring data into clear operational action across cross-functional teams.
Ability to work in a high-stress, dynamic environment while handling multiple high-priority incidents.

Nice To Haves

Hands-on experience designing, implementing, or significantly maturing monitoring and observability platforms in an enterprise environment, going beyond acknowledging alerts or interpreting dashboards.
Strong understanding of the relationship between infrastructure, networking, systems, cloud platforms, logs, metrics, traces, alerts, dashboards, and incident workflows.
Experience in or exposure to a Site Reliability Engineering (SRE) environment, including reliability practices, automation, observability, service health, SLIs/SLOs, error budgets, and post-incident improvement.
Experience reducing alert noise and improving signal quality through threshold tuning, deduplication, correlation, and runbook-driven response.
Comfort working with APIs, configuration-as-code, and CI/CD pipelines as they relate to monitoring deployment and management.
CompTIA A+, Network+, or Security+
Microsoft Fundamentals Certifications (Azure, M365, or Windows Server)
AWS Cloud Practitioner or Azure Fundamentals
ITIL Foundation certification (preferred for Incident Management responsibilities)
Vendor certifications in Datadog, Splunk, Dynatrace, AppDynamics, or SolarWinds are a plus

Responsibilities

Monitor and manage IT infrastructure, network systems, and business applications using enterprise monitoring tools, aligned with the TOC Sr. Engineer scope.
Serve as the first point of escalation for TOC Engineers, providing advanced troubleshooting, guidance, and root cause analysis.
Lead or support incident response, root cause analysis, escalation, and post-incident review processes; ensure issues are properly classified, escalated, and resolved efficiently.
Play a key role in ITIL Incident, Problem, and Change Management processes.
Build and tune monitoring and observability tooling (instrumentation, integrations, dashboards, alert logic, synthetic checks, log pipelines, and APM configuration) rather than only consuming it.
Develop and implement automation scripts and tooling to improve operational efficiency, alerting quality, and response times (Python, PowerShell, Bash, Ansible, or similar).
Analyze system logs, network traffic, event data, and performance metrics to identify trends, reduce alert noise, and prevent outages.
Document monitoring standards, troubleshooting steps, system configurations, dashboards, and runbooks for knowledge sharing.
Collaborate with IT, Security, and DevOps teams to maintain system reliability and security posture.
Work with vendors and service providers to resolve tool, platform, and infrastructure issues.
Participate in 24/7 on-call rotations and provide leadership during major incidents, helping coordinate cross-functional resolution efforts.
Mentor junior TOC/NOC engineers on monitoring tools, dashboards, alert handling, and incident response practices.