About The Position

This role focuses on responding to and resolving complex client and user operational issues, working with the Enterprise Applications team to define requirements and specifications, and designing or modifying solution strategies to meet SLA requirements. The position involves application configuration, extension, integration, and performance tuning, as well as handling escalated issues and improving standard operating procedures. A significant part of the role is designing, implementing, and managing monitoring and observability solutions across servers, applications, databases, network devices, and cloud environments. This includes configuring observability platforms, developing dashboards, setting up alerting, and monitoring performance against SLAs. The engineer will analyze telemetry data for trends and risks, support incident and problem management, integrate monitoring tools with ITSM and ticketing platforms, and reduce alert noise. The engineer will also support cloud migrations by maintaining observability coverage, collaborate with DevOps and SRE teams, perform capacity planning, automate monitoring processes, maintain operational documentation, ensure compliance, and contribute to continuous improvement initiatives such as AIOps and self-healing.

Requirements

  • BA/BS degree and 4–6 years of relevant experience, or equivalent.
  • Experience supporting and configuring enterprise applications in complex environments.
  • Proficiency with observability/monitoring platforms (e.g., Datadog, Dynatrace, New Relic, Splunk, Elastic, Prometheus, Grafana).
  • Hands-on experience with metrics, logs, traces, dashboards, and alerting systems.
  • Strong scripting/automation skills (Python, PowerShell, Bash, or similar).
  • Experience with Infrastructure as Code tools (Terraform, Ansible, CloudFormation).
  • Working knowledge of AWS, Azure, or GCP monitoring and logging services.
  • Experience integrating monitoring tools with ITSM and event management platforms.
  • Ability to perform root cause analysis, trend analysis, and capacity planning using telemetry data.
  • Experience embedding observability into CI/CD pipelines and deployment workflows.
  • Ability to design and optimize alerting rules, thresholds, and correlation logic.
  • Understanding of logging/monitoring compliance, audit, and security requirements.
  • Experience with automation, AIOps, and self-healing or event-driven operations.

Responsibilities

  • Respond to and resolve higher-level, more complex client and user operational issues.
  • Work with Enterprise Applications team members to define end-user requirements, functional specifications, and deliverables.
  • Design new and modify existing solution strategies to ensure achievement of SLA requirements.
  • Maintain all technical deliverables including technical design specifications and configuration changes.
  • Perform application configuration, extension, integration and performance tuning.
  • Handle issues escalated from less experienced team members, clients and other stakeholders.
  • Help design, plan and implement enhancements to standard operating procedures.
  • Design, implement, and manage monitoring and observability solutions for servers, applications, databases, network devices, and cloud environments.
  • Configure and maintain observability platforms for metrics, logs, traces, dashboards, and alerts.
  • Develop dashboards and reports to provide visibility into system health, service performance, availability, and operational KPIs.
  • Set up and fine-tune alerting mechanisms, thresholds, escalation rules, and notification workflows to ensure timely incident detection and response.
  • Monitor infrastructure and application performance to ensure adherence to SLA, SLO, and availability targets.
  • Perform analysis of logs, alerts, and telemetry data to identify trends, anomalies, and potential service risks.
  • Support incident management, problem management, and root cause analysis by providing actionable monitoring insights.
  • Integrate monitoring tools with ITSM, event management, automation, and ticketing platforms.
  • Reduce alert noise through alert optimization, event correlation, and dependency mapping.
  • Support cloud and platform migrations by ensuring observability coverage during transformation initiatives.
  • Collaborate with DevOps, SRE, and engineering teams to embed observability into system architecture and deployment pipelines.
  • Perform capacity planning, utilization analysis, and trend forecasting for infrastructure and application environments.
  • Automate monitoring configuration, maintenance, and deployment using scripting or Infrastructure as Code practices.
  • Maintain operational documentation, runbooks, dashboard standards, and monitoring governance processes.
  • Ensure compliance with logging, monitoring, audit, and operational security requirements.
  • Contribute to continuous improvement initiatives including AIOps, self-healing, automation, and proactive event management.