About The Position

This position involves building automations using programming languages to reduce manual effort, defining and collecting system metrics, and designing visualizations of system health. The role requires responding to incidents of system instability or unavailability by diagnosing problems, writing software to resolve issues, and performing Root Cause Analyses. Responsibilities also include analyzing system logs to ensure application stability and troubleshooting system and network issues to find potential areas for improvement. Candidates need either a Master's degree with two years of experience or a Bachelor's degree with five years of experience in a related field, along with the technical skills listed under Requirements.

Requirements

  • Master's degree in Computer Engineering, Computer Science, Electronic Engineering, Computer Information Systems, or related field of study plus two (2) years of experience in the job offered or as Site Reliability Engineer, Software Engineer, Software Developer, or related occupation.
  • Alternatively, a Bachelor's degree in Computer Engineering, Computer Science, Electronic Engineering, Computer Information Systems, or related field of study plus five (5) years of experience in the job offered or as Site Reliability Engineer, Software Engineer, Software Developer, or related occupation.
  • Experience with monitoring platform and application health, including CPU, memory, disk capacity, and API responses using Dynatrace or Datadog.
  • Experience with logging queries and performing analysis for incident troubleshooting using Elasticsearch or AWS CloudWatch.
  • Experience with managing incidents and conducting blameless post-mortems.
  • Experience with designing and developing APIs to support data collection or task automation using Python, Java, Spring Boot, or C#.NET.
  • Experience with automating manual tasks using Microsoft PowerShell or Bash.
  • Experience with implementing observability using white-box and black-box monitoring.
  • Experience with managing incidents using service level objective alerting.
  • Experience with performing telemetry collection for observability using Dynatrace, Prometheus, Datadog, AWS CloudWatch, and Splunk.
  • Experience with developing dashboards to display system, application, and business metrics using Grafana or Splunk.
  • Experience with implementing continuous integration and delivery using Jenkins and Terraform.
  • Experience with managing containers and container orchestration using ECS, Kubernetes, and Docker.
  • Experience with troubleshooting Transmission Control Protocol, Internet Protocol, API communications, and client-server computing to diagnose and resolve application and system failures using Dynatrace and Wireshark.

Responsibilities

  • Build automations using programming languages to reduce manual effort.
  • Define and collect metrics from systems and applications using industry-standard applications or custom-built processes.
  • Design and develop visualizations of system health.
  • Respond to incidents of system instability or unavailability by diagnosing problems, writing software to resolve issues, and performing Root Cause Analyses to determine the reason for an outage.
  • Perform system logging analysis to ensure application stability.
  • Troubleshoot system and network issues to find potential areas for improvement.