Observability Engineer (Prometheus / Grafana / Datadog)

Bright Vision Technologies
Frisco, TX
Remote

About The Position

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. As we continue to grow, we’re looking for a skilled Observability Engineer (Prometheus / Grafana / Datadog) to join our dynamic team and contribute to our mission of transforming business processes through technology. This is a fantastic opportunity to join an established and well-respected organization offering tremendous career growth potential.

Requirements

  • Bachelor’s degree in Computer Science or a related field.
  • Five or more years of experience in SRE, platform engineering, or observability roles.
  • Deep hands-on experience with Prometheus, Grafana, and at least one major commercial observability platform such as Datadog, New Relic, or Splunk.
  • Strong understanding of OpenTelemetry, distributed tracing, and structured logging.
  • Proficiency in at least one general-purpose language such as Go, Python, or Java.
  • Experience operating high-cardinality, high-throughput metrics and log pipelines.
  • Strong understanding of SLOs, error budgets, and SRE principles.
  • Experience integrating observability with CI/CD and incident management tooling.
  • Solid grasp of Linux internals, networking, and container platforms.
  • Excellent communication and collaboration skills.

Nice To Haves

  • Experience with Thanos, Mimir, Cortex, Loki, or Tempo at scale.
  • Contributions to OpenTelemetry or observability open-source projects.
  • Familiarity with eBPF-based observability tooling.
  • Experience driving observability cost optimization initiatives.
  • Exposure to regulated environments with audit-grade logging requirements.

Responsibilities

  • Design and operate enterprise-grade observability platforms covering metrics, logs, traces, events, and synthetic monitoring.
  • Architect Prometheus / Thanos / Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog deployments for high availability and scale.
  • Develop standards for service instrumentation, including OpenTelemetry adoption, metric naming, label cardinality, and structured logging conventions.
  • Define and enforce SLOs, SLIs, and error budgets, and build the dashboards and alerts that operationalize them.
  • Build alerting strategies that minimize noise, surface actionable signals, and integrate cleanly with on-call workflows in PagerDuty, Opsgenie, or similar tools.
  • Operate large-scale time-series and log storage platforms, balancing retention, query performance, and cost.
  • Design distributed tracing pipelines and help teams use traces to diagnose latency and reliability issues.
  • Develop self-service tooling, paved-road libraries, and templates that make adoption of observability standards easy for product teams.
  • Drive cost management and label-cardinality discipline across the observability estate.
  • Lead incident response readiness improvements through better dashboards, alerting hygiene, and post-incident analysis tooling.
  • Partner with SRE and platform teams to integrate observability into deployment pipelines, canary analysis, and progressive delivery workflows.
  • Evaluate and recommend observability vendors and open-source tools based on cost, capability, and operational maturity.
  • Mentor engineering teams on observability fundamentals, debugging techniques, and SLO-driven operations.
  • Maintain documentation, onboarding guides, and runbooks for the observability platform.

Benefits

  • Competitive base salary commensurate with experience, plus benefits.