Platform Engineer - Observability

Blue Cross and Blue Shield Association
Des Moines, IA
78d

About The Position

The Observability Platform Engineer designs, builds, and maintains the observability platform tools and frameworks that enable development and operations teams to monitor and improve the performance, availability, and reliability of systems. The role involves implementing systems that monitor and analyze the performance and health of software applications and infrastructure to ensure high availability and reliability. The engineer collaborates closely with development, site reliability engineering, DevOps, and infrastructure teams to deliver a seamless observability ecosystem. Key responsibilities include architecting observability platforms, integrating monitoring tools into software pipelines, ensuring visibility into system health, reducing mean time to detection (MTTD), and promoting a culture of proactive monitoring and reliability engineering.

Requirements

  • Bachelor's degree or directly applicable work experience.
  • Minimum 7 years of experience, including any combination of development experience (e.g., Angular 2 or newer, Node.js, TypeScript, C#, .NET, Java, SQL).
  • Minimum 4 years of experience in IT infrastructure, architecture design, and operations.
  • Proven ability to adapt when experiencing major changes in work tasks or work environment.
  • Informal leadership experience typically gained through leading projects.
  • Demonstrated problem-solving and troubleshooting skills.
  • Demonstrated communication skills: verbal and written.

Nice To Haves

  • 3-5 years of experience in Site Reliability Engineering, DevOps, or Observability/Monitoring engineering roles.
  • Proven experience building or administering observability platforms in production environments.
  • Track record of improving system reliability and reducing mean time to resolution (MTTR).
  • Hands-on experience with one or more observability platforms: Dynatrace, Prometheus, Grafana, OpenTelemetry, Elastic Stack, Splunk, Datadog, New Relic, AppDynamics, Honeycomb.
  • Strong knowledge of observability concepts: metrics, logs, traces, SLOs/SLIs, error budgets.
  • Experience working within an Agile team environment.
  • Experience deploying and maintaining OpenTelemetry-based observability pipelines.
  • Prior experience working in highly regulated environments with compliance observability needs.
  • Contributions to observability open-source projects.
  • Familiarity with chaos engineering practices to validate monitoring and resilience.
  • Certifications from AWS, Microsoft Azure, or Google Cloud.
  • Demonstrated experience coaching/mentoring others.
  • Proficiency in programming or scripting languages (Python, Go, Java, Bash, etc.) for observability automation.
  • Experience with containerization and orchestration platforms (Docker, Kubernetes).
  • Deep knowledge of cloud platforms (AWS, Azure, GCP), observability/monitoring services, operating systems (Windows/Linux), networking, and containerization.
  • Strong understanding of distributed systems, microservices, and cloud-native architectures.
  • Proficiency in CI/CD pipelines and how observability integrates into DevOps workflows.
  • Knowledge of incident management and on-call practices.
  • Experience supporting observability and monitoring for artificial intelligence (AI) agents.

Responsibilities

  • Design, build, and maintain observability platforms with reusability across services in mind.
  • Develop scalable, automated pipelines for ingesting, transforming, and visualizing telemetry data.
  • Integrate observability tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, Datadog, New Relic, OpenTelemetry) with existing infrastructure and applications.
  • Enable root cause analysis through correlation of metrics, logs, and traces.
  • Analyze telemetry data to identify performance bottlenecks and optimize resource allocation for improved efficiency.
  • Define SLIs, SLOs, and error budgets with stakeholders for critical services.
  • Improve incident response by enhancing monitoring dashboards, alerts, and automated notifications.