Lead Software Engineer - Observability

Wellmark - Des Moines, IA
Remote

About The Position

You will be responsible for designing, building, and maintaining observability platform tools and frameworks that enable development and operations teams to monitor and improve the performance, availability, and reliability of systems. This role involves implementing systems that monitor and analyze the performance and health of software applications and infrastructure, ensuring high availability and reliability. The engineer will collaborate closely with development, site reliability engineering, DevOps, and infrastructure teams to deliver a seamless observability ecosystem. Key responsibilities include architecting observability platforms, integrating monitoring tools into software pipelines, ensuring system health visibility, reducing mean time to detection (MTTD), and promoting a culture of proactive monitoring and reliability engineering.

Requirements

  • Bachelor’s degree in Computer Science, MIS, or a related field of study and at least 5 years of development experience (e.g., Angular, NodeJS, TypeScript, C++, .NET, Java, SQL) OR 9 years of related and applicable experience.
  • Strong analytical problem-solving skills.
  • Accuracy and high attention to detail.
  • Previous experience troubleshooting and developing creative technical solutions.
  • Ability to provide innovative solutions to complex issues.
  • Demonstrated experience in software development lifecycle methodologies.
  • Demonstrated ability to communicate with and coach/mentor team members, while setting an example by maintaining a positive attitude, staying calm under pressure, being approachable and respectful, and taking responsibility for failures.
  • Big picture thinker with the ability to translate the value of the Wellmark as a Service (WaaS) strategy to company strategy when making design and development decisions.
  • Demonstrated strong ability to gather information and perform the research needed for root cause analysis and problem definition, and to recommend solution implementation, verification, and ongoing optimization, using data to support recommendations.
  • Demonstrated ability to build relationships to reach outcomes that gain the support and acceptance of all parties.
  • Ability to communicate key information in a timely manner to the appropriate stakeholder audience, adjusting communication style to suit that audience.
  • Ability to thrive in a fast-paced environment with changing priorities.
  • Excellent organizational skills.
  • Strong time management skills with the ability to set and meet established timeframes with little direction, while assuring data and information integrity.
  • Eagerness to learn and stay current on industry trends, with a continuous learning mindset.
  • Ability to collaborate and work as a team to accomplish goals and/or solve problems.
  • Ability to earn trust and respect from peers, leadership, and stakeholders.
  • Ability to learn by actively listening and applying coaching feedback.
  • Ability to lead, support, and work within a diverse development team model including global staffing, crowdsourcing, etc.

Nice To Haves

  • 3–5 years of experience in Site Reliability Engineering, DevOps, or Observability/Monitoring engineering roles.
  • Proven experience building or administering observability platforms in production environments.
  • Track record of improving system reliability and reducing mean time to resolution (MTTR).
  • Hands-on experience with one or more observability platforms: Dynatrace, Prometheus, Grafana, OpenTelemetry, Elastic Stack, Splunk, Datadog, New Relic, AppDynamics, Honeycomb.
  • Strong knowledge of observability concepts: metrics, logs, traces, SLOs/SLIs, error budgets.
  • Experience working within an Agile team environment.
  • Experience deploying and maintaining OpenTelemetry-based observability pipelines.
  • Prior experience working in highly regulated environments with compliance observability needs.
  • Contributions to observability open-source projects.
  • Familiarity with chaos engineering practices to validate monitoring and resilience.
  • Certifications from AWS, Microsoft Azure, or Google Cloud.
  • Demonstrated experience coaching/mentoring others by providing guidance and feedback to help an employee or groups of employees strengthen their knowledge and skills to accomplish a task or solve a problem.
  • Excellent problem-solving skills with a strong analytical mindset.
  • Strong written and verbal communication skills, including the ability to explain complex technical topics to both engineers and business stakeholders.
  • Proven experience with designing technical architecture and keeping abreast of existing and emerging technologies.
  • Experience consulting with stakeholders to understand their needs and provide advice and counsel, and interacting appropriately with others to guide individuals or groups to accomplish work, reach consensus, or take action.
  • Proficiency in programming or scripting languages (Python, Go, Java, Bash, etc.) for observability automation.
  • Experience with containerization and orchestration platforms (Docker, Kubernetes).
  • Deep knowledge of cloud platforms (AWS, Azure, GCP), observability/monitoring services, operating systems (Windows/Linux), networking, and containerization.
  • Strong understanding of distributed systems, microservices, and cloud-native architectures.
  • Proficiency in CI/CD pipelines and how observability integrates into DevOps workflows.
  • Knowledge of incident management and on-call practices.
  • Experience supporting observability and monitoring for artificial intelligence (AI) agents.

Responsibilities

  • Design, build, and maintain observability platforms with reusability across services in mind.
  • Develop scalable, automated pipelines for ingesting, transforming, and visualizing telemetry data.
  • Integrate observability tools (e.g., Dynatrace, Splunk, Prometheus, Grafana, Datadog, New Relic, OpenTelemetry) with existing infrastructure and applications.
  • Enable root cause analysis through correlation of metrics, logs, and traces.
  • Analyze telemetry data to identify performance bottlenecks and optimize resource allocation for improved efficiency.
  • Define SLIs, SLOs, and error budgets with stakeholders for critical services.
  • Improve incident response by enhancing monitoring dashboards, alerts, and automated notifications.