Staff Software Engineer - DevOps

WellSky•Overland Park, KS

Remote

About The Position

The Staff Software Engineer - DevOps is responsible for all stages of the software development lifecycle, using a variety of technologies and tools to build impactful software solutions. The scope of this job includes building and optimizing comprehensive solutions that prioritize end-user efficiency and experience.

Requirements

Bachelor’s degree or relevant work experience
8-12 years of relevant work experience
Proven experience in a Site Reliability Engineer role
Strong expertise in Kubernetes and container management
Experience with cloud platforms, preferably Google Cloud Platform (GCP)
Familiarity with observability and APM tools (e.g., New Relic, OpenTelemetry)
Proficiency in infrastructure as code (e.g., Terraform)
Solid understanding of CI/CD pipelines and deployment automation
Willing to work additional or irregular hours as needed
Must work in accordance with applicable security policies and procedures to safeguard company and client information
Must be able to sit and view a computer screen for extended periods of time

Nice To Haves

Experience with Azure DevOps Pipelines and Argo CD
Strong networking fundamentals, including experience with Istio and service mesh technologies
Healthcare industry experience

Responsibilities

Lead the design and architecture of major systems and services, and ensure software solutions are scalable, reliable, maintainable, and aligned with business needs.
Collaborate with solution managers, engineers, data scientists, and other stakeholders to define and prioritize technical requirements that meet client needs and business objectives.
Collaborate with teams to ensure sustained quality and reliability of our software solutions, and act as a go-to expert by identifying and resolving complex, high-priority issues in both development and production environments.
Actively contribute to code reviews, provide constructive feedback on design and implementation, and provide technical guidance to other engineers to elevate skills, productivity, and overall effectiveness.
Drive innovation by evaluating and implementing new technologies, methodologies, and AI capabilities that improve team efficiency, software performance, and development processes.
Ensure code meets functional and performance requirements, advocate for high-quality software, and ensure rigorous testing processes, including automated unit tests, integration tests, and other testing frameworks.
Leverage AI tools and platforms as an integral part of daily responsibilities to enhance decision-making, streamline workflows, and drive data-informed outcomes.
Perform other job duties as assigned.
Ensure the reliability, availability, and performance of our systems and services.
Work closely with various teams to build and maintain scalable, efficient, and resilient infrastructure.
Incident management: lead the response to system outages and incidents, ensuring quick resolution and minimal impact on end users.
Conduct post-incident reviews and implement improvements to prevent recurrence.
Monitoring and alerting: design, implement, and maintain monitoring and alerting systems using tools such as New Relic, Grafana, and the ELK stack to ensure system health and performance.