Site Reliability Engineer

McKesson • Posted 1 day ago

Full-time • Mid Level

Remote • Overland Park, KS

1,001-5,000 employees

McKesson is an impact-driven, Fortune 10 company that touches virtually every aspect of healthcare. We are known for delivering insights, products, and services that make quality care more accessible and affordable. Here, we focus on the health, happiness, and well-being of you and those we serve – we care. What you do at McKesson matters. We foster a culture where you can grow, make an impact, and are empowered to bring new ideas. Together, we thrive as we shape the future of health for patients, our communities, and our people. If you want to be part of tomorrow’s health today, we want to hear from you.

Rx Savings Solutions (RxSS), part of McKesson’s CoverMyMeds business segment, is seeking a talented Site Reliability Engineer (SRE) to join our team! In this role, you will be instrumental in ensuring the reliability, scalability, and performance of our critical healthcare technology systems. You will apply software engineering principles to operations, focusing on automation, monitoring, and proactive problem-solving to maintain high availability and deliver exceptional user experiences.

Our preferred candidate will reside in Columbus, OH, or one of our other hub locations: Overland Park, KS; Irving, TX; or Atlanta, GA. The position allows for primarily working from home, with occasional in-office time. We may consider a well-qualified candidate who is not located in one of the above hub areas.

At this time, we are not able to offer sponsorship for employment visas. We're unable to consider individuals currently on H1B, F-1 OPT, STEM OPT, or any other visa status that would require future sponsorship. Candidates must be authorized to work in the United States on a permanent basis without the need for current or future sponsorship.

System Reliability & Performance: Design, implement, and maintain robust and scalable infrastructure and applications to ensure high availability, performance, and disaster recovery capabilities
Automation & Tooling: Develop and implement automation scripts, tools, and processes to streamline operational tasks, reduce manual effort, and improve efficiency across the software development lifecycle
Monitoring & Alerting: Establish and maintain comprehensive monitoring, alerting, and logging systems to proactively identify and diagnose issues, understand system behavior, and track key performance indicators
Incident Response & Post-Mortem: Participate in on-call rotations, respond to and resolve critical incidents, and conduct thorough post-mortems to identify root causes and implement preventative measures
Capacity Planning & Optimization: Collaborate with development teams to analyze system capacity, forecast future needs, and optimize resource utilization to support business growth
Collaboration & Mentorship: Work closely with software engineers, product managers, and other SREs to promote a culture of reliability, share best practices, and contribute to continuous improvement
Documentation: Create and maintain clear and concise documentation for systems, processes, and incident runbooks
Security: Contribute to the implementation and enforcement of security best practices within our infrastructure and applications

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience), and 2+ years of experience in a Site Reliability Engineering, DevOps, or closely related software engineering role
Strong proficiency in at least one scripting language (e.g., Python, Go, Ruby, Bash) for automation and tool development
Hands-on experience with cloud computing platforms (e.g., AWS, Azure, GCP). AWS experience is highly preferred
Experience with container technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes)
Familiarity with Continuous Integration and Continuous Delivery (CI/CD) pipelines and tools
Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana, Splunk)
Strong understanding of Linux/Unix operating systems
Fundamental understanding of networking concepts (TCP/IP, DNS, HTTP, Load Balancing)
Excellent analytical and problem-solving skills with a proactive approach to identifying and resolving complex technical issues
Strong verbal and written communication skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences
