This position requires on-site work in an office setting in Columbus, OH.

As a Site Reliability Engineer (SRE) Level II, you will play a key role in maintaining the availability, scalability, and performance of critical infrastructure and services. You will be responsible for building and automating solutions that enhance system reliability and support continuous delivery. In this role, you will handle more complex operational tasks and incidents, mentor junior SREs, and collaborate with development teams to ensure systems are designed for reliability from the ground up.

Your future duties and responsibilities:

Incident Management:
- Lead troubleshooting efforts for high-impact production issues, providing detailed root cause analysis (RCA) and preventative measures.
- Participate in on-call rotations, acting as an escalation point for Level 1 SREs during major incidents.

Automation & Infrastructure as Code (IaC):
- Develop and maintain automation scripts and infrastructure using tools like Terraform, Ansible, or CloudFormation.
- Implement automation solutions to eliminate manual tasks and improve system reliability, scalability, and performance.

Performance & Scalability:
- Analyze system performance and recommend optimizations for scalability and reliability.
- Support capacity planning efforts by monitoring system metrics, traffic patterns, and usage trends to predict future resource needs.

System Design & Architecture:
- Collaborate with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the start.
- Contribute to architectural decisions, ensuring alignment with best practices in fault tolerance, redundancy, and recovery.

Monitoring & Observability:
- Build and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end users.
- Optimize existing monitoring tools (e.g. Prometheus, Grafana, Datadog, Dynatrace) and build custom dashboards for better visibility into system health.

Security & Compliance:
- Ensure systems and infrastructure are secure, compliant, and aligned with organizational policies and industry best practices.
- Assist with vulnerability management and system patching, and implement security measures to protect the integrity and availability of services.

Continuous Improvement:
- Lead efforts to continuously improve operational processes, tools, and workflows.
- Implement and enforce best practices in deployment, monitoring, and incident management to improve overall system reliability and reduce downtime.
Job Type: Full-time
Career Level: Mid Level