The purpose of the role is to apply software engineering techniques, automation, and incident response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology. This involves proactive monitoring, maintenance, and capacity planning to keep systems and services available and performing well. It also includes responding to, resolving, and analyzing system outages and disruptions, and implementing measures to prevent similar incidents from recurring. The role involves developing tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience. Additionally, it requires monitoring and optimizing system performance and resource usage, identifying and addressing bottlenecks, and applying best practices for performance tuning. Collaborating with development teams to integrate reliability, scalability, and performance best practices into the software development lifecycle, and working closely with other teams to ensure smooth and efficient operations, is key. Staying informed of industry technology trends and innovations, and actively contributing to the organization's technology communities to foster a culture of technical excellence and growth, is also expected.
Job Type
Full-time
Career Level
Manager
Education Level
No Education Listed