Ensure the operational integrity, availability, and performance of mission-critical systems. Manage technical incidents, troubleshoot recurring issues, and implement permanent solutions to maintain system stability. Collaborate with cross-functional teams to resolve incidents efficiently and improve system resiliency through proactive monitoring and automation.

Handle the identification, triage, and resolution of medium-to-high priority incidents with minimal supervision, keeping the impact on business operations as small as possible. Collaborate with development teams, business partners, and other stakeholders to diagnose and resolve technical issues, implementing long-term fixes to prevent incident recurrence. Use monitoring tools (e.g., Splunk, Dynatrace, CloudWatch) to detect performance issues and execute corrective actions promptly. Enhance system observability to proactively detect issues and improve overall system performance and stability. Develop and maintain automation scripts to streamline routine production support tasks, reducing manual intervention. Implement automation strategies to improve production stability and minimize downtime. Maintain clear and detailed documentation of troubleshooting procedures, contributing to the shared knowledge base. Assist in improving the incident, problem, and change management processes, following ITIL best practices. Participate in root cause analysis and suggest process improvements to enhance system stability and performance. Collaborate with cross-functional teams to resolve recurring production support issues and optimize workflows. Actively mentor junior support engineers, fostering technical growth within the team.
Job Type
Full-time
Career Level
Senior