Lead the response to production issues, including identifying and troubleshooting problems and implementing immediate fixes. Ensure minimal downtime and adherence to service level agreements (SLAs). Build alerting, monitoring, and dashboards that identify problems proactively. Utilize strong analytical, technical, and functional skills to diagnose and resolve complex issues within production environments, with a focus on immediate impact mitigation. Work with dev teams to implement long-term solutions to prevent recurrence of incidents. Create and maintain comprehensive documentation for system architecture, configuration, deployment procedures, and troubleshooting guides. Develop and maintain scripts and automation tools to streamline operations, deployment processes, and repetitive tasks. Focus on automating recovery processes and routine maintenance tasks to improve system reliability and efficiency. Work with development teams to identify and provide the non-functional requirements and acceptance criteria during design and development, and ensure that these are met before features move to production. Monitor application performance using APM (Application Performance Management) tools including Dynatrace, AppDynamics, and ELK. Identify bottlenecks and work with dev teams to optimize application performance through code improvements, configuration tuning, and resource optimization. Work with dev teams to define Non-Functional Requirements including reliability, performance, scalability, and application logging for observability. Define SLIs/SLOs and error budgets, with a focus on automation. Work with dev, architecture, and quality engineering teams to identify and document failure patterns as lessons learnt from incidents, and follow up to implement remediations that make the application resilient. Monitor system usage patterns and perform capacity planning to ensure scalability and reliability of applications and services.
Work on proactive problem detection, trend and pattern analysis, impact assessment, and functional analysis of problems. Manage escalated issues, tracking and driving prompt resolution. Provide metrics and status reports and review them with leadership and stakeholder communities. Establish processes for metrics gathering, reporting, and communication. Provide prompt visibility into the status of escalated issues, incidents, and outages to leadership, business partners, and other key stakeholders. Apply strong verbal and written communication skills. Work closely with Product Development teams to ensure knowledge transfer on system changes well in advance of those changes being operationalized. Provide 24x7 on-call support for agent-facing applications, including home-grown J2EE apps as well as SaaS platform apps including Salesforce, Salesforce Marketing Cloud, and MuleSoft. Architect and develop web applications. Utilize observability tools including Dynatrace, AppDynamics, Splunk, ELK, MuleSoft Anypoint, Quantum Metric, and Catchpoint. Utilize integration technologies, API gateways, MuleSoft, and WebLogic. Utilize object-oriented programming languages including Java, J2EE technologies, JavaScript, and frameworks including Spring. Apply experience with automation tools and scripting languages including Python and Shell. Apply understanding of containerization including Docker and Kubernetes, and of cloud services including Azure. Apply knowledge of DevOps practices and tools including CI/CD pipelines, Git, and Jenkins. Apply understanding of network protocols, load balancing, and security principles. Apply experience writing SQL database queries. Build Linux shell scripts on demand.
Job Type
Full-time
Career Level
Senior