Senior System Engineering

AT&T•Plano, TX

6d•$163,400 - $215,800•Onsite

About The Position

The Senior System Engineering role is responsible for providing 24x7 Tier 1 support for agent-facing applications including Salesforce, Salesforce Marketing Cloud, Mulesoft, OPUS, and Customer Account Lookup & Management (CALM). This position involves managing escalated issues, incidents, and outages, triaging them, and driving prompt resolution. The role requires providing timely visibility and status updates on these issues to leadership, business partners, and other key stakeholders. A significant aspect of the role involves Site Reliability Engineering (SRE) tasks, such as developing application knowledge bases, creating run books, and enhancing application observability through alerts, monitoring, and dashboards for proactive incident and problem detection. The Senior System Engineer will also triage incidents, assist Tier 2 in conducting blameless post-mortems, and collaborate with Release Management to identify and mitigate risks associated with production changes. Close work with Product Development and Tier 2 SRE teams is essential for knowledge transfer regarding system changes. The role also focuses on optimizing the Tier 1 on-call process and incident response workflow, including alert rules, communication methods, and response plans. Providing metrics and status reports, establishing processes for data gathering and reporting, and staying current on feature development to ensure system reliability are key duties. Additionally, the role involves assisting in the development and maintenance of technology operations and support Standard Operating Procedures (SOPs) and T1 documentation based on industry best practices. The position requires technical leadership with strong communication skills and the ability to foster a self-motivated team, conducting rigorous due diligence on all plans.

Requirements

Requires a Bachelor’s degree, or foreign equivalent degree, in Electronics Engineering, Computer Science, or Engineering
2 years of experience in the job offered, or 2 years of experience in a related occupation, demonstrating leadership and building cross-organizational consensus
Building and managing high-performing teams
Incident management, incident response, and managing a Tier 1 Production Operations team
Supporting large-scale production applications (ERP, CRM) in a leadership capacity
Salesforce Development (Apex, Visualforce, Lightning), Salesforce Sales Cloud & Service Cloud, MuleSoft, Dynatrace, and ELK (Elasticsearch, Logstash, Kibana) for monitoring and logging
Customer Experience Analytics & Session-Based tools (Quantum Metric and Tealeaf)
Synthetic Monitoring tools (Catchpoint)
Application Performance Monitoring tools (Dynatrace, AppDynamics, Introscope, etc.)
Kibana and Grafana visualization tools
Creation of dashboards in Dynatrace, ELK, and Grafana

Responsibilities

Provide 24x7 Tier 1 support for agent-facing applications: Salesforce, Salesforce Marketing Cloud, MuleSoft, OPUS, and Customer Account Lookup & Management (CALM).
Manage escalated issues, incidents, and outages, triaging them and driving prompt resolution.
Provide prompt visibility and status of escalated issues, incidents and outages to leadership, business partners and other key stakeholders.
Own Site Reliability Engineering (SRE) tasks: developing a functional and technical knowledge base for the application, creating run books, building application observability through alerts, monitoring, and dashboards that enable proactive incident and problem detection, triaging incidents, and helping Tier 2 conduct blameless post-mortems (after-action reviews).
Work with Release Management to identify and mitigate risks in upcoming production changes.
Work closely with Product Development and Tier 2 SRE teams to ensure knowledge transfer on system changes well before those changes are operationalized.
Optimize the overall T1 on-call process and incident response workflow, including alert rules, communication methods and incident response plans.
Provide metrics and status reports and review with leadership and stakeholder communities.
Establish processes for metrics gathering, reporting, and communication.
Stay current on feature development and how it could affect the system’s overall reliability.
Assist in developing, publishing and continually updating technology operations and support Standard Operating Procedures and detailed T1 documentation based on industry best practices.
Provide technical leadership with strong communication skills and the ability to build and organize a self-motivated team.
Conduct rigorous due diligence on all plans.

Benefits

Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law require may be available but is unprotected
Adoption Reimbursement
Disability Benefits (short term and long term)
Life and Accidental Death Insurance
Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
Employee Assistance Programs (EAP)
Extensive employee wellness programs
Employee discounts of up to 50% on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available), and AT&T phone