Principal System Engineering

AT&T • Atlanta, GA

1d • Onsite

About The Position

This position requires a minimum of 5 days per week in the office and is available only in the location(s) posted. No relocation is offered. AT&T will not hire any applicants for this position who require employer sponsorship now or in the future.

Join AT&T and reimagine the communications and technologies that connect the world. The Chief Information Office is responsible for advancing information technology performance and delivering solutions with a focus on maximizing ROI, increasing efficiency and enhancing the experience of end users. Guided by experienced leaders, Corporate Systems seamlessly integrate with advanced Technology and Operations to drive our enterprise forward. Our Systems Reliability and Software Delivery teams are unwavering in their commitment to excellence, ensuring every solution is robust and efficient. When you step into a career with AT&T, you won’t just imagine the future; you’ll create it.

In this role, you will focus on understanding why production incidents happen and how to prevent them from recurring. You will analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses. You will turn incident insights into high-quality postmortems and partner with engineering teams to drive corrective actions and long-term improvements. By combining system-level thinking with data, automation, and AI-assisted analysis, you will help shift the organization from reactive response to proactive reliability and incident prevention. You will partner with engineering and software development teams to implement permanent fixes and preventive improvements.

Requirements

Proven experience performing deep root cause analysis (RCA) for production incidents
Strong understanding of end-to-end system architecture (cloud, web apps, APIs, databases, infrastructure)
Hands-on experience with observability tools (logs, metrics, traces)
Ability to identify patterns and drive preventive actions
Experience writing clear, structured postmortems
Ability to analyze operational data using tools, queries, or AI-assisted methods
Strong systems thinking and problem-solving skills
7+ years in Systems Engineering, IT Service Management (ITSM), or Release Management/Change Management (RM/CM)
Background in SRE, Support or QA
Experience with one or more of the following SRE tools: T-APM, T-Trace, Catchpoint, Grafana
Hands-on experience and understanding of concepts and tools such as SAFe, Agile, DevOps, CI/CD, Data Analytics, and building Gen AI use cases
Experience with AI technologies, Python, SQL, data analytics, Power BI and ITSM tools (e.g., ServiceNow)
Experience with modern enterprise Release Management/Change Management and ITSM practices

Nice To Haves

Background in QA, test engineering, or automation engineering (strong plus)
Experience using AI or advanced analytics for incident analysis or pattern detection
Understanding of distributed systems and failure modes
Experience with data analysis / visualization tools (e.g., Power BI, Tableau)
Mindset focused on eliminating recurring issues, not just fixing incidents
Strong communication skills to explain complex issues clearly
BS/BA in Computer Science
Familiarity with modern Release Management processes for Agile and DevOps environments
Preferred tools: Jira Align, JSM, Jira Cloud, Git for enterprise RM/CM
Relevant certifications (SAFe, Agile, DevOps, AI/ML)

Responsibilities

Focus on understanding why production incidents happen and how to prevent them from recurring.
Analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses.
Turn incident insights into high-quality postmortems.
Partner with engineering teams to drive corrective actions and long-term improvements.
Combine system-level thinking with data, automation, and AI-assisted analysis to shift the organization from reactive response to proactive reliability and incident prevention.
Partner with engineering and software development teams to implement permanent fixes and preventive improvements.

Benefits

Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law require may be available but is unprotected
Adoption Reimbursement
Disability Benefits (short term and long term)
Life and Accidental Death Insurance
Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
Employee Assistance Programs (EAP)
Extensive employee wellness programs
Employee discounts up to 50% off on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available) and AT&T phone