Principal System Engineering

AT&T
Atlanta, GA
Onsite

About The Position

This position requires office presence of a minimum of 5 days per week and is only located in the location(s) posted. No relocation is offered. AT&T will not hire any applicants for this position who require employer sponsorship now or in the future.

Join AT&T and reimagine the communications and technologies that connect the world. The Chief Information Office is responsible for advancing information technology performance and delivering solutions with a focus on maximizing ROI, increasing efficiency and enhancing the experience of end users. Guided by experienced leaders, Corporate Systems seamlessly integrate with advanced Technology and Operations to drive our enterprise forward. Our Systems Reliability and Software Delivery teams are unwavering in their commitment to excellence, ensuring every solution is robust and efficient. When you step into a career with AT&T, you won’t just imagine the future; you’ll create it.

In this role, you will focus on understanding why production incidents happen and how to prevent them from recurring. You will analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses. You will turn incident insights into high-quality postmortems and partner with engineering teams to drive corrective actions and long-term improvements. By combining system-level thinking with data, automation, and AI-assisted analysis, you will help shift the organization from reactive response to proactive reliability and incident prevention. You will partner with engineering and software development teams to implement permanent fixes and preventive improvements.

Requirements

  • Proven experience performing deep root cause analysis (RCA) for production incidents
  • Strong understanding of end-to-end system architecture (cloud, web apps, APIs, databases, infrastructure)
  • Hands-on experience with observability tools (logs, metrics, traces)
  • Ability to identify patterns and drive preventive actions
  • Experience writing clear, structured postmortems
  • Ability to analyze operational data using tools, queries, or AI-assisted methods
  • Strong systems thinking and problem-solving skills
  • 7+ years in Systems Engineering, ITSM, RM/CM
  • Background in SRE, Support or QA
  • One or more of the following SRE Tools: T-APM, T-Trace, CatchPoint, Grafana
  • Hands-on experience and understanding of concepts and tools such as SAFe, Agile, DevOps, CI/CD, Data Analytics, and building Gen AI use cases
  • Experience with AI technologies, Python, SQL, data analytics, Power BI and ITSM tools (e.g., ServiceNow)
  • Modern Enterprise Release Management/Change Management and ITSM

Nice To Haves

  • Background in QA, test engineering, or automation engineering (strong plus)
  • Experience using AI or advanced analytics for incident analysis or pattern detection
  • Understanding of distributed systems and failure modes
  • Experience with data analysis / visualization tools (e.g., Power BI, Tableau)
  • Mindset focused on eliminating recurring issues, not just fixing incidents
  • Strong communication skills to explain complex issues clearly
  • BS/BA in Computer Science
  • Preferred tools: modern Release Management processes for Agile and DevOps environments
  • Jira Align, JSM, Jira Cloud, Git for enterprise RM/CM
  • Relevant certifications (SAFe, Agile, DevOps, AI/ML)

Responsibilities

  • Focus on understanding why production incidents happen and how to prevent them from recurring.
  • Analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses.
  • Turn incident insights into high-quality postmortems.
  • Partner with engineering teams to drive corrective actions and long-term improvements.
  • Combine system-level thinking with data, automation, and AI-assisted analysis to shift the organization from reactive response to proactive reliability and incident prevention.
  • Partner with engineering and software development teams to implement permanent fixes and preventive improvements.

Benefits

  • Medical/Dental/Vision coverage
  • 401(k) plan
  • Tuition reimbursement program
  • Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
  • Paid Parental Leave
  • Paid Caregiver Leave
  • Additional sick leave beyond what state and local law require may be available but is unprotected
  • Adoption Reimbursement
  • Disability Benefits (short term and long term)
  • Life and Accidental Death Insurance
  • Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
  • Employee Assistance Programs (EAP)
  • Extensive employee wellness programs
  • Employee discounts up to 50% off on eligible AT&T mobility plans and accessories
  • AT&T internet (and fiber where available) and AT&T phone.