Sr. System Engineer (AI Automation Engineer SRE Focus)

AT&T•Alpharetta, GA

6d•Onsite

About The Position

This position requires office presence of a minimum of 5 days per week and is only located in the location(s) posted. No relocation is offered.

Join AT&T and help shape the future of communications and technology that connect the world. We value innovators who seek to explore the unknown and challenge the status quo. Bring your bold ideas and fearless spirit to redefine connectivity and transform how people share stories and experiences. At AT&T, you won’t just imagine the future; you’ll build it.

Sr. System Engineer (AI Automation Engineer SRE Focus)

Role Overview - AI-Driven Reliability, Automation & Platform Engineering

We are seeking a Lead AI Automation Engineer with a strong Site Reliability Engineering (SRE) mindset to design, implement, and operate AI-driven automation and intelligent reliability capabilities across mission-critical Front Office (CRM) and Back Office (Supply Chain, Logistics, and ERP) platforms.

This role sits at the intersection of AI automation, AIOps, platform reliability, and enterprise application engineering. You will leverage Generative AI, Large Language Models (LLMs), Agentic AI, and autonomous automation frameworks to dramatically improve system resilience, incident response, observability, and operational efficiency across complex Oracle-based and SaaS ecosystems.

You will be accountable not just for keeping systems running, but for engineering self-healing, predictive, and continuously improving platforms that reduce human toil, prevent incidents before they occur, and scale reliably as the business grows.

What You’ll Do

AI-Driven Reliability & Automation Engineering

Architect and deliver AI-powered automation solutions for production operations, including intelligent incident triage, root cause analysis, remediation, and prevention.
Design Agentic AI workflows that autonomously monitor systems, analyze anomalies, trigger corrective actions, and orchestrate recovery across ERP, supply chain, and integration layers.
Apply AIOps techniques to correlate metrics, logs, events, and traces for predictive alerting, noise reduction, and proactive reliability improvements.
Develop LLM-enabled runbooks and intelligent assistants to guide operational decision-making, accelerate incident response, and upskill operations teams.

Site Reliability Engineering (SRE) & Production Operations

Own platform stability, uptime, and performance across Oracle EBS/ERP, Oracle Fusion Cloud, and supply chain execution systems.
Lead incident management, coordinating rapid response, containing impact, and ensuring SLA adherence.
Conduct blameless postmortems, using AI-assisted RCA to identify systemic issues and drive automation-first corrective actions.
Partner with development teams to embed reliability, scalability, and observability requirements into system design and delivery.

Enterprise Application & Supply Chain Support

Provide advanced production support for Oracle EBS/ERP modules including Procurement, Order Management, Inventory, AR, AP, FA, Project Accounting, and Supply Chain Planning.
Support end-to-end supply chain flows including Procure-to-Pay, Order-to-Cash, inventory transactions, fulfillment, shipping, and reconciliation processes.
Troubleshoot complex issues across configuration, master data, transactions, batch jobs, interfaces, and integrations, leveraging deep SQL and system-level analysis.
Monitor and support 3rd-party platforms (O9, Blue Yonder/JDA, RELEX) and integrations with WMS, 3PL, and logistics providers.

Observability, Monitoring & Intelligence

Build and evolve AI-augmented observability solutions using tools such as Dynatrace, AppDynamics, Splunk, ELK, Grafana, and custom ML models.
Implement predictive health monitoring, capacity forecasting, and intelligent service-level indicators (SLIs/SLOs).
Replace static alerts with context-aware, AI-ranked alerts that reduce noise and accelerate resolution.
Create autonomous dashboards that surface actionable insights rather than raw metrics.

Integration & Automation Excellence

Diagnose and remediate integration failures across Oracle SOA/OIC, MuleSoft, Kafka/JMS, EDI, and event-driven architectures.
Automate error handling, replay, deduplication, and reconciliation for high-volume interfaces using AI-assisted logic.
Collaborate with middleware, cloud, and vendor teams to resolve cross-system defects, data mismatches, latency issues, and sequencing problems.
Continuously identify and eliminate manual operational toil through intelligent automation and self-service tooling.

Release, Cloud & Platform Engineering

Support release management, ensuring changes meet reliability, security, and performance standards.
Apply DevOps and SRE practices including automation-first deployments, rollback strategies, and resilience testing.
Leverage cloud-native and containerized platforms (Docker, Kubernetes, Azure) to support scalable, resilient workloads.
Participate in on-call rotations, with a strong emphasis on automation and AI-driven reduction of recurring incidents.

Requirements

4+ years of experience across enterprise application engineering, SRE, and production operations, with an automation-first mindset.
Proven experience driving AI-based automation, AIOps, or intelligent operational tooling in complex enterprise environments.
Strong ownership mentality for system reliability, performance, and customer impact.
Hands-on experience with Generative AI, LLMs, or Agentic AI frameworks applied to automation, monitoring, or operations.
Proficiency in Python, Shell scripting, SQL/PLSQL, and automation frameworks.
Deep experience with Oracle EBS and/or Oracle Fusion Cloud (AR, AP, FA, PO, INV, OM, PA, Planning).
Strong knowledge of observability platforms: Dynatrace, AppDynamics, Splunk, ELK, Grafana.
Experience with integration technologies: Oracle SOA/OIC, MuleSoft, Kafka/JMS, EDI.
Familiarity with containers and cloud platforms (Docker, Kubernetes, Azure).
Exceptional problem-solving, analytical, and systems-thinking abilities.
Strong communication skills, capable of explaining complex AI-driven and technical concepts to both technical and non-technical stakeholders.
Experience leading incidents, facilitating postmortems, and driving cultural adoption of blameless SRE principles.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.

Nice To Haves

Experience building AI-enhanced runbooks, chatbots, or autonomous operational workflows.
Ability to translate operational patterns into repeatable, intelligent automation.

Responsibilities

Architect and deliver AI-powered automation solutions for production operations, including intelligent incident triage, root cause analysis, remediation, and prevention.
Design Agentic AI workflows that autonomously monitor systems, analyze anomalies, trigger corrective actions, and orchestrate recovery across ERP, supply chain, and integration layers.
Apply AIOps techniques to correlate metrics, logs, events, and traces for predictive alerting, noise reduction, and proactive reliability improvements.
Develop LLM-enabled runbooks and intelligent assistants to guide operational decision-making, accelerate incident response, and upskill operations teams.
Own platform stability, uptime, and performance across Oracle EBS/ERP, Oracle Fusion Cloud, and supply chain execution systems.
Lead incident management, coordinating rapid response, containing impact, and ensuring SLA adherence.
Conduct blameless postmortems, using AI-assisted RCA to identify systemic issues and drive automation-first corrective actions.
Partner with development teams to embed reliability, scalability, and observability requirements into system design and delivery.
Provide advanced production support for Oracle EBS/ERP modules including Procurement, Order Management, Inventory, AR, AP, FA, Project Accounting, and Supply Chain Planning.
Support end-to-end supply chain flows including Procure-to-Pay, Order-to-Cash, inventory transactions, fulfillment, shipping, and reconciliation processes.
Troubleshoot complex issues across configuration, master data, transactions, batch jobs, interfaces, and integrations, leveraging deep SQL and system-level analysis.
Monitor and support 3rd-party platforms (O9, Blue Yonder/JDA, RELEX) and integrations with WMS, 3PL, and logistics providers.
Build and evolve AI-augmented observability solutions using tools such as Dynatrace, AppDynamics, Splunk, ELK, Grafana, and custom ML models.
Implement predictive health monitoring, capacity forecasting, and intelligent service-level indicators (SLIs/SLOs).
Replace static alerts with context-aware, AI-ranked alerts that reduce noise and accelerate resolution.
Create autonomous dashboards that surface actionable insights rather than raw metrics.
Diagnose and remediate integration failures across Oracle SOA/OIC, MuleSoft, Kafka/JMS, EDI, and event-driven architectures.
Automate error handling, replay, deduplication, and reconciliation for high-volume interfaces using AI-assisted logic.
Collaborate with middleware, cloud, and vendor teams to resolve cross-system defects, data mismatches, latency issues, and sequencing problems.
Continuously identify and eliminate manual operational toil through intelligent automation and self-service tooling.
Support release management, ensuring changes meet reliability, security, and performance standards.
Apply DevOps and SRE practices including automation-first deployments, rollback strategies, and resilience testing.
Leverage cloud-native and containerized platforms (Docker, Kubernetes, Azure) to support scalable, resilient workloads.
Participate in on-call rotations, with a strong emphasis on automation and AI-driven reduction of recurring incidents.

Benefits

Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law requires may be available but is unprotected
Adoption Reimbursement
Disability Benefits (short term and long term)
Life and Accidental Death Insurance
Supplemental benefit programs: critical illness, accident hospital indemnity, and group legal
Employee Assistance Programs (EAP)
Extensive employee wellness programs
Employee discounts of up to 50% on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available), and AT&T phone.