This position requires office presence of a minimum of 5 days per week and is only located in the location(s) posted. No relocation is offered.

Join AT&T and help shape the future of communications and technology that connect the world. We value innovators who seek to explore the unknown and challenge the status quo. Bring your bold ideas and fearless spirit to redefine connectivity and transform how people share stories and experiences. At AT&T, you won’t just imagine the future; you’ll build it.

Sr. System Engineer (AI Automation Engineer SRE Focus)

Role Overview - AI-Driven Reliability, Automation & Platform Engineering

We are seeking a Lead AI Automation Engineer with a strong Site Reliability Engineering (SRE) mindset to design, implement, and operate AI-driven automation and intelligent reliability capabilities across mission-critical Front Office (CRM) and Back Office (Supply Chain, Logistics, and ERP) platforms.

This role sits at the intersection of AI automation, AIOps, platform reliability, and enterprise application engineering. You will leverage Generative AI, Large Language Models (LLMs), Agentic AI, and autonomous automation frameworks to dramatically improve system resilience, incident response, observability, and operational efficiency across complex Oracle-based and SaaS ecosystems.

You will be accountable not just for keeping systems running, but for engineering self-healing, predictive, and continuously improving platforms that reduce human toil, prevent incidents before they occur, and scale reliably as the business grows.

What You’ll Do

AI-Driven Reliability & Automation Engineering

Architect and deliver AI-powered automation solutions for production operations, including intelligent incident triage, root cause analysis, remediation, and prevention.
Design Agentic AI workflows that autonomously monitor systems, analyze anomalies, trigger corrective actions, and orchestrate recovery across ERP, supply chain, and integration layers.
Apply AIOps techniques to correlate metrics, logs, events, and traces for predictive alerting, noise reduction, and proactive reliability improvements.
Develop LLM-enabled runbooks and intelligent assistants to guide operational decision-making, accelerate incident response, and upskill operations teams.

Site Reliability Engineering (SRE) & Production Operations

Own platform stability, uptime, and performance across Oracle EBS/ERP, Oracle Fusion Cloud, and supply chain execution systems.
Lead incident management, coordinating rapid response, containing impact, and ensuring SLA adherence.
Conduct blameless postmortems, using AI-assisted RCA to identify systemic issues and drive automation-first corrective actions.
Partner with development teams to embed reliability, scalability, and observability requirements into system design and delivery.

Enterprise Application & Supply Chain Support

Provide advanced production support for Oracle EBS/ERP modules including Procurement, Order Management, Inventory, AR, AP, FA, Project Accounting, and Supply Chain Planning.
Support end-to-end supply chain flows including Procure-to-Pay, Order-to-Cash, inventory transactions, fulfillment, shipping, and reconciliation processes.
Troubleshoot complex issues across configuration, master data, transactions, batch jobs, interfaces, and integrations, leveraging deep SQL and system-level analysis.
Monitor and support 3rd-party platforms (O9, Blue Yonder/JDA, RELEX) and integrations with WMS, 3PL, and logistics providers.

Observability, Monitoring & Intelligence

Build and evolve AI-augmented observability solutions using tools such as Dynatrace, AppDynamics, Splunk, ELK, Grafana, and custom ML models.
Implement predictive health monitoring, capacity forecasting, and intelligent service-level indicators (SLIs/SLOs).
Replace static alerts with context-aware, AI-ranked alerts that reduce noise and accelerate resolution.
Create autonomous dashboards that surface actionable insights rather than raw metrics.

Integration & Automation Excellence

Diagnose and remediate integration failures across Oracle SOA/OIC, MuleSoft, Kafka/JMS, EDI, and event-driven architectures.
Automate error handling, replay, deduplication, and reconciliation for high-volume interfaces using AI-assisted logic.
Collaborate with middleware, cloud, and vendor teams to resolve cross-system defects, data mismatches, latency issues, and sequencing problems.
Continuously identify and eliminate manual operational toil through intelligent automation and self-service tooling.

Release, Cloud & Platform Engineering

Support release management, ensuring changes meet reliability, security, and performance standards.
Apply DevOps and SRE practices including automation-first deployments, rollback strategies, and resilience testing.
Leverage cloud-native and containerized platforms (Docker, Kubernetes, Azure) to support scalable, resilient workloads.
Participate in on-call rotations, with a strong emphasis on automation and AI-driven reduction of recurring incidents.
Job Type: Full-time
Career Level: Mid Level
Number of Employees: 5,001-10,000