Join AT&T and help shape the future of communications and technology that connect the world. We value innovators who seek to explore the unknown and challenge the status quo. Bring your bold ideas and fearless spirit to redefine connectivity and transform how people share stories and experiences. At AT&T, you won’t just imagine the future; you’ll build it.

Lead System Engineer
As a Tier 2/Site Reliability Engineer (SRE), you will translate core business requirements into robust, scalable, and reliable technical solutions. You’ll play a pivotal role in designing and implementing applications, platforms, and services that power critical business operations, with a strong emphasis on high availability, performance, and compliance in cloud, messaging, and data environments.
The individual will bring a hybrid of traditional T2/SRE operations skills to support our Project Growth apps, along with the new, evolving Generative AI and Workflow Automation skill sets needed to drive operational efficiency and scalability. You will provide technical expertise and best practices for Java, Python, JavaScript, and Perl-based solutions, and bring a practical understanding of AI/ML concepts and their integration in enterprise platforms.

Key Responsibilities
The EngOps Tier 2/SRE team ensures applications and systems are highly reliable, scalable, and performant while fostering a collaborative culture between development and operations.
Work with the T1 team on incidents as triage lead during outages or critical issues.
Handle pager duty issues.
Minimize downtime and user impact during incidents.
Conduct detailed After Action Reviews involving all stakeholders and outline short-term and long-term resiliency options.
Eliminate recurrence of similar issues through systemic fixes.
Define and implement monitoring and alerting strategies tailored to the launch.
Collaborate with product development teams to gain deep insight into the application architecture, flows, and critical dependencies.
Monitor and evaluate key performance metrics such as latency, throughput, and error rates, and update alerts accordingly.
Propose architectural or operational changes to prevent recurrence.
Reduce Mean Time to Resolution (MTTR) for incidents.
Job Type
Full-time
Career Level
Mid Level