Lead Site Reliability Engineer (SRE)

AT&T

Onsite

About The Position

Join AT&T and help shape the future of communications and technology that connect the world. We value innovators who seek to explore the unknown and challenge the status quo. Bring your bold ideas and fearless spirit to redefine connectivity and transform how people share stories and experiences. At AT&T, you won’t just imagine the future; you’ll build it.

Lead System Engineer

As a Tier 2/Site Reliability Engineer (SRE), you will translate core business requirements into robust, scalable, and reliable technical solutions. You’ll play a pivotal role in designing and implementing applications, platforms, and services that power critical business operations, with a strong emphasis on high availability, performance, and compliance in cloud, messaging, and data environments.

The ideal candidate combines traditional T2/SRE operations skills, used to support our Project Growth apps, with the new and evolving Generative AI and Workflow Automation skill sets needed to drive operational efficiency and scalability.

Requirements

Education: Bachelor’s degree in Computer Science, Information Systems, or a related discipline.
Experience: Over 10 years of hands-on experience architecting and building scalable platforms and applications in cloud/data environments.
Ability to provide technical expertise and best practices for Java, Python, JavaScript, and Perl-based solutions.
Strong knowledge of network and telecom standards (3GPP, TM Forum, etc.).

Nice To Haves

Practical understanding of AI/ML concepts and their integration in enterprise platforms.

Responsibilities

The EngOps Tier 2/SRE team ensures applications and systems are highly reliable, scalable, and performant while fostering a collaborative culture between development and operations.
Work with the T1 team as triage lead on incidents during outages or critical issues.
Handle pager duty issues.
Minimize downtime and user impact during incidents.
Conduct detailed After Action Reviews involving all stakeholders and outline short-term and long-term resiliency options.
Eliminate recurrence of similar issues through systemic fixes.
Define and implement monitoring and alerting strategies tailored to the launch.
Collaborate with product development teams to gain deep insight into the application architecture, flows, and critical dependencies.
Monitor and evaluate key performance metrics such as latency, throughput, and error rates, and update alerts accordingly.
Propose architectural or operational changes to prevent recurrence.
Reduce Mean Time to Resolution (MTTR) for incidents.

Benefits

Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law require may be available but is unprotected
Adoption Reimbursement
Disability Benefits (short term and long term)
Life and Accidental Death Insurance
Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
Employee Assistance Programs (EAP)
Extensive employee wellness programs
Employee discounts of up to 50% on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available), and AT&T phone.