About The Position

Join AT&T and help shape the future of communications and technology that connect the world. We value innovators who seek to explore the unknown and challenge the status quo. Bring your bold ideas and fearless spirit to redefine connectivity and transform how people share stories and experiences. At AT&T, you won’t just imagine the future; you’ll build it.

Lead System Engineer

As a Tier 2/Site Reliability Engineer (SRE), you will translate core business requirements into robust, scalable, and reliable technical solutions. You’ll play a pivotal role in designing and implementing applications, platforms, and services that power critical business operations, with a strong emphasis on high availability, performance, and compliance in cloud, messaging, and data environments.

The ideal candidate combines traditional T2/SRE operations skills, needed to support our Project Growth apps, with the new and evolving Generative AI and Workflow Automation skill sets needed to drive operational efficiency and scalability.

Requirements

  • Education: Bachelor’s degree in Computer Science, Information Systems, or a related discipline.
  • Experience: Over 10 years of hands-on experience in architecting and building scalable platforms and applications in cloud/data environments.
  • Provide technical expertise and best practices for Java, Python, JavaScript, and Perl-based solutions.
  • Strong knowledge of network and telecom standards (3GPP, TM Forum, etc.).

Nice To Haves

  • Practical understanding of AI/ML concepts and their integration in enterprise platforms.

Responsibilities

  • The EngOps Tier 2/SRE team ensures applications and systems are highly reliable, scalable, and performant while fostering a collaborative culture between development and operations.
  • Work with the T1 team as triage lead on incidents during outages or critical issues.
  • Respond to pager duty issues.
  • Minimize downtime and user impact during incidents.
  • Conduct detailed After Action Reviews involving all stakeholders and outline short-term and long-term resiliency options.
  • Eliminate recurrence of similar issues through systemic fixes.
  • Define and implement monitoring and alerting strategies tailored to the launch.
  • Collaborate with Product development teams to gain deep insight into the application architecture, flows and critical dependencies.
  • Monitor and evaluate key performance metrics such as latency, throughput, and error rates, and update alerts accordingly.
  • Propose architectural or operational changes to prevent recurrence.
  • Reduce Mean Time to Resolution (MTTR) for incidents.

Benefits

  • Medical/Dental/Vision coverage
  • 401(k) plan
  • Tuition reimbursement program
  • Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
  • Paid Parental Leave
  • Paid Caregiver Leave
  • Additional sick leave beyond what state and local law require may be available but is unprotected
  • Adoption Reimbursement
  • Disability Benefits (short term and long term)
  • Life and Accidental Death Insurance
  • Supplemental benefit programs: critical illness/accident hospital indemnity/group legal
  • Employee Assistance Programs (EAP)
  • Extensive employee wellness programs
  • Employee discounts up to 50% off on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available) and AT&T phone.