Director, Site Reliability Engineering

NBCUniversal, New York, NY
Remote

About The Position

As a member of NBCUniversal’s Production Software Engineering team, you will be responsible for leading and performing custom architectural design, implementation, monitoring, and maintenance for a portfolio of production application environments. You will be responsible for hands-on configuration and support as well as managing the work of other architects and engineers. Work closely with our Principal Software Engineer on technical architecture and design based on customer product requirements, translating product requirements to technical designs and implementations. Collaborate with cross-functional team members such as Scrum Leads, Software Engineers, QA Engineers, UX Designers, Product Managers, other Architects & Site Reliability Engineers (Contractors and/or Staff), and third-party vendors. Effectively delegate responsibilities to team members, mentoring and providing them with repeatable processes, and verifying the quality of their work. Utilize metrics to measure accomplishments and monitor progress, ensuring milestones and projects are completed on time. Communicate progress and the impact of solutions in technical terms to technology partners and in business terms to business partners. Establish a reputation as the subject matter expert for every tech stack used in Production Software Engineering applications and how they all fit together while keeping current with new technologies, developing innovative technical ideas, and generating proposals. Work with product teams to learn business objectives, development teams to plan platform needs, QA to understand test strategy, and SRE on environments and deployments. Participate in Scrums, demos, and other Agile ceremonies and ensure accurate and timely status updates to the team. Serve as primary interface with the NBCU Cyber Security team for all security-related initiatives, patching, remediations, etc.
Hands-on commissioning, configuration, administration, documentation, and support for all on-prem & cloud (AWS) environments (Servers, Storage, Databases, Networking, Security, etc.). Technical impact analysis, implementation, and monitoring of all cyber, technology audit, enterprise engineering, & IT (Databases, Monitoring, etc.) activities related to Production Software Engineering applications and platforms. Create and manage CI/CD pipelines using tools like CloudFormation, Foreman, Jenkins, Nexus, Rundeck, Ansible, and Puppet. Lead implementation of monitoring and reporting framework using tools like Grafana, Influx, Graylog/Splunk, Selenium, New Relic, and Icinga. Recognize and identify potential technical impacts of enterprise change controls which could affect our applications and customers. Help improve performance, scalability, and reliability. Build and maintain distributed infrastructure and automation. Solve problems quickly and automate processes for the future. Direct management of other engineers and architects (Contractors and/or Staff). 24x7x365 availability for production outages, emergencies, and deployments. 100% telecommuting is permitted for this role.

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, or related field (or foreign degree equivalent), plus 10 years of experience as a Software Architect, in the job offered, or in a related occupation.
  • Hands-on systems engineering experience on Linux/Unix platforms
  • Experience with technical leadership and people management
  • Experience with Continuous Delivery and SDLC practices
  • Familiarity with DevOps principles; experience with operational tools (Ansible, Puppet, or Chef; Terraform) and best practices for infrastructure (on-prem or cloud) and software deployment
  • Operational experience with large scale applications
  • Experience with NoSQL and relational data stores (MarkLogic, MongoDB, Cassandra, DynamoDB, Couchbase, PostgreSQL, etc.)
  • Experience with a broad range of enterprise technologies
  • Experience building real-time, large-scale, low-latency distributed systems
  • Experience with Agile tools like Jira, GitHub or similar.
  • Experience using AWS Cloud in a production environment
  • Experience with AWS IAM, EC2, RDS, S3, Lambda, Batch, and Step Functions.

Responsibilities

  • Responsible for leading and performing custom architectural design, implementation, monitoring, and maintenance for a portfolio of production application environments.
  • Responsible for hands-on configuration and support as well as managing the work of other architects and engineers.
  • Work closely with our Principal Software Engineer on technical architecture and design based on customer product requirements, translating product requirements to technical designs and implementations.
  • Collaborate with cross-functional team members such as Scrum Leads, Software Engineers, QA Engineers, UX Designers, Product Managers, other Architects & Site Reliability Engineers (Contractors and/or Staff), and third-party vendors.
  • Effectively delegate responsibilities to team members, mentoring and providing them with repeatable processes, and verifying the quality of their work.
  • Utilize metrics to measure accomplishments and monitor progress, ensuring milestones and projects are completed on time.
  • Communicate progress and the impact of solutions in technical terms to technology partners and in business terms to business partners.
  • Establish a reputation as the subject matter expert for every tech stack used in Production Software Engineering applications and how they all fit together while keeping current with new technologies, developing innovative technical ideas, and generating proposals.
  • Work with product teams to learn business objectives, development teams to plan platform needs, QA to understand test strategy, and SRE on environments and deployments.
  • Participate in Scrums, demos, and other Agile ceremonies and ensure accurate and timely status updates to the team.
  • Serve as primary interface with the NBCU Cyber Security team for all security-related initiatives, patching, remediations, etc.
  • Hands-on commissioning, configuration, administration, documentation, and support for all on-prem & cloud (AWS) environments (Servers, Storage, Databases, Networking, Security, etc.).
  • Technical impact analysis, implementation, and monitoring of all cyber, technology audit, enterprise engineering, & IT (Databases, Monitoring, etc.) activities related to Production Software Engineering applications and platforms.
  • Create and manage CI/CD pipelines using tools like CloudFormation, Foreman, Jenkins, Nexus, Rundeck, Ansible, and Puppet.
  • Lead implementation of monitoring and reporting framework using tools like Grafana, Influx, Graylog/Splunk, Selenium, New Relic, and Icinga.
  • Recognize and identify potential technical impacts of enterprise change controls which could affect our applications and customers.
  • Help improve performance, scalability, and reliability.
  • Build and maintain distributed infrastructure and automation.
  • Solve problems quickly and automate processes for the future.
  • Direct management of other engineers and architects (Contractors and/or Staff).
  • 24x7x365 availability for production outages, emergencies, and deployments.

Benefits

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • 401(k)
  • Paid leave
  • Tuition reimbursement
  • A variety of other discounts and perks