SRE Consultant

NTT DATA Services, LLC

25d•Hybrid

About The Position

Site Reliability Engineering Consultant – New Jersey, Hybrid (2 days/wk)

The Site Reliability Engineering Consultant will be responsible for the development and overall implementation of software in a complex, critical, large, cross-departmental and multi-disciplinary area. This role is part of a multi-year transformation journey that will require the successful candidate to establish best practices and to motivate and promote a cultural shift that ensures successful adoption of Engineering Principles and Practices within Production Management.

The role:
Requires a comprehensive understanding of multiple areas within a function and how they interact to achieve the objectives of the function.
Applies an in-depth understanding of the business impact of technical contributions.
Is accountable for delivery of a full range of end-to-end projects.
Requires excellent communication skills to negotiate internally.
Involves short- to medium-term planning of actions and resources for own area.

Requirements

Relevant experience in a critical software development role with high business impact, and the ability to understand how software delivers business value
Excellent engineering skills and senior-level architecture experience
Excellent working knowledge of key computer science concepts (networking, operating systems, virtualization, containerization, etc.)
Polyglot full-stack developer mentality and ability to pick up new languages and skills
Excellent understanding of Software Engineering concepts like Software Development Life Cycle and GitOps
Excellent debugging and analytical skills: ability to isolate root cause across networking/infrastructure, application and database stacks
Experience delivering software using Agile delivery methodologies (Scrum/Kanban) is a must
Strong experience with end-to-end observability stacks (Datadog, AppDynamics, Dynatrace, etc.) is desirable
Degree in computer science/mathematics/physics or related technical subject is desirable
Experience of senior stakeholder management
Consistently demonstrates clear and concise written and verbal communication skills
9+ years in a site reliability engineering related role with proven hands-on expertise and the capability to demonstrate technical proficiency in the following:
Programming (Java, Python, or equivalent)
Containerization
Kubernetes
GitOps
High Availability Systems
Infrastructure as Code
Configuration Management
Observability (tools and implementation)
Hyperscale Systems
Middleware configuration

Nice To Haves

Operational experience of deploying and running services at scale on top of Docker/Kubernetes stack and a service mesh, like Istio, is highly desirable
Operational experience with orchestration tools for CI/CD and Infrastructure-as-Code tooling (Terraform, CloudFormation, etc.) is highly desirable
Operational experience of using middleware technologies (MQ, Apache Kafka, etc.) to run services at scale is desirable

Responsibilities

Demonstrate an in-depth understanding of Software Development Lifecycle and how it integrates within the overall technology landscape to deliver scalable, reliable and resilient applications.
Operate in a global environment with on-/near-/off-shore matrix reporting structures.
Operate in a highly regulated environment that requires in-depth understanding of the regulatory requirements and the industry implications for our technologies.
Improve the service level the team provides to our end users, including maximizing operational efficiencies and strengthening incident management, problem management and knowledge-sharing practices.
Drive Continuous Delivery and Automation efforts across the supported applications by means of Root Cause Analysis reviews, Knowledge management, Performance tuning, and user training.
Foster a culture that promotes transparency and innovation for increased team productivity.
Coach team members and colleagues outside the immediate reporting line on best practices, and recognize anti-patterns so that they are quickly addressed.
Implement Agile through a framework such as Scrum or Kanban and ensure it integrates with overall organization processes.
Proactively communicate progress and project status across the organization and ensure that stakeholders are managed appropriately throughout the execution period.