Site Reliability Engineer (Cloud Engineer)-I

Innovaccer

Posted:

August 29, 2023

Job Commitment

Full-time

Experience Level

Mid Level

Workplace Type

Onsite

Job Function

Dev & Engineering

This job is closed

About the position

The Site Reliability Engineer (Cloud Engineer)-I will be responsible for building and automating secure cloud infrastructure, focusing on various pillars such as cost, reliability, scalability, and performance. They will collaborate with Dev and QA teams to drive continuous delivery and deployment, prioritize security by developing S-CICD and enabling security tool chains, and ensure observability of systems. Additionally, they will optimize system utilization, lead operations reviews, manage production and staging cloud platforms, and participate in incident response. The ideal candidate will have experience in DevOps/SRE, cloud automation, Kubernetes, scripting languages, and building scalable CICD architectures. Cloud security knowledge and strong communication skills are also important for this role.

Responsibilities

Building and automating secure cloud infrastructure (Infrastructure as Code, IaC) across pillars such as cost, reliability, scalability, performance, deployment, and service availability (SLA/SLO/SLI).
Building CICD stack and driving the organization to a new level of continuous delivery and deployment.
Working closely with CISO and Dev team(s) to make security a priority and develop S-CICD (Secure CICD).
Enabling various security tool chains and vulnerability reports to developers via automation.
Driving the observability charter across logs, metrics, service mesh, tracing, etc.
Collaborating closely with Dev and QA teams to increase adoption of DevOps practices and the tool chain.
Applying strong analytical skills to understand production system metrics, optimize system utilization, and drive cost efficiency.
Autoscaling the platform up and down to match demand, including peak-season scenarios.
Understanding end-to-end platform architecture and performing triage/RCA using data points from the observability tool chain.
Being part of the 24x7 on-call production support team.
Leading monthly operations review with the executive team, including Platform/Application/Infrastructure KPIs, security reports, audit reports, etc.
Operating and managing production and staging cloud platforms, with responsibility for Ops and site reliability engineering.
Ensuring the platform is secured as per guidelines established by CISO.
Leading least-privilege RBAC for production services and tool chains.
Building and executing the disaster recovery plan.
Participating as a key stakeholder in case of Incident Response.
Proven work experience in DevOps/SRE.
Solid experience with at least one cloud platform (AWS, Azure, GCP) with automation focus.
Hands-on experience with Kubernetes and Linux.
Programming experience with scripting languages like Python.
Building scalable CICD architectures and solutions.
Building observability stack from logs, metrics, traces, service mesh, data observability.
Building reliable, scalable, and performant systems in production.
Documenting and structuring documents for consumption by various dev teams.
Experience working in a Production environment with process focus.
Ticketing system and Incident management experience.
Cloud Security knowledge and skills.
Hands-on experience with technologies like Kafka, Postgres, Snowflake, etc.
Bachelor's Degree or equivalent.
Able to perform under pressure without taking shortcuts.
Strong collaboration, verbal, and written communication skills.

Requirements

1-4 years of proven work experience in DevOps/SRE
Solid experience with at least one cloud platform (AWS, Azure, or GCP) with an automation focus. Certification is an advantage
Hands-on experience with Kubernetes along with Linux
Programming experience with scripting languages e.g. Python
Build and deployment experience, including building scalable CICD architectures and solutions, is preferred
Building observability stack from logs, metrics, traces, service mesh, data observability is preferred
Building reliable, scalable, and performant systems in production. This requires significant engineering experience and risk evaluation
Good at documenting and structuring documents for consumption by various dev teams
Experience working in a Production environment with process focus is preferred
Ticketing system and incident management experience is preferred
Cloud security knowledge is a major advantage and a highly preferred skill
Hands-on experience with a few of the following is preferred: Kafka, Postgres, Snowflake, etc.
Bachelor’s Degree or equivalent
Able to stay calm and perform under pressure without taking shortcuts
Strong collaboration and excellent verbal and written communication skills are critical to this role, along with the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts
Strong cross-functional collaboration skills, relationship building skills, and ability to work in a team environment
Strong analytical skills to understand production system metrics, drive change, optimize system utilization, and drive cost efficiency