Site Reliability Engineer (Cloud Engineer)-I
Innovaccer · Posted: August 29, 2023 · Onsite
About the position
The Site Reliability Engineer (Cloud Engineer)-I will be responsible for building and automating secure cloud infrastructure, focusing on various pillars such as cost, reliability, scalability, and performance. They will collaborate with Dev and QA teams to drive continuous delivery and deployment, prioritize security by developing S-CICD and enabling security tool chains, and ensure observability of systems. Additionally, they will optimize system utilization, lead operations reviews, manage production and staging cloud platforms, and participate in incident response. The ideal candidate will have experience in DevOps/SRE, cloud automation, Kubernetes, scripting languages, and building scalable CICD architectures. Cloud security knowledge and strong communication skills are also important for this role.
Responsibilities
- Building and automating secure cloud infrastructure (Infrastructure as Code, IaC) across pillars such as cost, reliability, scalability, performance, deployment, and service availability (SLA/SLO/SLI).
- Building the CICD stack and driving the organization to a new level of continuous delivery and deployment.
- Working closely with CISO and Dev team(s) to make security a priority and develop S-CICD (Secure CICD).
- Enabling various security tool chains and vulnerability reports to developers via automation.
- Driving observability charter spanning across logs, metrics, mesh, tracing, etc.
- Collaborating closely with Dev and QA teams to increase adoption of DevOps practices and tool chains.
- Applying strong analytical skills to understand production system metrics, optimize system utilization, and drive cost efficiency.
- Automatically scaling the platform up and down to handle peak-season load.
- Understanding end-to-end platform architecture and performing triage/RCA by looking at various data points derived from observability tool chain.
- Being part of the 24x7 OnCall Production Support team.
- Leading monthly operations review with the executive team, including Platform/Application/Infrastructure KPIs, security reports, audit reports, etc.
- Operating and managing production and staging cloud platforms, with responsibility for Ops and Site Reliability Engineering.
- Ensuring the platform is secured as per guidelines established by CISO.
- Leading least-privilege RBAC for various production services and tool chains.
- Building and executing the Disaster Recovery plan.
- Participating as a key stakeholder in incident response.
Requirements
- Proven work experience of 1-4 years in DevOps/SRE
- Solid experience with at least one cloud platform (AWS, Azure, GCP) with an automation focus. Certification is an advantage
- Hands-on experience with Kubernetes along with Linux
- Programming experience with scripting languages such as Python
- Build and deployment experience, including building scalable CICD architectures and solutions, is preferred
- Experience building an observability stack covering logs, metrics, traces, service mesh, and data observability is preferred
- Building reliable, scalable, and performant systems in Production. This requires significant engineering experience and risk evaluation
- Good at documenting and structuring documents for consumption by various dev teams
- Experience working in a Production environment with process focus is preferred
- Ticketing system and incident management experience is preferred
- Cloud security knowledge is a major advantage and a highly preferred skill
- Hands-on experience with a few of these is preferred: Kafka, Postgres, Snowflake, etc.
- Bachelor’s Degree or equivalent
- Able to stay calm and perform under pressure without taking any shortcuts
- Collaboration and communication are critical to this role: excellent verbal and written communication skills and the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts
- Strong cross-functional collaboration skills, relationship-building skills, and ability to work in a team environment
- Strong analytical skills to understand production system metrics, drive change, optimize system utilization, and drive cost efficiency
Benefits
- Industry-Focused Certifications
- Rewards and Recognition
- Health Insurance and Mental Well-being
- Sabbatical Leave Policy
- Open Floor Plan
- Paternity and Maternity Leave