SRE Druid Support

Infosys•Austin, TX

4d•Onsite

About The Position

In the assigned job role of Infrastructure Consultant 2, your area of responsibility will be as follows:

Collaborate with internal and client teams to resolve complex incidents, conduct root cause analyses, and document findings with preventive recommendations
Participate in evaluation of client IT infrastructure, prepare actionable assessment reports, and support due diligence to document infrastructure maturity and improvement opportunities
Contribute to the design of scalable, cost-effective IT infrastructure solutions, review reusable components, and develop technical documentation for deployed systems
Align release schedules and environment readiness, execute deployments as per protocols, perform post-deployment testing, and manage version control to track changes
Coordinate maintenance schedules, emergency fixes, and technology upgrades while ensuring uninterrupted integration into existing systems and processes
Facilitate performance data analysis across systems, coordinate insights on system behavior, and support capacity planning to optimize performance
Conduct security checks, recovery drills, and compliance audits, implement security measures, and coordinate continuity plans to maintain adherence to standards
Gather feedback to identify automation opportunities, analyze existing infrastructure processes, and propose enhancements for efficiency gains
Act as liaison with onsite, offshore, and vendor teams to document project requirements, ensuring effective collaboration
Develop a centralized repository of technical and procedural knowledge, leveraging insights from other projects to drive efficiency and retain organizational expertise

Your contribution to the team:

A collaborative spirit and excellent communication skills
Ability to handle complex incidents and implement resolutions
A knack for conducting IT infrastructure assessments and identifying key optimization opportunities
A focused approach to deployment management, system optimization, and process automation initiatives, including a sector-specific focus
The ability to work with cross-functional teams

Requirements

Deep understanding and experience in administration & usage of Apache Druid at scale.
Deep understanding of and experience with one or more of the following: Kubernetes, AWS, Hadoop, Flink, Docker, Spinnaker, Helm.
Understanding of SRE principles and goals, along with solid on-call experience.
Experience with and understanding of scaling, capacity planning, and disaster recovery.
This role involves close collaboration with systems and network engineers, DBAs, and monitoring and security teams.
You will serve as the primary point of contact for ingestion and query services, with a particular focus on technologies like Druid running across diverse environments including AWS, Kubernetes, and bare metal, and you will leverage your expertise in systems like Kafka, Flink, and Hadoop to ensure adherence to Service Level Agreements (SLAs).
Bachelor’s degree or foreign equivalent required from an accredited institution. Will also consider three years of progressive experience in the specialty in lieu of every year of education.

Nice To Haves

Experience supporting Java applications is a plus.
Experience using monitoring and logging solutions such as Prometheus, Grafana, and Splunk.
Experience with AWS and Hadoop.

Responsibilities

Ensure 24x7 availability and stability of Druid and supporting platforms
Perform cluster operations (start, stop, restart, rolling upgrades)
Perform capacity planning and infrastructure scaling
Perform performance tuning and resource optimization
Participate in incident management, root cause analysis (RCA), and problem management
Prepare and maintain SOPs, runbooks, and operational documentation
Support change management, patching, upgrades, and security compliance
Develop and maintain code and documentation to solve critical challenges within some of the world's largest systems, and improve the entire service lifecycle from design to decommissioning