SRE Druid Support

Infosys
Austin, TX
Onsite

About The Position

In the assigned job role of Infrastructure Consultant 2, your areas of responsibility will be as follows:

  • Collaborate with internal and client teams to resolve complex incidents, conduct root cause analyses, and document findings with preventive recommendations
  • Participate in evaluation of client IT infrastructure, prepare actionable assessment reports, and support due diligence to document infrastructure maturity and improvement opportunities
  • Contribute to the design of scalable, cost-effective IT infrastructure solutions, review reusable components, and develop technical documentation for deployed systems
  • Align release schedules and environment readiness, execute deployments as per protocols, perform post-deployment testing, and manage version control to track changes
  • Coordinate maintenance schedules, emergency fixes, and technology upgrades while ensuring uninterrupted integration into existing systems and processes
  • Facilitate performance data analysis across systems, coordinate insights on system behavior, and support capacity planning to optimize performance
  • Conduct security checks, recovery drills, and compliance audits, implement security measures, and coordinate continuity plans to maintain adherence to standards
  • Gather feedback to identify automation opportunities, analyze existing infrastructure processes, and propose enhancements for efficiency gains
  • Act as liaison with onsite, offshore, and vendor teams to document project requirements, ensuring effective collaboration
  • Develop a centralized repository of technical and procedural knowledge, leveraging insights from other projects to drive efficiency and retain organizational expertise

Your contribution to the team:

  • A collaborative spirit and excellent communication skills
  • Ability to handle complex incidents and implement resolutions
  • A knack for conducting IT infrastructure assessments and identifying key optimization opportunities
  • A focused approach towards deployment management, system optimization, and process automation initiatives, including a sector-specific focus
  • The ability to work with cross-functional teams

Requirements

  • Deep understanding of and experience in the administration and use of Apache Druid at scale.
  • Deep understanding of and experience in one or more of the following: Kubernetes, AWS, Hadoop, Flink, Docker, Spinnaker, Helm.
  • Understanding of SRE principles and goals, along with solid on-call experience.
  • Experience with and understanding of scaling, capacity planning, and disaster recovery.
  • This role involves close collaboration with systems and network engineers, DBAs, and monitoring and security teams.
  • Serve as the primary point of contact for ingestion and query services, with a particular focus on technologies like Druid running across diverse environments including AWS, Kubernetes, and bare metal, and leverage your expertise in systems like Kafka, Flink, and Hadoop to ensure adherence to Service Level Agreements (SLAs).
  • Bachelor’s degree or foreign equivalent required from an accredited institution. Three years of progressive experience in the specialty will also be considered in lieu of each year of education.

Nice To Haves

  • Experience supporting Java applications.
  • Experience using monitoring and logging solutions such as Prometheus, Grafana, and Splunk.
  • Experience with AWS and Hadoop.

Responsibilities

  • Ensure 24x7 availability and stability of Druid and supporting platforms
  • Perform cluster operations (start, stop, restart, rolling upgrades)
  • Perform capacity planning and infrastructure scaling
  • Perform performance tuning and resource optimization
  • Participate in incident management, root cause analysis (RCA), and problem management
  • Prepare and maintain SOPs, runbooks, and operational documentation
  • Support change management, patching, upgrades, and security compliance
  • Develop and maintain code and documentation to solve critical challenges within some of the world's largest systems, and improve the entire service lifecycle from design to decommissioning

Benefits

  • Medical/Dental/Vision/Life Insurance
  • Long-term/Short-term Disability
  • Health and Dependent Care Reimbursement Accounts
  • Insurance (Accident, Critical Illness, Hospital Indemnity, Legal)
  • 401(k) plan and contributions dependent on salary level
  • Paid holidays plus Paid Time Off