About The Position

Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our SRE team. As a Senior SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts in building and maintaining a robust infrastructure, automating processes, and guiding the team to implement best practices in site reliability.

Requirements

  • 5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
  • Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock, is highly desirable.
  • Expertise in Linux/Unix systems and cloud platforms (AWS, Azure, or Google Cloud).
  • Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
  • Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.
  • Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability.
  • Experience with configuration management tools (Ansible, Chef, Puppet).
  • Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks.
  • Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar are a plus.

Responsibilities

  • Provide production support in shifts according to the team's on-call roster.
  • Work on tickets raised by customers and by internal engineering/implementation teams while not on call for production support.
  • Work on the SRE team's backlog items.
  • Continuously monitor the health and performance of our services, systems, and infrastructure.
  • Respond to alerts and incidents promptly to ensure high availability.
  • Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
  • Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
  • Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
  • Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
  • Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
  • Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
  • Implement and adhere to security best practices to protect our systems and data.

Benefits

  • Flexible Time Off – Take the time you need to rest, recharge, and live your life.
  • Company-Wide Wellbeing Days – Paid days off to unplug and focus on your mental health.
  • Work From Home Reimbursement – Support a productive home office environment.
  • Multiple Health Plan Options – Including a 100% employer-paid plan.
  • Employer HSA Contributions – When enrolled in a High-Deductible Health Plan.
  • Fitness Reimbursement Program – Stay active, your way.
  • On-Demand Mental Health Support – Access to Headspace and other wellness tools.
  • Paid Parental Leave – For both birthing and non-birthing parents.
  • Traditional & Roth 401(k) – With a generous company match.
  • Life & AD&D Insurance – 100% employer-paid coverage for peace of mind.
  • Online Learning Platforms – Fuel your professional development.
  • Competitive Salary & Bonuses – Your contributions are valued and rewarded.