Site Reliability Engineer 3

Granicus LACPR

About The Position

Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our SRE team. As a Senior SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts in building and maintaining a robust infrastructure, automating processes, and guiding the team to implement best practices in site reliability.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
  • 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
  • Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud).
  • Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
  • Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines.
  • Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
  • Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
  • Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
  • Relevant certifications, such as AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer, or similar.
  • In-depth understanding of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
  • Experience with database management (SQL, NoSQL), load balancing, and distributed systems.

Responsibilities

  • Provide production support in shifts, following the team's on-call roster.
  • When not on call, work on tickets raised by customers and by internal engineering and implementation teams.
  • Work on items in the SRE team's backlog.
  • Continuously monitor the health and performance of our services, systems, and infrastructure.
  • Respond to alerts and incidents promptly to ensure high availability.
  • Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
  • Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
  • Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
  • Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
  • Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
  • Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
  • Implement and adhere to security best practices to protect our systems and data.

Benefits

  • Flexible Time Off – Take the time you need to rest, recharge, and live your life.
  • Company-Wide Wellbeing Days – Paid days off to unplug and focus on your mental health.
  • Work From Home Reimbursement – Support a productive home office environment.
  • Private health, vision, dental and life insurance – 100% Employer-Paid. Comprehensive coverage for you and your family.
  • On-Demand Mental Health Support – Access to Headspace and other wellness tools.
  • Pension Plan and Retirement options.
  • Online Learning Platforms – Fuel your professional development.
  • Competitive Salary & Bonuses – Your contributions are valued and rewarded.