Site Reliability Engineer

Geotab•Atlanta, GA

17d•Hybrid

About The Position

As part of the Site Reliability Engineering team, you are responsible for ensuring the availability, reliability, and performance of Geotab's core products for our customers. This role acts as a primary escalation point, diagnosing and resolving complex application issues that affect the availability and performance of multiple large-scale applications supporting thousands of customers globally. SRE supports production applications and infrastructure, focusing on restoring normal service operations efficiently and contributing to long-term system stability.

Requirements

3 to 5 years of experience in SRE, DevOps, or Tier 3 roles.
Strong troubleshooting skills with a systematic problem-solving approach.
Extensive experience resolving critical incidents in production environments.
Strong proficiency in Linux and operational scripting (Bash, PowerShell, Python).
Experience with database/dataset querying (GoogleSQL, PostgreSQL, big data), automated configuration management (via tools like Ansible), and GitOps tools (e.g., Argo CD).
Experience with data visualization platforms (e.g., Apache Superset/BigQuery Visualizations).
Familiarity with cloud platforms (GCP/Azure/AWS), container orchestration (Kubernetes), and monitoring/alerting systems (e.g., Prometheus stack including AlertManager/Grafana).
Understanding of application environments (e.g., .NET/C#) for troubleshooting purposes.
Demonstrated ability to work well under pressure and manage multiple tasks and projects simultaneously.
Experience with incident management processes.
Excellent verbal and written communication skills.
Strong analytical skills with the ability to solve problems and make well-judged decisions.
Strong team player with the ability to engage with all levels of the organization.
Technical competence using software programs, including but not limited to Google Suite for business (Sheets, Docs, Slides) or equivalents.
Entrepreneurial mindset and comfort working in a flat organization.
To be eligible, candidates must have continuously resided in the continental United States for at least three years immediately preceding their application. Successful applicants will be required to provide verifiable documentation of continuous lawful residency. Some exceptions may apply to US citizens.
Ability to pass an enhanced background check, including a drug screening test (if applicable) and a credit check.

Nice To Haves

Understanding of fundamental networking concepts (TCP/IP, HTTP, DNS, load balancing) is considered an asset.
Experience working within a technical or engineering organization with knowledge of the high-technology industry is considered an asset.
Familiarity with applying AI-powered tools to enhance operational efficiency in areas such as log analysis, troubleshooting assistance, incident summarization, and automation scripting.

Responsibilities

Act as a primary escalation point for critical production application/product issues.
Rapidly troubleshoot complex problems across the application stack, utilizing observability tools to identify root causes.
Coordinate effectively with development, infrastructure, and other technical teams during incidents to implement fixes and restore service swiftly.
Clearly communicate incident status, impact, and resolution steps to internal stakeholders.
Collaborate with team members to improve monitoring tools, dashboards, and alerting mechanisms for proactive detection of issues impacting Critical User Journeys (CUJs) within the application/product and computing architecture. Our complex environment encompasses monolithic applications, microservices, and a vast ecosystem of millions of hardware units.
Monitor application/product and system health proactively using a combination of tools to ensure high availability and adherence to Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Identify opportunities and implement automation tools/scripts to streamline routine operational tasks, reduce manual effort (toil), and improve response times.
Conduct system tests to validate performance, reliability, and successful remediation of issues.
Recommend design and process enhancements based on operational experience to improve overall application reliability and maintainability.
Participate in post-major-incident reviews (PMIRs) to analyze disruptions, document findings, track corrective actions to prevent recurrence, and identify areas of improvement for incident response processes.
Contribute to building a culture of learning from incidents.
Participate in a 24x7 on-call rotation to provide timely support for critical issues outside of business hours.