Senior Site Reliability Engineer

DKatalis

Posted: August 28, 2023

Job Commitment: Full-time
Experience Level: Senior
Workplace Type: Onsite
Job Function: Dev & Engineering

This job is closed

About the position

The Site Reliability Engineer role at DKatalis is responsible for maintaining the digital platform that serves Bank Jago's systems and services. This includes optimizing Kubernetes, ensuring system uptime, debugging production issues, and automating recurring tasks. The SRE team improves and upholds the reliability of the digital platform and collaborates with software squads. The ideal candidate has a strong software engineering background and a passion for system reliability and application performance. They will also work cross-functionally with other engineering and product teams to ensure expected system behavior and drive continuous improvement.

Responsibilities

Participate in SRE software engineering, writing code to keep reducing human intervention in operational tasks and to automate processes.
Balance doing things right with fixing things quickly. Stay flexible and pragmatic while working towards the long-term health of the system.
Apply strong systems experience and good coding practice.
Take an analytical approach to identifying problem components based on data points, with reliability of systems and applications as your core passion.
Analyze systems based on data points to identify workloads that are critical to the business.
Work cross-functionally to ensure the success of the system's operation. Collaborate with other engineering and product teams to ensure that expected system behavior is understood and that monitoring exists to detect anomalies.
Lead in-depth technical and data analysis to gauge service trends and drive improvements.
Take on-call responsibility and manage crises with the broader team, communicating progress and challenges throughout.
Participate in continuous improvement, and in timely, high-quality root cause analysis of major incidents and blameless post-mortems, to avoid similar problems in the future.
Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
Proactively communicate reliability, stability, and efficiency results, service health, and key reliability risks and issues to senior business and technology stakeholders so that activity can be prioritized.

Requirements

Strong software engineering background
Experience in SRE software engineering and automation of operational tasks
Ability to balance speed and quality in fixing issues
Strong systems experience with good coding practice
Analytical approach to problem identification based on data points
Passion for reliability of systems and applications
Ability to analyze systems based on data points to identify critical workloads
Comfortable working cross-functionally with other engineering and product teams
Ability to lead technical and data analysis to drive improvements
Comfortable with on-call responsibility and crisis management
Experience in incident root cause analysis and post-mortem activities
Prioritization of reliability features and contribution to tooling and automated responses
Strong communication skills to report reliability, stability, and efficiency results to stakeholders

Skills and Qualifications

On-call responsibility
Crisis management
Continuous improvement
Quality and timely major incident root cause analysis
Blameless post-mortem activities
Prioritization of reliability features
Design, development, and delivery of effective tooling, alerts, and automated responses
Proactive technical communication of reliability, stability, and efficiency results
Service health monitoring via dashboards
Technical communication with senior business and technology stakeholders
Software engineering skills
Linux systems, networking, monitoring, and automation
Experience in using software engineering to solve operational problems
Automation of API-driven tasks at scale
Experience in automating the build and deployment of software products
Excellent communication skills
Incident status communication
8+ years of experience in software development and/or SRE functions
Degree in Computer Science, Engineering, or equivalent experience
Experience and advanced understanding of Observability, CI/CD, and release management
Knowledge of OS platforms (Linux/UNIX), networking, web systems, and DevOps
Experience with large-scale distributed systems and microservices architecture
Strong organizational skills
Ability to manage multiple tasks simultaneously
Ability to work in a complex, fast-paced environment
Ability to maintain calm during stressful situations
