Senior Site Reliability Engineer

DKatalis

Posted: July 20, 2023

Other

Job Commitment: Full-time

Experience Level: Senior

Workplace Type: Other

Job Function: Dev & Engineering

This job is closed

About the position

The Site Reliability Engineer role at DKatalis is responsible for maintaining the digital platform that serves Bank Jago's systems and services. This includes optimizing Kubernetes, ensuring system uptime, debugging production issues, and automating recurring tasks. The SRE team aims to improve and uphold the reliability of the digital platform, collaborating with software squads to build a reliable system. The ideal candidate has a strong software engineering background and a passion for system reliability and application performance.

Responsibilities

Participate in SRE software engineering: write code that steadily reduces human intervention in operational tasks and automates processes.
Balance doing things right with fixing things quickly; stay flexible and pragmatic while working toward the long-term health of the system.
Bring strong systems experience and good coding practices.
Take an analytical approach to identifying problem components based on data points, with the reliability of systems and applications as your core passion.
Analyze systems based on data points to identify the workloads that are critical to the business.
Work cross-functionally to ensure the system operates successfully, collaborating with other engineering and product teams so that expected system behavior is understood and monitoring exists to detect anomalies.
Lead in-depth technical and data analysis to gauge service trends and drive improvements.
Take on-call responsibility and manage crises together with the broader team, communicating progress and challenges throughout.
Participate in continuous improvement, and carry out high-quality, timely root cause analysis and blameless post-mortems for major incidents to prevent similar problems in the future.
Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
Proactively communicate reliability, stability, and efficiency results, service health, and key reliability risks and issues to senior business and technology stakeholders so that work can be prioritized.

Requirements

Strong software engineering background
Experience in SRE software engineering and automation of operational tasks
Ability to balance speed and quality in fixing issues
Analytical approach to problem-solving and identifying critical components
Comfortable working cross-functionally with other teams
Proficient in technical and data analysis
Willingness to take on-call responsibilities and manage crises
Commitment to continuous improvement and incident analysis
Ability to contribute to the design and development of reliability features and tooling
Effective communication skills to report reliability and efficiency results to stakeholders

Skills and qualifications

On-call responsibility
Crisis management
Continuous improvement
Quality and timely major incident root cause analysis
Blameless post-mortem activities
Prioritization of reliability features
Design, development, and delivery of effective tooling, alerts, and automated responses
Proactive technical communication of reliability, stability, and efficiency results
Communication of service health and key reliability risks to senior stakeholders
Real interest and experience in Linux systems, networking, monitoring, and automation
Software engineering skills to solve operational problems
Automation of API-driven tasks at scale
Experience in building and deploying software products in distributed systems
Excellent communication skills (verbal and written)
Ability to communicate incident status in business-friendly language
8+ years of experience in software development and/or SRE functions
Degree in Computer Science, Engineering, or equivalent experience
Experience and advanced understanding of Observability, CI/CD, and release management
Knowledge of OS platforms, networking, web systems, and DevOps
Experience with large-scale distributed systems and microservices architecture
Strong organizational skills
Ability to manage multiple tasks simultaneously
Ability to work in a complex, fast-paced environment
Ability to maintain calm during stressful situations
