Senior Site Reliability Engineer
DKatalis · Posted: August 28, 2023 · Onsite
About the position
The Site Reliability Engineer role at DKatalis is responsible for maintaining the digital platform that serves Bank Jago's systems and services. This includes optimizing Kubernetes, ensuring system uptime, debugging production issues, and automating recurring tasks. The SRE team works to improve and uphold the reliability of the digital platform and collaborates with software squads. The ideal candidate has a strong software engineering background and a passion for system reliability and application performance. They will also work cross-functionally with other engineering and product teams to ensure expected system behavior and drive continuous improvement.
Responsibilities
- Participate in SRE software engineering: write code that automates processes and steadily reduces human intervention in operational tasks.
- Balance doing things right with fixing things quickly: stay flexible and pragmatic while working toward the long-term health of the system.
- Apply strong systems experience and good coding practice.
- Take an analytical, data-driven approach to identifying problem components, with the reliability of systems and applications as your core passion.
- Analyze systems based on data points to identify the workloads that are critical to the business.
- Work cross-functionally to ensure the system operates successfully: collaborate with other engineering and product teams so that expected system behavior is understood and monitoring exists to detect anomalies.
- Lead in-depth technical and data analysis to gauge service trends and drive improvements.
- Take on on-call responsibility and manage crises with the broader team, communicating progress and challenges as the crisis unfolds.
- Participate in continuous improvement, and carry out thorough, timely root cause analysis and blameless post-mortems for major incidents so that similar problems do not recur.
- Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
- Contribute to proactive technical communication of reliability, stability, and efficiency results, service health, and key reliability risks and issues to senior business and technology stakeholders, so that work can be prioritized.
Requirements
- Strong software engineering background
- Experience in SRE software engineering and automation of operational tasks
- Ability to balance speed and quality in fixing issues
- Strong systems experience with good coding practice
- Analytical approach to problem identification based on data points
- Passion for reliability of systems and applications
- Ability to analyze systems based on data points to identify critical workloads
- Comfortable working cross-functionally with other engineering and product teams
- Ability to lead technical and data analysis to drive improvements
- Comfortable with on-call responsibility and crisis management
- Experience in incident root cause analysis and post-mortem activities
- Ability to prioritize reliability features and contribute to tooling, alerts, and automated responses
- Strong communication skills to report reliability, stability, and efficiency results to stakeholders
Skills and qualifications
- On-call responsibility
- Crisis management
- Continuous improvement
- Quality and timely major incident root cause analysis
- Blameless post-mortem activities
- Prioritization of reliability features
- Design, development, and delivery of effective tooling, alerts, and automated responses
- Proactive technical communication of reliability, stability, and efficiency results
- Service health monitoring via dashboards
- Technical communication with senior business and technology stakeholders
- Software engineering skills
- Linux systems, networking, monitoring, and automation
- Experience in using software engineering to solve operational problems
- Automation of API-driven tasks at scale
- Experience in automating the build and deployment of software products
- Excellent communication skills
- Incident status communication
- 8+ years of experience in software development and/or SRE functions
- Degree in Computer Science, Engineering, or equivalent experience
- Experience with and advanced understanding of observability, CI/CD, and release management
- Knowledge of OS platforms (Linux/UNIX), networking, web systems, and DevOps
- Experience with large-scale distributed systems and microservices architecture
- Strong organizational skills
- Ability to manage multiple tasks simultaneously
- Ability to work in a complex, fast-paced environment
- Ability to maintain calm during stressful situations