Senior Site Reliability Engineering
DKatalis · Posted: July 20, 2023 · Other
About the position
The Site Reliability Engineer role at DKatalis is responsible for maintaining the digital platform that serves Bank Jago's systems and services. This includes optimizing Kubernetes, ensuring system uptime, debugging production issues, and automating recurring tasks. The SRE team works to improve and uphold the reliability of the platform, collaborating with software squads to build reliable systems. The ideal candidate has a strong software engineering background and a passion for system reliability and application performance.
Responsibilities
- Participate in SRE software engineering, writing code that continually reduces human intervention in operational tasks and automates processes.
- Balance doing things right with fixing things quickly, staying flexible and pragmatic while working towards the long-term health of the system.
- Apply strong systems experience and good coding practices.
- Take an analytical approach to identifying problem components based on data points, with the reliability of systems and applications as your core passion.
- Analyze systems based on data points to identify workloads that are critical to the business.
- Work cross-functionally to ensure the system operates successfully, collaborating with other engineering and product teams so that expected system behavior is understood and monitoring exists to detect anomalies.
- Lead in-depth technical and data analysis to gauge service trends and drive improvements.
- Take on-call responsibility and manage crises together with the broader team, communicating progress and challenges throughout.
- Participate in continuous improvement and carry out high-quality, timely root cause analysis of major incidents and blameless post-mortems to prevent similar problems in the future.
- Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses that identify and address reliability risks.
- Contribute to proactive technical communication of reliability, stability, and efficiency results, service health, and key reliability risks and issues to senior business and technology stakeholders, so that activity can be prioritized.
Requirements
- Strong software engineering background
- Experience in SRE software engineering and automation of operational tasks
- Ability to balance speed and quality in fixing issues
- Analytical approach to problem-solving and identifying critical components
- Comfortable working cross-functionally with other teams
- Proficient in technical and data analysis
- Willingness to take on-call responsibilities and manage crises
- Commitment to continuous improvement and incident analysis
- Ability to contribute to the design and development of reliability features and tooling
- Effective communication skills to report reliability and efficiency results to stakeholders
Qualifications
- On-call responsibility
- Crisis management
- Continuous improvement
- Quality and timely major incident root cause analysis
- Blameless post-mortem activities
- Prioritization of reliability features
- Design, development, and delivery of effective tooling, alerts, and automated responses
- Proactive technical communication of reliability, stability, and efficiency results
- Communication of service health and key reliability risks to senior stakeholders
- Genuine interest in and experience with Linux systems, networking, monitoring, and automation
- Software engineering skills to solve operational problems
- Automation of API-driven tasks at scale
- Experience in building and deploying software products in distributed systems
- Excellent communication skills (verbal and written)
- Ability to communicate incident status in business-friendly language
- 8+ years of experience in software development and/or SRE functions
- Degree in Computer Science, Engineering, or equivalent experience
- Experience with and advanced understanding of observability, CI/CD, and release management
- Knowledge of OS platforms, networking, web systems, and DevOps
- Experience with large-scale distributed systems and microservices architecture
- Strong organizational skills
- Ability to manage multiple tasks simultaneously
- Ability to work in a complex, fast-paced environment
- Ability to maintain calm during stressful situations