Senior Site Reliability Engineer

Guild Mortgage

1d • $94,882 - $136,096 • Onsite

About The Position

The Senior Site Reliability Engineer is responsible for executing the organizational reliability strategy and participating in resiliency design reviews to ensure the reliability, scalability, and performance of our company's software systems and applications meet organizational service level objectives (SLOs) and error budgets. The role is responsible for designing, implementing, and maintaining the infrastructure and tools necessary to support our platforms, as well as improving our monitoring, automation, and deployment processes. This role involves strategic planning, technical leadership, and collaboration with various stakeholders including Guild’s Product Delivery, Data Services, DevOps, DataOps, and Infrastructure teams to support organizational goals.

Requirements

Bachelor’s degree directly related to the position, or equivalent, preferred.
A combination of education and experience may be considered in lieu of the Bachelor’s degree.
Minimum five years’ experience.
Collaborate with stakeholders to define recovery point objectives (RPO) and recovery time objectives (RTO) for Guild’s system footprint.
Expert in Cloud-based redundancy, high availability, and reliability strategies.
Expert in reliability, scalability, and performance optimization.
Expert in Linux/Unix and Windows systems administration, including provisioning, configuring, monitoring, and troubleshooting web servers in a 24x7 customer-facing environment.
Strong Linux and Windows administration and scripting skills.
Solid database administration skills (MySQL, MariaDB, RDS, SQL Server, and Azure Storage services).
Deep knowledge of current methodologies in high-performance operations and scalable multi-site implementations.
Proven experience with large-scale software implementation (high transaction volume, high-availability concepts).
Deep knowledge of software deployment, versioning (Git), and release management processes.
Experienced with infrastructure design, implementation, and support.
Proficient at automated provisioning, automated configuration management, and containerization solutions and tools.
Experienced with cloud-based hosting solutions and server environments (AWS, Azure, or Google Cloud).
Experienced in Agile software development best practices, using continuous integration and delivery (CI/CD) pipelines as well as Agile tools such as Jira.
Excellent written and verbal communication skills.
Proficient in communicating to both technical and management levels.
Ability to interact with external customers and staff members.
Highly adaptable.
Ability to work in a fast-paced, constantly expanding environment.
Highly organized and detail-oriented; ability to work in a fast-paced, metrics-driven environment required.
Proficiency in Microsoft Office Suite (Word, Excel), wikis, collaborative cloud-based programs, and third-party software applications required.
Commitment to company values:
Customer Service - Proactive attention to each person.
Integrity - Do and say what's right.
Respect - Treat others with dignity.
Collaboration - Listen and work together.
Learning - Seek knowledge and strive for improvement.
Excellence - Deliver the unexpected.

Responsibilities

Participate in resiliency design reviews and lead complex problem-solving efforts.
Design, implement, and maintain monitoring systems to track the performance, availability, and reliability of services.
Respond to incidents promptly, investigate root causes, and coordinate efforts to mitigate and resolve them.
Analyze performance data and plan for scalability and capacity requirements.
Identify and optimize performance bottlenecks, both at the infrastructure and application levels.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Implement and enforce change management practices to ensure safe and controlled changes to the production environment.
Design and implement fault-tolerant systems and practices to minimize downtime and ensure service availability.
Collaborate with the GRC team to develop and maintain disaster recovery plans and procedures for the supported software, minimizing the impact of catastrophic failures.
Work with Incident Management and other teams to analyze incidents thoroughly, document postmortem reports, and implement improvements based on lessons learned.
Work closely with development, operations, and other teams to foster a culture of reliability, and provide feedback on system design and architecture for improved reliability.