Site Reliability Operations Engineer III

PennyMac | Westlake Village, CA
65d | $75,000 - $130,000 | Onsite

About The Position

PENNYMAC

Pennymac (NYSE: PFSI) is a specialty financial services firm with a comprehensive mortgage platform and integrated business focused on the production and servicing of U.S. mortgage loans and the management of investments related to the U.S. mortgage market. At Pennymac, our people are the foundation of our success and at the heart of our dynamic work culture. Together, we work towards a unified goal of helping millions of Americans achieve aspirations of homeownership through the complete mortgage journey.

A Typical Day

As a member of the Site Reliability Operations (SRO) team, you will help provide 24/7 monitoring and support of the company's IT infrastructure. Ideal candidates should have experience in Windows and Linux administration, as well as experience working in AWS, since Pennymac has now migrated almost completely to the AWS cloud. Individuals in this role should be comfortable working in a fast-paced environment. Multitasking and communicating quickly and accurately are critical to success in this role. The duties of the Engineer III, Site Reliability Operations are listed under Responsibilities below.

Requirements

  • Bachelor's Degree in Computer Science or comparable experience.
  • 3-5+ years of experience working in both Windows and Linux environments, with demonstrated success in advanced troubleshooting and administration.
  • Proven proficiency in monitoring and alerting tools such as Nagios, New Relic, SumoLogic, AWS CloudWatch, and related technologies.
  • Strong scripting or programming skills in PowerShell, Python, or a similar language; ability to automate repetitive tasks and streamline operations.
  • Excellent organizational skills, with the ability to manage competing priorities and urgent issues in a fast-paced setting.
  • Strong written and verbal communication skills; able to explain complex technical issues to stakeholders at various technical levels.
  • Comfortable completing annual role-based training and certification assignments; dedicated to continual learning and development.
  • Demonstrated ability to work independently on complex tasks and to collaborate effectively with cross-functional teams.

Nice To Haves

  • Advanced AWS Certifications strongly preferred

Responsibilities

  • Monitoring - Oversee 24/7 health monitoring of the company's IT Infrastructure using tools such as AWS CloudWatch and New Relic. Identify gaps in monitoring coverage and propose enhancements
  • Alert Management - Contribute to the ongoing refinement of alerts. Implement advanced alerting rules and thresholds to proactively identify issues and reduce noise
  • Incident Management - Serve as an escalation point for complex incidents. Collaborate closely with the Incident Management team, application developers, internal support teams, and third-party vendors to ensure timely and accurate resolution of service disruptions
  • Advanced Systems Administration - Perform and troubleshoot a wide range of administrative tasks across Windows and Linux environments. Assist in optimizing system performance, conducting root-cause analyses, and implementing long-term fixes
  • Virtual Server and Desktop Management - Handle more complex tasks associated with maintaining and troubleshooting the company's virtual infrastructure. Provide guidance to junior engineers for routine issues
  • Technical Troubleshooting and Investigation - Tackle advanced technical issues that are escalated from Engineer I/II. Conduct deep dives into infrastructure and application logs to pinpoint underlying problems
  • Internal and External Escalation - Act as a liaison between multiple internal teams and external vendors for high-priority incidents. Ensure swift coordination and minimize downtime
  • Change Management - Strictly follow and help refine the company's established Change Management processes. Provide risk assessments and validation for proposed changes before approval
  • Communication - Monitor and respond to incoming calls, chats, and emails directed to the SRO team. Give stakeholders structured status updates while complex issues are being worked
  • Ticket Queue Management - Lead by example in managing multiple ticket queues (ServiceNow, JIRA, etc.). Take ownership of priority tickets and oversee distribution among the team
  • Documentation - Maintain and expand the SRO team's knowledge base. Author new Standard Operating Procedures (SOPs) that incorporate best practices gained from resolving advanced incidents
  • Deployments - Coordinate and execute application and website code deployments using Jenkins, GitLab, or other CI/CD tools. Help optimize deployment workflows to reduce errors and downtime
  • Data Backup and Compliance - Oversee backup tasks using CommVault, AWS Backup, and related tools. Ensure data retention meets or exceeds corporate and regulatory requirements
  • Project Management - Drive or co-lead medium to large-scale projects related to infrastructure improvements, migrations, or optimizations. Collaborate with stakeholders to define scope, timelines, and resource needs
  • Mentorship - Provide guidance to Engineer I and II staff on advanced troubleshooting methods, best practices in cloud administration, and effective incident response

Benefits

  • Comprehensive Medical, Dental, and Vision
  • Paid Time Off Programs including vacation, holidays, illness, and parental leave
  • Wellness Programs, Employee Recognition Programs, and onsite gyms and cafe-style dining (select locations)
  • Retirement benefits, life insurance, 401k match, and tuition reimbursement
  • Philanthropy Programs including matching gifts, volunteer grants, charitable grants and corporate sponsorships