Site Reliability Operations Engineer III

Pennymac, Westlake Village, CA
$75,000 - $130,000

About The Position

Pennymac is a specialty financial services firm with a comprehensive mortgage platform and integrated business focused on the production and servicing of U.S. mortgage loans and the management of investments related to the U.S. mortgage market. At Pennymac, our people are the foundation of our success and at the heart of our dynamic work culture. Together, we work toward a unified goal of helping millions of Americans achieve their aspirations of homeownership through the complete mortgage journey.

As a member of the Site Reliability Operations (SRO) team, you will help provide 24/7 monitoring and support of the company’s IT Infrastructure. Ideal candidates have experience in Windows and Linux administration as well as in AWS, since Pennymac has migrated almost completely to the AWS cloud. Individuals in this role should be comfortable working in a fast-paced environment. Multitasking and communicating quickly and accurately are critical to success in this role.

Requirements

  • Bachelor’s Degree in Computer Science or comparable experience.
  • 3–5+ years of experience working in both Windows and Linux environments.
  • Proven proficiency in monitoring and alerting tools such as Nagios, New Relic, Sumo Logic, and AWS CloudWatch.
  • Strong scripting or programming skills in PowerShell, Python, or a similar language.
  • Excellent organizational skills to manage competing priorities and urgent issues.
  • Strong written and verbal communication skills; able to explain complex technical issues to stakeholders.
  • Comfortable completing annual role-based training and certification assignments.
  • Demonstrated ability to work independently on complex tasks and collaborate effectively with cross-functional teams.

Nice To Haves

  • Advanced AWS Certifications strongly preferred.

Responsibilities

  • Oversee 24/7 health monitoring of the company’s IT Infrastructure using tools such as AWS CloudWatch and New Relic.
  • Contribute to the ongoing refinement of alerts and implement advanced alerting rules and thresholds.
  • Serve as an escalation point for complex incidents and collaborate with various teams for timely resolution.
  • Perform and troubleshoot a wide range of administrative tasks across Windows and Linux environments.
  • Handle complex tasks associated with maintaining and troubleshooting the company’s virtual infrastructure.
  • Tackle advanced technical issues escalated from Engineer I/II and conduct deep dives into logs.
  • Act as a liaison between internal teams and external vendors for high-priority incidents.
  • Strictly follow and help refine the company’s established Change Management processes.
  • Monitor and respond to incoming calls, chats, and emails directed to the SRO team.
  • Lead by example in managing multiple ticket queues and take ownership of priority tickets.
  • Maintain and expand the SRO team’s knowledge base and author new Standard Operating Procedures (SOPs).
  • Coordinate and execute application and website code deployments using CI/CD tools.
  • Oversee backup tasks and ensure data retention meets corporate and regulatory requirements.
  • Drive or co-lead medium to large-scale projects related to infrastructure improvements.
  • Provide guidance to Engineer I and II staff on advanced troubleshooting methods.

Benefits

  • Comprehensive Medical, Dental, and Vision
  • Paid Time Off Programs including vacation, holidays, illness, and parental leave
  • Wellness Programs, Employee Recognition Programs, and onsite gyms and cafe-style dining (select locations)
  • Retirement benefits, life insurance, 401(k) match, and tuition reimbursement
  • Philanthropy Programs including matching gifts, volunteer grants, charitable grants, and corporate sponsorships
  • Competitive salary with potential bonus opportunities