Lead Site Reliability Engineer

CardWorks•Pittsburgh, PA

49d•$146,032 - $162,257•Hybrid

About The Position

Join our team and take the next step toward a fulfilling career!

What We Do

At CardWorks, we aim to help people connect with possibility and opportunity using our financial servicing expertise. Building meaningful, long-term relationships with consumers, our employees, and our clients is what matters most.

Who We Are

CardWorks, Inc. is a diversified consumer finance service provider and parent company of CardWorks Servicing, LLC, Merrick Bank, and Carson Smithfield, LLC.

CardWorks Servicing, LLC provides end-to-end operational servicing functions for credit cards, secured cards, and installment loans. We service consumer and small business loans across the credit spectrum and offer backup servicing and due diligence services to capital providers and trustees.

Merrick Bank is an FDIC-insured Utah Industrial Loan Bank. Merrick operates three main business lines: credit cards, recreational lending, and merchant services.

Carson Smithfield, LLC provides a variety of post-charge-off debt recovery services, including digital self-service, IVR, live agent, and external agency management.

Requirements

Experience in Site Reliability Engineering with a track record of delivering measurable improvements in uptime, scalability, release stability, and overall reliability in complex enterprise environments.
Demonstrated experience standing up or significantly maturing an SRE practice (operating model, SRE/service engagement, production readiness, incident/postmortem program, and reliability roadmap).
Hands-on experience applying AI/ML to operations (AIOps) or GenAI in production support workflows, with a focus on measurable outcomes (MTTD/MTTR, alert fatigue reduction, change failure rate) and responsible use controls.
Proven ability to establish Service Level Indicators (SLIs) and SLOs in production environments, including hands-on definition and implementation.
Demonstrated background in production incident response, leading resolution efforts, conducting blameless post-incident reviews, and implementing actionable remediation strategies.
Strong observability and telemetry expertise in designing instrumentation, building actionable dashboards and alerts, and delivering proactive reliability insights using metrics, logs, and traces.
Infrastructure engineering experience with strong Infrastructure as Code skills using tools such as Terraform and Ansible.
Thorough understanding and practical experience in CI/CD pipeline design, optimization, and troubleshooting using modern tooling and platforms such as Azure DevOps, GitHub Actions, Jenkins, or GitLab CI, with an emphasis on speed, reliability, and security.
Practical knowledge of containerization and platform modernization, including architecting and operating containerized workloads with Docker, VMware, and Kubernetes (or comparable orchestration platforms) to modernize legacy applications and improve fault tolerance.
Knowledge of emerging reliability practices, including SLO automation platforms, AIOps, or predictive operations to advance proactive reliability management.
Master’s degree in Computer Science, Engineering, or equivalent practical experience designing and operating production systems at scale.
7+ years of experience in Site Reliability Engineering.

Nice To Haves

Preferred certifications include AWS Professional, Terraform, Ansible, Azure DevOps, Octopus Deploy, or other automation-focused credentials that demonstrate continuous technical development.

Responsibilities

Establish the SRE operating model (service onboarding, engagement model, governance, reliability reviews, production readiness standards, and quarterly planning) and ensure it is adopted across teams.
Identify, pilot, and operationalize AI-enabled reliability use cases (e.g., alert noise reduction, incident summarization, correlation/root-cause hypothesis generation, runbook assistance, and auto-remediation with human approval) with appropriate guardrails.
Define, implement, and operationalize reliability metrics by establishing and managing SLIs, SLOs, and error budgets to quantify and continuously improve service reliability, supporting engineering and business decisions.
Own the centralized SRE service engagement model by defining service tiers, onboarding criteria, reliability standards, and a transparent intake/prioritization process aligned to business criticality.
Define and enforce error budget policies (including escalation paths and release risk decisions) in partnership with Product/Engineering, using SLO attainment to guide trade-offs between feature velocity and reliability.
Establish and maintain centralized “paved road” reliability standards and assets (instrumentation conventions, golden signals, alerting standards, runbook templates, SLO dashboards) that product teams can adopt with minimal friction.
Design the on-call and escalation model for a centralized SRE team (e.g., SRE overlay for major incidents, defined handoffs with service owners, and clear ownership boundaries) to improve response quality without creating single-team dependency.
Design and engineer automation and observability solutions by developing tooling, dashboards, and systems to reduce operational toil (measure, report, and drive toil down over time), enhance system visibility, and accelerate delivery.
Participate in incident and problem management by serving as incident coordinator for high-severity events, driving cross-functional responses, conducting blameless root cause analysis, running post-incident reviews (postmortems) with clear owners and due dates, and ensuring remedial actions drive reliability improvements.
Oversee operational readiness and performance by managing capacity planning, validating disaster recovery, conducting production readiness reviews, and ensuring systems meet availability, scalability, and recovery expectations.
Partner with security, risk, and compliance teams to align reliability goals with governance and compliance requirements, ensuring secure, auditable, and well-documented practices.
Collaborate across the organization by working closely with end users, product management, development, architecture, and IT Operations teams to embed reliability principles throughout the software development lifecycle, including service onboarding, reliability reviews, and shared SLO ownership.
Champion reliability as a core product feature by promoting reliability throughout all phases of development, advocating for continuous improvement, and communicating key metrics and potential customer impact to stakeholders.
Train, mentor, and upskill engineering teams by coaching engineers in SRE practices, supporting junior team members, and fostering a culture of shared ownership and accountability for reliability, including influencing teams without direct authority through standards, data, and executive-aligned priorities.
Remain current on the latest SRE trends and best practices, including observability, AI-enabled operations (AIOps), and SLO management, and implement these methodologies to effectively support desired business outcomes.
Evaluate AI tools for reliability with security/privacy/compliance guardrails (e.g., data handling, prompt/content controls, auditability) and measure impact.
Participate in on-call rotations and operational support for SRE-supported systems and products.

Benefits

Competitive Pay, including a Bonus Target or Variable Pay Incentive Program
Benefits Package: Medical, Dental, and Vision (plus much more)
401(k) Plan with Company Match
Short- & Long-Term Disability
Wellness Programs
Group Life and AD&D Insurance
Paid Vacation, Sick Days, and Bank Holidays
Employee Engagement Activities including Employee Appreciation Day, DEI Employee Resource Groups, Corporate Social Responsibility, Service Recognition
We offer a total rewards package composed of a competitive base rate of pay, variable pay incentive programs based on the role, and a comprehensive benefit suite.
