Senior Site Reliability Engineer

Entrust•Shakopee, MN

24d•$134,477 - $197,232•Onsite

About The Position

At Entrust, we’re shaping the future of identity-centric security solutions. From our comprehensive portfolio of solutions to our flexible, global workplace, we empower careers, foster collaboration, and build solutions that help keep the world moving safely.

Get to Know Us

Headquartered in Minnesota, Entrust is an industry leader in identity-centric security solutions, serving over 150 countries with cutting-edge, scalable technologies. But our secret weapon? Our people. It’s the curiosity, dedication, and innovation that drive our success and help us anticipate the future.

Position Overview

The Instant Financial Issuance as a Service (IFIaaS) Cloud Service includes a wide array of components, including web services, application servers, and databases hosted in an on-prem environment. The Sr. Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, and performant, as well as scalable, secure, and cost-effective. Ultimately, the individual will be responsible for platform uptime; functional management of all IFIaaS cloud environments, applications, and networks; project scoping; and the resolution of application and network issues.

Requirements

Bachelor’s degree in Computer Science, Software Engineering, or an equivalent combination of education and experience.
5+ years of related experience as a Software Engineer, DevOps Engineer, Site Reliability Engineer, or in a similar role.
Extensive experience working with enterprise-level microservices applications, including deployment and maintenance of the applications in distributed environments.
Demonstrated hands-on experience and expertise with DevOps tooling (Ansible, Terraform, Jenkins, Octopus Deploy, etc.), networks, and network security, plus high-level managerial skills.
In-depth hands-on experience with on-prem and cloud compute, storage, and networking solutions (VMware, NetApp, Azure, AWS, etc.).

Responsibilities

Own SLOs/SLIs for availability (99.9%), latency, error rate, and quality of service across microservices.
Design/operate end‑to‑end observability: metrics, logs, traces, synthetic checks, real‑user monitoring (RUM).
Instrument services (Windows services, APIs, background jobs) with structured logs and trace context.
Build health probes and SLA monitors for critical transactions and cross-service dependencies.
Monitor system issues using metrics such as uptime, latency, error rate, throughput, and availability.
Deploy and maintain monitoring and on-call tools such as Splunk On-Call, Prometheus, and Datadog.
Lead incident response (triage, comms, coordination, real-time mitigation) and conduct blameless postmortems with actionable follow-ups.
Maintain and continuously improve runbooks, escalation paths, on-call rotations, and paging policies.
Implement MTTA/MTTR reduction programs.
Stand up war room protocols and ensure stakeholder updates during incidents.
Forecast compute, storage, and network needs, and track headroom against growth and peak patterns.
Conduct performance profiling and bottleneck analyses (CPU, memory, I/O, thread pools, connection pools).
Optimize resource allocation on VMware (DRS, affinity rules, reservations) and Windows VM tuning (kernel, TCP stack, NICs).
Validate scaling strategies (horizontal vs. vertical) and implement auto-scaling where supported.
Standardize gold images, configuration baselines, and desired state for Windows Server (PowerShell DSC or equivalent).
Manage patching (OS, middleware, runtime) with maintenance windows aligned to error budgets.
Ensure backup, snapshot, and restore strategies meet RPO/RTO; regularly test restores.
Maintain secure baselines (CIS benchmarks for Windows/VMware), vulnerability management, and patch cadence.
Support compliance audits (PCI-CP, PCI-DSS, SOC 2/ISO 27001), produce evidence (configs, logs, access reviews), and remediate gaps.
Automate provisioning (VM templates, DSC/Ansible for Windows, Terraform for VMware) and configuration drift detection/correction.
Build runbooks to reduce toil (deploy, scale, rollback, etc.).
Create reliability guardrails (pre‑flight checks, change freeze rules, policy controls) as code.
Continuously refactor scripts/runbooks into idempotent automation.
Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors, and implement risk mitigation strategies such as patching, backup, redundancy, encryption, or testing.
Collaborate with product teams and other teams to understand user needs, expectations, and satisfaction.
Coach engineers on SRE principles, incident handling, and reliability-centric design.
Lead knowledge sharing, runbook quality, and postmortem culture (blameless, action-oriented).
Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.

Benefits

In addition to your pay, Entrust offers eligible colleagues and their dependents comprehensive health and well-being programs which include medical, vision, dental, a generous 401(k) matching contribution, life and disability insurance, mental health coaching, virtual fitness programs, paid personal time off plus 12 paid holidays, parental leave and education reimbursement.
