Site Reliability Engineering Manager (Manager Digital Solutions)

IDEMIA

About The Position

IDEMIA Public Security, a division of IDEMIA Group, is the leading provider of secure and trusted biometric-based solutions, transforming public and private organizations across the globe. Our industry-enabled and client-specific solutions draw upon decades of expertise in biometrics to revolutionize the fields of public security, justice and public safety, travel and transport, identity, and access control. Built on privacy and trust, our market-leading iris, fingerprint, and facial recognition solutions top independent benchmarking for accuracy, fairness, and scalability. These exacting standards enable our clients to build safer, fairer societies where people can live, interact, and move about freely. With 4000+ employees and 150+ partners worldwide, we offer more than just a job: we provide a dynamic environment where innovation thrives, opportunities abound, and your talents are valued. Be part of a global leader shaping the future of biometric-based technology.

We are seeking a highly organized and strategic technical leader to manage the SRE team driving IDEMIA’s critical Identity Verification Platform. In this role, you will bridge the gap between advanced engineering and customer success. We need a delivery-focused manager who thrives on bringing complex technical solutions over the finish line while maintaining impeccable platform reliability and security. You will lead SRE teams responsible for production reliability, partner closely with software engineering, security, and infrastructure teams, and establish reliability practices that enable safe, scalable innovation in regulated environments.

Requirements

Exceptional Leadership: Proven experience managing and scaling high-performing SRE or Cloud Engineering teams, backed by elite organizational and project management skills.
Execution Focus: A strong track record of steering complex, enterprise-grade technical solutions from architectural design to successful, stable deployment.
Technical Expertise: Deep architectural and operational knowledge of our core stack: AWS (including GovCloud environments), Kubernetes, Helm charts, KMS (Key Management Service), and Kafka for event-driven and streaming architectures. Demonstrated proficiency utilizing Terraform for Infrastructure as Code (IaC), alongside modern monitoring and observability ecosystems.
Bachelor’s degree in Computer Science or Engineering, or equivalent experience.
8+ years of experience in site reliability engineering, infrastructure, or DevOps roles.
4+ years of people management experience, including managing senior engineers or managers.
Strong experience supporting high‑availability production systems.
Deep understanding of Linux, distributed systems, networking, and cloud infrastructure.
Proven experience with incident response, problem management, and operational excellence.
Hands‑on experience with cloud platforms (AWS, Azure, and/or GCP); Infrastructure as Code (Terraform, CloudFormation, Ansible); and CI/CD pipelines and deployment automation.

Nice To Haves

Experience working in regulated or high‑security environments (government, public safety, identity, financial, or similar).
Background in software engineering or platform engineering.
Experience with containerized and orchestrated environments (Docker, Kubernetes).
Familiarity with observability tools (Prometheus, Grafana, Datadog, Splunk, ELK).
Experience supporting compliance frameworks (SOC 2, ISO 27001, FedRAMP, CJIS, etc.).

Responsibilities

Lead, mentor, and scale multiple SRE teams supporting critical production systems.
Build a culture of ownership, accountability, and continuous improvement.
Define team structure, capacity planning, and career development paths.
Act as a senior leader during critical incidents and executive escalations.
Act as the primary customer-facing support liaison during pre-sales engagements, post-sales integrations, and ongoing day-to-day operations.
Partner closely with clients to resolve post-integration technical inquiries, guide them through our solutions, and ensure a seamless, fully supported experience throughout their entire lifecycle.
Own system reliability, availability, resilience, and operational readiness.
Define and manage SLIs, SLOs, and error budgets aligned with business and customer commitments.
Lead incident management, root cause analysis (RCA), and post‑incident remediation in a blameless culture.
Ensure production environments meet contractual, regulatory, and security requirements.
Drive automation across infrastructure provisioning, deployments, monitoring, and recovery.
Champion Infrastructure as Code, CI/CD pipelines, and self‑healing systems.
Reduce operational toil and manual intervention through tooling and architectural improvements.
Oversee the end-to-end deployment, operation, and active monitoring of a highly secure, high-traffic cloud infrastructure, including the implementation of robust alerting and comprehensive testing.
Lead and guide the team in establishing best practices for monitoring, incident response, and quality assurance, ensuring reliability, scalability, and continuous improvement of the platform.
Partner closely with Product and Development teams to ensure operational readiness for new features.
Monitor and strategically optimize cloud infrastructure costs without compromising platform performance.
Collaborate with compliance and security stakeholders to support audits and regulatory obligations.
Communicate clearly with leadership on operational risks, reliability posture, and improvement plans.