T3 Operations & Support Specialist — Compute & OS (PID9066)

Interval•Berlin, IL

Remote

About The Position

We are working with a long-standing anchor client to source a T3 Operations & Support Specialist (Compute & OS) for a large-scale cloud-native platform programme supporting a major energy transmission operator in Germany. The platform is a service-oriented hybrid cloud environment providing application teams with self-service capabilities to develop, run and operate software products across private and public cloud infrastructure. In this role you will provide Tier-3 operational ownership for Compute & Operating System services within Local Production (DE), handling complex incidents, deep troubleshooting and root cause analysis, and driving permanent fixes and preventive measures.

Requirements

At least 5 years (ideally 10 or more) in IT operations, service delivery or platform operations, with demonstrated leadership in mission-critical environments
Proven experience implementing and leading Incident, Problem, Change and Release governance in production
Hands-on experience with VMware vSphere 8 virtualisation
Operating Systems: Red Hat Enterprise Linux and Ubuntu
OS tooling: Red Hat Satellite, IPA (identity management), Certificate Server
ITSM/collaboration tooling: Jira Service Management, Jira, Confluence
Fundamental understanding of core ITSM operations processes (Incident, Change and Problem management) and SRE concepts
Experience gathering operational insights from monitoring and observability data, including SLI/SLO/SLA management and tracking
Hands-on experience documenting procedures and enforcing clear runbooks and playbooks
Hands-on experience with monitoring and logging tools (e.g. Prometheus, Grafana, Datadog, Mimir, Loki)
Understanding of modern platform operations (Kubernetes/containers, automation, observability) sufficient to govern specialists
Fluent English and German (C1 minimum in both)

Nice To Haves

Experience operating in regulated or high-availability industries (banking, telco, public sector, healthcare)
Experience with SRE practices (SLOs/SLIs, error budgets) and reliability management
Familiarity with enterprise DevOps toolchains (GitLab, JFrog Artifactory, Backstage, Harness)
GitOps and IaC awareness (Terraform/OpenTofu, ArgoCD, Helm)

Responsibilities

Providing T3 operational ownership for Compute & OS services: handling complex incidents, troubleshooting and RCA, and driving permanent fixes and preventive measures
Ensuring compute/OS readiness for releases and changes: monitoring/alerting coverage, performance baselines, hardening, patch strategy, rollback and recovery procedures, and runbooks
Executing and improving standard operational procedures through automation to reduce toil and improve MTTR and stability
Coordinating with Kubernetes, Data, Network and Storage SMEs to resolve cross-domain production issues
Validating deployment artefacts from an operations perspective and enforcing quality assurance measures
Monitoring system health, performance metrics and service availability across multi-tenant environments
Identifying, analysing and resolving incidents to minimise service disruption, and triggering RCA and corrective actions
Implementing monitoring and logging strategies to support audit and compliance requirements
Performing routine security scans and remediating identified vulnerabilities