About The Position

We are working with a long-standing anchor client to source a T3 Operations & Support Specialist (Compute & OS) for a large-scale cloud-native platform programme supporting a major energy transmission operator in Germany. The platform is a service-oriented hybrid cloud environment providing application teams with self-service capabilities to develop, run and operate software products across private and public cloud infrastructure. In this role you will provide Tier-3 operational ownership for Compute & Operating System services within Local Production (DE), handling complex incidents, deep troubleshooting and root cause analysis, and driving permanent fixes and preventive measures.

Requirements

  • At least 5 years (ideally 10 or more) in IT operations, service delivery or platform operations, with demonstrated leadership in mission-critical environments
  • Proven experience implementing and leading Incident, Problem, Change and Release governance in production
  • Hands-on experience with VMware vSphere 8 virtualisation
  • Operating Systems: Red Hat Enterprise Linux and Ubuntu
  • OS tooling: Satellite, IPA, Certificate Server
  • ITSM/collaboration tooling: Jira Service Management, Jira, Confluence
  • Fundamental understanding of core operations processes (Incident, Change, Problem management, ITSM) and SRE concepts
  • Experience gathering operational insights from monitoring/observability including SLI/SLA/SLO management and tracking
  • Hands-on experience documenting procedures and enforcing clear runbooks and playbooks
  • Hands-on experience with monitoring and logging tools (e.g. Prometheus, Grafana, Datadog, Mimir, Loki)
  • Understanding of modern platform operations (Kubernetes/containers, automation, observability) sufficient to govern specialists
  • Fluent English and German (C1 minimum in both)

Nice To Haves

  • Experience operating in regulated or high-availability industries (banking, telco, public sector, healthcare)
  • Experience with SRE practices (SLOs/SLIs, error budgets) and reliability management
  • Familiarity with enterprise DevOps toolchains (GitLab, JFrog Artifactory, Backstage, Harness)
  • GitOps and IaC awareness (Terraform/OpenTofu, ArgoCD, Helm)

Responsibilities

  • Providing T3 operational ownership for Compute & OS services: handling complex incidents, troubleshooting and RCA, and driving permanent fixes and preventive measures
  • Ensuring compute/OS readiness for releases and changes: monitoring/alerting coverage, performance baselines, hardening, patch strategy, rollback and recovery procedures, and runbooks
  • Executing and improving standard operational procedures through automation to reduce toil and improve MTTR and stability
  • Coordinating with Kubernetes, Data, Network and Storage SMEs to resolve cross-domain production issues
  • Validating deployment artefacts from an operations perspective and enforcing quality assurance measures
  • Monitoring system health, performance metrics and service availability across multi-tenant environments
  • Identifying, analysing and resolving incidents to minimise service disruption, and triggering RCA and corrective actions
  • Implementing monitoring and logging strategies to support audit and compliance requirements
  • Performing routine security scans and remediating identified vulnerabilities

Benefits

  • Flexible working hours
  • Freedom to choose your own projects
  • Access to exciting projects in various industries
  • Support in advancing your career
  • Competitive pay
  • Dedicated team to help you with any questions
  • Opportunity to work independently
  • Access to our strong network to help you achieve your professional goals