Platform Engineer

Stefanini Group, Dearborn, MI
Posted 13d ago, Onsite

About The Position

Stefanini Group is hiring a Platform Engineer in Dearborn, MI (Onsite). For quick apply, please reach out to Adil Khan at 248-728-6424 / [email protected].

We are looking for a Platform Engineer to help product teams deliver securely, reliably, and quickly. This role leans toward cloud infrastructure, DevOps, and Site Reliability Engineering (SRE), with strong software development skills.

Requirements

  • Managed production-grade infrastructure on major cloud platforms like GCP.
  • Designed multi-region GCP networks using VPCs, subnets, firewalls, and NAT, managed with Terraform and GitOps.
  • Strong understanding of networking, IAM boundaries, and tradeoffs between managed services and self-hosted solutions.
  • Built production-grade Python tools or automation with structured, testable, and maintainable code.
  • Automated tasks like querying GCP Asset Inventory, generating IAM reports, and creating tickets with retry/error handling.
  • Operated GCP services like Cloud Run, Workload Identity, Secret Manager, and VPC Service Controls.
  • Applied GCP-specific reliability and security patterns with hands-on experience.
  • Supported internal developer teams by handling on-call rotations, resolving incidents, and delivering systemic fixes.
  • Managed production Kubernetes clusters, performed upgrades, configured policies, and debugged issues.
  • Configured HPA/VPA for autoscaling and troubleshot pod scheduling and service mesh connectivity.
  • Strong understanding of Kubernetes control planes for debugging and management.
  • Cloud Platforms: Experience managing production-grade systems on GCP, AWS, or Azure with an SRE mindset.
  • Linux & Networking: Strong fundamentals in Linux, distributed systems, and debugging production issues.
  • Infrastructure as Code: Skilled in tools like Terraform, Helm, Kustomize, and GitOps practices.
  • Containers & Orchestration: Proficient in Docker, Kubernetes, and modern CI/CD tools.
  • Programming: Experience with languages like Python, Go, Java, or TypeScript for building tools and automation.
  • Communication: Clear communicator with effective incident leadership and a customer-first approach.

Nice To Haves

  • Wrote Go for platform tooling or infrastructure automation.
  • Developed Kubernetes admission webhooks to enforce security policies or CLI tools for secret management.
  • Produced idiomatic Go with proper error handling, context propagation, and unit tests.
  • Contributed to or led the design of multi-team or multi-service platform architectures.
  • Designed shared service networks (hub-and-spoke models), CI/CD templates, and service mesh configurations.
  • Documented architecture patterns adopted by teams and articulated tradeoffs in design reviews.
  • Implemented SRE practices, including SLIs, SLOs, and error budgets.
  • Configured SLO-based alerting in Prometheus/Grafana and used burn rate alerts for incident management.
  • SLI/SLO Expertise: Experience defining SLIs/SLOs and implementing SLO-based alerting and dashboards.
  • Observability Platforms: Familiarity with Prometheus/Grafana, OpenTelemetry, and centralized logging.
  • Security Practices: Knowledge of policy-as-code, supply chain security, SBOMs, and artifact signing.
  • Standardized Solutions: Experience creating reusable golden paths (e.g., container images, templates, pipelines).
  • Cost Optimization: Skilled in FinOps practices, capacity planning, and multi-tenant platform controls.
  • Go: Proficient in writing idiomatic Go for platform tooling or infrastructure automation.
  • Cloud Architecture: Experience designing multi-service or multi-team platform architectures.
  • Reliability Engineering: Practical implementation of SRE practices, including SLIs, SLOs, error budgets, and alerting.

Responsibilities

  • Design and Operate Cloud Infrastructure: Build and manage cloud platforms, including networking, compute, Kubernetes, CI/CD, secrets, and identity.
  • Define Reliability Metrics: Establish and enhance SLIs, SLOs, and error budgets.
  • Implement Observability: Set up metrics, logs, and traces with actionable alerts.
  • Automate Workflows: Develop self-service workflows (e.g., infrastructure as code, GitOps, CI/CD pipelines) to reduce manual effort.
  • Enhance Security & Compliance: Drive least-privilege access, secure defaults, and policy-as-code.
  • Incident Management: Participate in on-call rotations, handle incidents, lead postmortems, and deliver fixes.
  • Collaborate with Teams: Partner with application teams to improve deployability, resilience, and cost efficiency.