Platform Engineer

Stefanini Group•Dearborn, MI

66d•Onsite

About The Position

Stefanini Group is hiring! Stefanini is looking for a Platform Engineer in Dearborn, MI (Onsite). For quick apply, please reach out to Adil Khan at 248-728-6424 / [email protected]

We are looking for a Platform Engineer to help product teams deliver securely, reliably, and quickly. This role leans toward cloud infrastructure, DevOps, and Site Reliability Engineering (SRE), with strong software development skills.

Requirements

Managed production-grade infrastructure on major cloud platforms like GCP.
Designed multi-region GCP networks using VPCs, subnets, firewalls, and NAT, managed with Terraform and GitOps.
Strong understanding of networking, IAM boundaries, and tradeoffs between managed services and self-hosted solutions.
Built production-grade Python tools or automation with structured, testable, and maintainable code.
Automated tasks like querying GCP Asset Inventory, generating IAM reports, and creating tickets with retry/error handling.
Operated GCP services like Cloud Run, Workload Identity, Secret Manager, and VPC Service Controls.
Applied GCP-specific reliability and security patterns with hands-on experience.
Supported internal developer teams by handling on-call rotations, resolving incidents, and delivering systemic fixes.
Managed production Kubernetes clusters, performed upgrades, configured policies, and debugged issues.
Configured HPA/VPA for autoscaling and troubleshot pod scheduling and service mesh connectivity.
Strong understanding of Kubernetes control planes for debugging and management.
Cloud Platforms: Experience managing production-grade systems on GCP, AWS, or Azure with an SRE mindset.
Linux & Networking: Strong fundamentals in Linux, distributed systems, and debugging production issues.
Infrastructure as Code: Skilled in tools like Terraform, Helm, Kustomize, and GitOps practices.
Containers & Orchestration: Proficient in Docker, Kubernetes, and modern CI/CD tools.
Programming: Experience with languages like Python, Go, Java, or TypeScript for building tools and automation.
Communication: Clear communicator with effective incident leadership and a customer-first approach.

Nice To Haves

Wrote Go for platform tooling or infrastructure automation.
Developed Kubernetes admission webhooks to enforce security policies or CLI tools for secret management.
Produced idiomatic Go with proper error handling, context propagation, and unit tests.
Contributed to or led the design of multi-team or multi-service platform architectures.
Designed shared service networks (hub-and-spoke models), CI/CD templates, and service mesh configurations.
Documented architecture patterns adopted by teams and articulated tradeoffs in design reviews.
Implemented SRE practices, including SLIs, SLOs, and error budgets.
Configured SLO-based alerting in Prometheus/Grafana and used burn rate alerts for incident management.
SLI/SLO Expertise: Experience defining SLIs/SLOs and implementing SLO-based alerting and dashboards.
Observability Platforms: Familiarity with Prometheus/Grafana, OpenTelemetry, and centralized logging.
Security Practices: Knowledge of policy-as-code, supply chain security, SBOMs, and artifact signing.
Standardized Solutions: Experience creating reusable golden paths (e.g., container images, templates, pipelines).
Cost Optimization: Skilled in FinOps practices, capacity planning, and multi-tenant platform controls.
Go: Proficient in writing idiomatic Go for platform tooling or infrastructure automation.
Cloud Architecture: Experience designing multi-service or multi-team platform architectures.
Reliability Engineering: Practical implementation of SRE practices, including SLIs, SLOs, error budgets, and alerting.

Responsibilities

Design and Operate Cloud Infrastructure: Build and manage cloud platforms, including networking, compute, Kubernetes, CI/CD, secrets, and identity.
Define Reliability Metrics: Establish and enhance SLIs, SLOs, and error budgets.
Implement Observability: Set up metrics, logs, and traces with actionable alerts.
Automate Workflows: Develop self-service workflows (e.g., infrastructure as code, GitOps, CI/CD pipelines) to reduce manual effort.
Enhance Security & Compliance: Drive least-privilege access, secure defaults, and policy-as-code.
Incident Management: Participate in on-call rotations, handle incidents, lead postmortems, and deliver fixes.
Collaborate with Teams: Partner with application teams to improve deployability, resilience, and cost efficiency.
