Platform Engineer

Fleetworthy


About The Position

Fleetworthy Inc. is hiring a Platform Engineer to own and evolve the systems that connect our cloud platform to the physical world. You'll work across AWS and Kubernetes environments, distributed Linux edge hardware, and the observability stack that ties it all together — building the reliability, automation, and operational visibility that lets the rest of engineering move fast with confidence. This role sits at the intersection of platform engineering, site reliability, and field operations. You'll be a key partner to development, data, and operations teams — and the person others reach for when something breaks in a complicated place. You'll treat the platform as a product, with internal engineering teams as your users, and you'll have meaningful ownership over how Fleetworthy ships and runs software.

Requirements

5+ years of experience in platform engineering, site reliability, DevOps, or infrastructure roles, working at production scale with technologies like those listed below.
Strong Linux fundamentals across Ubuntu and Debian environments — you're comfortable with systemd, nmcli, networking, package management, and embedded or edge hardware.
Proven Kubernetes experience — you can troubleshoot a broken pod, inspect events, tune resource scheduling, and trace a request end-to-end through a layered routing stack.
Hands-on Terraform, Ansible, and CI/CD experience — you write infrastructure as code and treat deployment reliability as a feature.
Observability fluency — PromQL, LogQL, Databricks SQL, and CloudWatch queries aren’t intimidating, and you build dashboards that are actually useful to the people reading them.
Solid networking foundation: DNS, TLS, TCP, ARP, firewalls, and VPN and load balancer behavior in both cloud and physical environments.
Pragmatic and calm under pressure — you work effectively with legacy systems, vendor constraints, and incomplete documentation while steadily moving things forward.
Strong communicator — you write clear runbooks, explain complex systems to non-infrastructure teammates, and leave environments better documented than you found them.

Nice To Haves

Windows Server
SQL Server
SSRS
Datadog DBM
.NET worker services
Temporal

Responsibilities

Own and evolve our AWS infrastructure — EKS, EC2, ALB/ELB, ACM, IAM, ECR, and Auto Scaling Groups — with a focus on uptime, cost efficiency, and deployment safety.
Maintain Kubernetes platform health: deployments, ingress, HPA, secrets, Helm releases, and production incident response — you can trace a broken request through a public ALB, reverse proxy, internal ALB, and K8s ingress without breaking a sweat.
Partner with development teams to improve deployment pipelines, reduce manual steps, and raise the reliability bar across all environments.
Build and maintain dashboards, alerts, and telemetry pipelines using Grafana, Prometheus/Mimir, Loki, Grafana Alloy, Tempo, OpenTelemetry, Datadog, and CloudWatch.
Create actionable metrics, log views, and traces that help engineering and operations teams see what's happening — not just that something went wrong.
Write PromQL, LogQL, SQL, and CloudWatch queries that surface real signal, not noise — and build alert quality into the culture, not just the config.
Support distributed Linux-based hardware deployed in the field — physical servers, embedded devices, vendor integrations, and the data forwarding services that connect them to the cloud.
Troubleshoot connectivity, routing, ARP, DNS, firewall rules, VPN behavior, and TCP socket data flows in remote environments where recoverability matters as much as uptime.
Develop and maintain runbooks, automation, and configuration management practices that make field operations repeatable and resilient.
Own and improve Terraform and Terraform Cloud codebases, Ansible playbooks, Azure DevOps and GitLab CI/CD pipelines, and shell/Python automation.
Treat configuration drift, manual toil, and undocumented procedures as technical debt — and systematically pay it down.
Harden and document Linux systems across cloud and edge environments, with an eye toward consistency and safe repeatability.
Treat the platform as a product: build opinionated, well-supported workflows that help product teams provision services, ship code, and operate them in production without needing deep infra expertise.
Gather feedback from engineering teams, prioritize based on impact, and measure adoption of, and satisfaction with, platform capabilities.
Partner with security to bake guardrails into the platform — secrets management, policy-as-code, supply chain security, and least-privilege defaults.