Senior DevOps & Infrastructure Lead

RV LIFE • Dallas, TX

Remote

About The Position

RV LIFE is looking for a Senior DevOps & Infrastructure Lead to help us stabilize, document, and modernize the infrastructure behind our products. This is a hands-on senior role for someone comfortable inheriting real production systems, reducing operational risk, improving reliability, and moving us toward a documented, secure, automated, infrastructure-as-code operating model. We run production across DigitalOcean, AWS, Cloudflare, and other hosting providers, and are consolidating onto managed, infrastructure-as-code platforms. We need deep, hands-on expertise across these environments.

RV LIFE is an AI-first engineering organization. We expect this person to use AI to accelerate discovery, documentation, runbooks, log review, scripting, and infrastructure-as-code drafting, while applying strict human judgment around security, secrets, production access, destructive commands, rollback, and correctness.

This role focuses on the infrastructure path to reliability; application-level architecture changes are handled in partnership with our engineering team. It is not just about keeping servers alive. It is about building durable practices that reduce single-person dependency, improve visibility, and make our systems safer to operate.

This is not a standard 9-to-5 role. Production issues do not keep business hours, so it carries real on-call responsibility: you need to be reachable and able to respond when unforeseen incidents arise.

Requirements

Senior-level experience operating production infrastructure.
Deep, hands-on Linux server administration (the traditional, "old-school" kind): operating, securing, and troubleshooting manually managed production servers (LAMP/LEMP, system services, cron, networking, SSH) directly at the command line, not only through a cloud console.
Experience with DigitalOcean, Linode, AWS EC2, bare VPS hosting, or comparable environments.
Senior database operations: migrating self-managed MySQL to a managed service, replication, backup validation, restore testing, and IO isolation.
Strong Cloudflare experience across DNS, WAF, CDN and caching behavior, page rules, Workers, Pages, and Zero Trust/Access, including traffic routing and origin protection.
Experience with PHP/Laravel application environments and with a managed Laravel runtime (Laravel Cloud and/or DigitalOcean App Platform).
Datadog or a comparable observability platform for monitoring, alerting, dashboards, logs, and incident investigation.
Infrastructure-as-code such as Terraform, Pulumi, AWS CDK, Serverless Framework, or CloudFormation.
CI/CD pipelines and deployment automation.
Practical AWS experience (Lambda, IAM, VPC, CloudWatch, S3, SSM/Secrets Manager, queues).
Good judgment around production safety, access control, secrets, backups, and incident response.
Willingness to carry real on-call responsibility and respond to production incidents outside normal business hours; this is not a strict 9-to-5 role.
A habit of documenting what you learn and creating runbooks others can follow.
Practical experience using AI tools (ChatGPT, Claude, Cursor, GitHub Copilot, or similar), with strong judgment about where human verification is required.
Ability to work independently in a small, remote engineering organization where practical ownership matters more than bureaucracy.

Nice To Haves

Experience migrating manually managed services onto managed platforms or IaC.
Experience moving static frontends onto Cloudflare Pages.
Experience migrating MongoDB, OpenSearch, or Valkey/Redis onto managed services.
Experience supporting Node.js, React, and React Native alongside PHP.
Experience helping organizations reduce infrastructure bus-factor risk.
Experience working with external DevOps/security partners or auditors.

Responsibilities

Administer and improve existing DigitalOcean infrastructure.
Support and improve Linux-based production server environments.
Migrate self-managed databases onto managed database services, with validated failover, backups, and recovery.
Move applications onto managed runtimes (including Laravel Cloud where it fits), replacing manual deploy processes with automated, repeatable pipelines.
Expand and harden our use of Cloudflare for edge, static hosting, caching, and security.
Build a clear inventory of servers, services, databases, domains, access paths, backups, monitoring, and operational risks.
Create and maintain practical runbooks for common and emergency infrastructure workflows.
Improve incident response, escalation paths, monitoring, logging, and alerting.
Review and improve backup, restore, and disaster-recovery procedures.
Identify recurring manual work and convert it into safer procedures, scripts, automation, or infrastructure-as-code.
Help define infrastructure-as-code standards and move appropriate infrastructure into repeatable, version-controlled workflows.
Work with AWS services where needed (Lambda, VPC, IAM, CloudWatch, S3, SSM/Secrets Manager, queues).
Use AI tools to accelerate discovery, documentation, scripting, troubleshooting, and automation, with strong production-safety judgment.
Partner with engineering leadership to prioritize infrastructure risk and modernization; track work clearly in Jira/GitHub and communicate proactively about risks, tradeoffs, and blockers.