Senior Engineer, DevOps/Platform Reliability

PayCargo • Miami, FL

About The Position

The Senior Engineer, DevOps/Platform Reliability is responsible for building and operating the infrastructure, pipelines, and platform standards that keep PayCargo's global payments platform reliable, observable, and supportable. The role spans the full platform: EC2-based services, scheduled jobs, and file processing, alongside containerized (ECS/Fargate) and serverless (Lambda) workloads. It works across a multi-account AWS environment, Terraform, and a GitHub and ZenHub workflow that ships through GitHub Actions and GitHub OIDC, with a focus on modernizing how PayCargo builds, deploys, and runs software. As one example, PayCargo's SFTP runs on AWS Transfer Family with a Lambda identity provider.

This is a hands-on individual contributor role with no direct reports. The Senior Engineer, DevOps/Platform Reliability modernizes legacy scheduled jobs and file processes into containerized, observable services, codifies infrastructure as repeatable Terraform patterns, and creates standards that other developers can follow without depending on a single person for every implementation. The role leads indirectly by defining infrastructure and deployment standards and guiding engineers toward repeatable patterns. It requires strong judgment, strong follow-through, and a focus on reducing reactive fire drills and single points of failure across the platform.

Working within PayCargo's DevSecOps model, the Senior Engineer, DevOps/Platform Reliability partners closely with Security, Engineering, Architecture, Product, Support, and executive stakeholders to deliver scalable, secure, and repeatable platform execution.

Requirements

5+ years of hands-on DevOps, platform, or infrastructure engineering experience preferred
Strong experience with AWS (ECS/Fargate, Lambda, VPC, IAM) and working knowledge of Azure or Entra ID
Hands-on experience with infrastructure-as-code using Terraform, including reusable modules, remote state, and plan/apply in CI
Strong experience with Docker, container orchestration on ECS/Fargate, and container image registries such as ECR
Experience building and maintaining CI/CD pipelines, preferably with GitHub Actions, including OIDC-based cloud authentication
Experience with monitoring and observability tooling such as CloudWatch, SNS, Sentry, and Athena/Glue
Strong understanding of secrets management (Secrets Manager, SSM Parameter Store), environment configuration, and secure deployment
Strong troubleshooting, incident response, and root cause analysis skills
Ability to create repeatable standards and documentation that reduce single points of failure
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field, or equivalent practical experience
Demonstrated experience operating production infrastructure and CI/CD in cloud environments
Experience with containerization, infrastructure-as-code, and observability tooling

Nice To Haves

Experience modernizing legacy scheduled jobs and file-processing workloads into containerized services
Experience operating both EC2-based services and containerized or serverless workloads
Experience with disaster recovery, multi-region (us-east-1 / ap-east-1) redundancy, and failover design
Familiarity with secure AI/LLM platform patterns, whitelisted egress, and bounded environments
Experience with on-call workflows and tooling such as PagerDuty
Familiarity with zero-trust network access (Tailscale) and SSM Session Manager in place of bastion hosts
Experience in payments, fintech, SaaS, or other high-volume transactional environments
Familiarity with SOC and PCI control requirements as they relate to infrastructure

Responsibilities

Modernize legacy scheduled jobs, cron scripts, and file processes into containerized (ECS/Fargate), observable, supportable services
Build and maintain infrastructure patterns in Terraform, with reusable modules, remote state, and plan/apply through CI
Standardize environment configuration, secrets management (Secrets Manager and SSM Parameter Store), and repeatable deployment paths across environments and accounts
Create platform standards that other developers can follow without depending on DevOps for every implementation
Build, maintain, and harden CI/CD pipelines integrated with GitHub and ZenHub, with deployments authenticated through GitHub OIDC to eliminate static cloud credentials
Improve build, test, and deployment automation to make releases faster, safer, and more repeatable
Establish rollback and environment-promotion practices that reduce release risk
Embed security and quality gates into pipelines in partnership with Security and Engineering
Implement and maintain monitoring, logging, and alerting using CloudWatch, SNS, and Sentry, with log analytics through Athena and Glue
Improve telemetry, dashboards, and on-call workflows (PagerDuty) so issues are detected and resolved quickly
Support disaster recovery, backup, and failover patterns across regions and accounts
Lead incident response and root cause analysis with clear, durable follow-up
Support the infrastructure for a contained AI platform, including whitelisted egress and approved deployment paths
Help operationalize controls such as stateless model access and bounded environments in partnership with Security and Architecture
Build deployment and monitoring patterns for AI-assisted applications so they are observable and supportable
Partner with Security to embed controls into pipelines, environments, and infrastructure-as-code, including OIDC roles, least privilege, mTLS, and Tailscale-based access
Work with Engineering and Architecture to translate designs into runnable, supportable infrastructure
Advise Product and Support on operational realities, trade-offs, and delivery risk
Implement and operate the infrastructure, pipelines, and environments according to the standards and architecture owned by the VP, Infrastructure & Security
Provide clear status, escalate risks early, and document infrastructure, pipelines, and runbooks