Production Operations Engineer

DryvIQ, Inc.

2d•Remote

About The Position

DryvIQ is a rapidly growing, venture-backed software company headquartered in the Ann Arbor tech cluster with a 90% remote workforce across all U.S. time zones. We help enterprises safeguard their most sensitive documents and content through intelligent, data-driven visibility and synchronization. We value curiosity, technical excellence, and collaboration.

Our culture is strictly merit-based: we recognize and reward contributions based on impact and outcomes. We value initiative and growth over politics or tenure, creating an environment where everyone has a fair chance to succeed.

We also embrace pragmatic AI adoption: we use AI tools (including AI-assisted code generation) to speed up development and improve quality, but we are not zealots about it. We have no intention of replacing humans and believe the best solutions come from human creativity, experience, and judgment.

Role Overview

The Production Operations Engineer / SRE ensures reliability, security, and consistency across DryvIQ’s hybrid environments, which span Azure, AWS, and on-prem customer installations (note: not a traditional multi-tenant SaaS environment). This role bridges engineering and operations, owning deployment automation, monitoring, and incident response for mission-critical data-management workloads. Because our codebase is sophisticated and mission-critical for global enterprise customers, we look for engineers who can ramp up quickly and make meaningful contributions to system architecture, performance, and maintainability.

Requirements

5+ years in SRE, DevOps, or Production Ops roles supporting hybrid or on-prem software delivery.
Minimum 3 years working with Fortune 500 companies implementing or maintaining enterprise software.
Expertise with Kubernetes, Helm, and Docker in mixed cloud environments (Azure AKS, AWS EKS, on-prem K3s).
Solid understanding of network security (proxies, TLS, VPN, firewalls) and Linux administration.
Strong scripting and automation skills (Bash, Python, PowerShell, YAML / Terraform), especially as they relate to Kubernetes.
Familiarity with CI/CD pipelines (GitHub Actions, TeamCity, Argo CD or Flux).
Experience supporting distributed systems (e.g., Apache Pulsar, Postgres, ClickHouse, Redis, MinIO).
Comfort working directly with enterprise customer admins and security teams.

Responsibilities

Understand, deploy, and maintain Helm charts and CI/CD workflows for AKS, EKS, and on-prem Kubernetes (K3s or RKE2) in customer environments.
Standardize customer deployments (private cloud / air-gapped) using reproducible manifests and configuration validation tooling.
Maintain our single-node and multi-node install processes; improve installer packaging.
Monitor uptime, capacity, and performance across distributed clusters (migration, scan, OLAP DB node groups).
Implement proactive alerting (Prometheus, Grafana, Azure Monitor, CloudWatch) and ensure runbooks exist for all major services.
Coordinate with customer IT/security teams to handle firewall, proxy, and credential configurations safely and consistently.
Participate in release-readiness and hardening cycles; validate new images and Helm charts before customer rollout.
Lead incident response for production issues—triage, communicate status, and drive post-incident reviews and root-cause documentation.
Track reliability metrics (MTTR, deployment success rate, change-failure rate) and feed insights back into engineering planning.
Integrate static/dynamic security scanning (GitHub Advanced Security / CodeQL / Dependabot) and image-signing pipelines.
Ensure secrets, credentials, and certificates are rotated and stored per corporate security standards.
Support ISO / SOC 2 audit evidence collection (CCR change control, deployment logs, access reviews).
Extend monitoring to include customer-facing telemetry where allowed; maintain log shipping and retention policies.
Contribute to internal dashboards showing environment health, install duration, and customer success metrics.
Work closely with Dev / QA / Support to reproduce issues in controlled environments and publish fixes or workarounds.
Provide training and documentation for Services and Support engineers deploying or maintaining on-prem instances.
Champion “build-to-run” culture—drive automation, resiliency testing, and feedback loops between engineering and field ops.