Principal DevOps Engineer

Northwoods • Dublin, OH

About The Position

The Principal DevOps Engineer serves as the operational leader for how our systems are built, deployed, and operated, owning their clarity, reliability, security, and repeatability. This role designs and maintains automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments, while advancing the internal platform capabilities engineers rely on every day, including the shared AI and LLM infrastructure that supports modern product delivery.

This is a deeply hands-on role responsible for executing and improving deployments, observability, AI-enabled delivery tooling, and core operational practices to reduce risk caused by opaque processes, undocumented knowledge, and single points of failure. The Principal DevOps Engineer turns deployment and infrastructure from siloed knowledge into understandable, well-documented, observable systems that teams can confidently use and improve, including platform patterns for safe and reliable LLM usage.

The role leads through practice by mentoring engineers, establishing standards, improving processes, and removing operational obstacles. Working independently and in close partnership with Engineering, this role reduces operational burden, increases delivery confidence, and builds platform capabilities that scale reliably. It provides technical leadership through ownership and execution and does not include formal people-management responsibilities.

Requirements

6-10 years of hands-on experience in DevOps, infrastructure, or platform engineering supporting production systems at scale, including experience enabling modern, automation-heavy engineering organizations.
Strong, hands-on experience operating workloads in AWS, with responsibility for reliability, security, performance, and day-to-day operations across services that include both traditional application workloads and AI-enabled platform capabilities.
Proven production experience with Kubernetes, including deploying, operating, and troubleshooting containerized workloads.
Strong programming experience with Python (or similar), with the ability to write and maintain production automation and work fluently in code-first operational workflows.
Deep hands-on expertise with Terraform and infrastructure-as-code practices, with experience using broader DevOps tooling such as CloudFormation and Ansible.
Strong proficiency with Git-based source control, including code reviews, collaborative workflows, and infrastructure/code ownership.
Extensive experience building, operating, and improving CI/CD pipelines for provisioning, deployment, scaling, and automated verification, including practical integration of AI-assisted tooling in delivery workflows.
Strong Linux/Unix expertise, including administration, scripting, troubleshooting, and operational monitoring in production environments.
Hands-on experience implementing monitoring and log aggregation platforms (ELK, Graylog, Graphite, Prometheus, etc.).
Experience implementing test automation and AI-assisted tooling to improve deployment quality, reliability, and operational efficiency, including workflows that use LLM-based assistants responsibly.
Experience deploying and managing application infrastructure such as web or application servers, load balancers, queues, and caches, with an emphasis on scalability, resiliency, and operational transparency, plus familiarity with shared gateway or proxy patterns for external AI/LLM services.
Must be authorized to work in the U.S.
Strong ability to understand and operate systems end-to-end, including application architecture, infrastructure, deployment workflows, production operations, and AI/LLM service integration patterns.
Proven ability to troubleshoot and resolve complex production issues across infrastructure, CI/CD pipelines, Kubernetes, and runtime environments.
Strong understanding of observability practices, including metrics, centralized logging, alerting, tracing, and root-cause analysis, with the ability to extend observability and diagnostics to AI-enabled services.
Deeply hands-on operator with sound technical judgment; able to assess situations quickly and clearly recommend solutions (what we should do and why).
Strong sense of ownership and accountability, with the ability to prioritize work that improves reliability, reduces risk, controls cost, and ensures follow-through across both core infrastructure and AI-enabled platform services.
Ability to collaborate effectively with software engineers and communicate clearly with both technical and non-technical stakeholders.
Ability to lead through influence by pairing, mentoring, documenting, and establishing practical standards across platform and delivery engineering.
Self-starter comfortable operating in environments where structure must be built, not inherited, with a focus on clarity, measurable outcomes, and execution.
Strong security mindset, with hands-on experience in secrets management, access controls, encryption, patching, vulnerability management, and secure service integration, including third-party AI and LLM providers.
Hands-on experience with network topology, including the ability to configure and troubleshoot site-to-site VPNs, firewall rules, and hybrid-cloud connectivity.

Nice To Haves

Hands-on experience with networking concepts such as VPNs, firewall rules, or hybrid-cloud connectivity, and ability to apply these concepts to secure AI service integrations.
Security experience in regulated or compliance-driven environments (e.g., SOC 2 and HIPAA familiarity), including governance considerations for AI and LLM-backed workflows.
Database administration experience, including performance and reliability fundamentals.
Experience supporting or deploying 12-Factor applications, internal developer platforms, or AI-enabled engineering workflows; experience with LLM gateways, model routing, or usage/cost controls is a strong plus.

Responsibilities

Own day-to-day DevOps operations, including infrastructure health, monitoring, logging, patching, security posture, and maintenance, ensuring systems are observable and failures are diagnosable through strong metrics, logging, root-cause visibility, and effective incident response.
Own and execute deployment processes end-to-end, ensuring they are secure, repeatable, transparent, and well documented with clear failure signals, automated rollback strategies, and release evidence that supports fast, safe decision-making.
Design, build, and maintain automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments using infrastructure-as-code and platform engineering best practices, including shared services that enable AI-assisted and agentic engineering workflows.
Build, operate, and continuously improve CI/CD pipelines with reliable verification stages, clear failure signals, recovery paths, and rollback strategies, including automation hooks that support AI-enabled development workflows without weakening quality gates.
Own application-level networking and infrastructure concerns, including network configuration, access controls, and connectivity required to support development and production environments, including secure connectivity for AI and LLM-backed services.
Own infrastructure and networking concerns, including the configuration and troubleshooting of site-to-site VPNs, firewall rules, and secure connectivity required for county-level integrations and remote access.
Perform regular access analysis across all systems, managing secrets, credentials, and IAM roles to ensure strict adherence to security best practices, including secure handling of AI provider credentials and service tokens.
Proactively support compliance requirements (such as SOC 2 and HIPAA) by maintaining auditable operational practices and generating technical evidence and reports for software and security audits, including traceability of AI and LLM service usage where required.
Enforce security posture through proactive patching, encryption, and vulnerability management across web servers, load balancers, data stores, runtime dependencies, and AI integration surfaces.
Partner with software engineers during deployments and operational work to build shared understanding and enable safe, independent troubleshooting.
Deploy, manage, and scale web and application servers, load balancers, queues, and caches through automated, repeatable workflows, and provide robust platform primitives for AI-enabled services and internal engineering automation.
Identify, prioritize, and deliver improvements that reduce operational risk, remove bottlenecks, improve efficiency, increase delivery confidence, strengthen engineering throughput, and improve cost visibility across both cloud and AI usage.
Document systems and processes with a focus on explaining both how they work and why, including clear runbooks and operational standards.
Take proactive ownership of workload while ensuring strong coordination and transparency across the team, and coach engineers on practical use of platform, infrastructure, and AI-enabled engineering tools, patterns, and guardrails.
Perform other job-related duties as assigned to support departmental goals and continuous improvement initiatives.

Benefits

Medical (includes H.S.A. option with employer contribution), dental, and vision insurance
Short- and long-term disability
Company-paid basic life insurance
401(k) with 4% company match and immediate vesting
Free financial education and consultation
Wellness program that helps you earn lower premiums
Robust EAP that includes free therapy sessions, lifestyle coaching, legal/ID theft services, and more
12 weeks of fully paid parental leave
Up to $5,000 adoption fee reimbursement
$500 wellness reimbursement after 60 days of employment
Generous PTO policy and 10 company-paid holidays
Company-paid cell phone plan