Sr. Software Engineer II (DevOps)

Hi Marley • Boston, MA

50d • $119,000 - $221,000 • Hybrid

About The Position

Hi Marley is seeking a Sr. Software Engineer II (DevOps) to join their team. This role will focus on building and scaling the infrastructure for their core platform and rapidly growing agentic AI services. The position is at the intersection of cloud infrastructure, AI operations, and platform engineering, aiming to ensure reliable operation at enterprise scale while deploying autonomous AI agents in regulated insurance workflows. The engineer will also be responsible for setting infrastructure standards, driving technical decisions, and mentoring less experienced engineers. This role requires 2-3 days per week in the Boston office.

Requirements

6+ years of DevOps/SRE/Platform Engineering experience
2+ years of experience building or operating AI/ML infrastructure (model serving, inference, LLM orchestration, or agentic systems)
Bachelor’s degree in Computer Science, Engineering, or equivalent experience
Experience building and operating infrastructure for both traditional and AI/ML workloads at a SaaS company
Deep experience with AWS cloud services (ECS, Lambda, SageMaker, Bedrock, S3, DynamoDB, Redshift, or equivalent)
Strong infrastructure-as-code skills with Terraform, including an understanding of how to manage state, modules, and multi-environment configurations
Understanding of data infrastructure: pipelines, warehousing, ETL/ELT, and how to support analytics at scale
Experience with compliance-sensitive environments and understanding of audit trails, access governance, and change management
Comfortable operating in a fast-moving environment where AI capabilities are evolving rapidly and infrastructure decisions have regulatory implications
Strong proficiency in at least one programming language (Python, Go, TypeScript, or similar)

Nice To Haves

Naturally step up to lead technical conversations, and people across teams seek you out when infrastructure decisions get complicated
Think about observability as more than dashboards — you care about data integrity, SLOs, error budgets, and catching silent failures
Communicate well with both engineering and non-technical stakeholders
Track record of leading cross-team technical initiatives and mentoring engineers on infrastructure and operational best practices
Experience with container orchestration (ECS, EKS)
Experience with monitoring and observability platforms (Datadog, CloudWatch)
Experience with data infrastructure (Redshift or similar data warehousing; Airflow, dbt, Dagster, or similar pipeline tools) is a strong plus
Experience in regulated industries (insurance, financial services, healthcare) is a strong plus
A genuine curiosity about AI and emerging technologies, paired with the judgment to apply them thoughtfully and responsibly

Responsibilities

Design and operate cloud infrastructure on AWS that supports both our core SaaS platform and our agentic AI services, ensuring reliability, scalability, and cost efficiency
Build and maintain AI/ML infrastructure and monitoring for LLM-powered agentic services
Establish and enforce infrastructure-as-code standards using Terraform, defining the patterns other engineers follow for environment parity, drift detection, and automated compliance validation
Implement observability beyond availability — data integrity monitoring, SLO frameworks with error budgets, and automated regression detection for both platform and AI services
Build deployment automation including pre-deployment verification, migration script validation, and codified rollback procedures to eliminate human-memory dependencies
Support big data infrastructure: data pipelines, warehousing (Redshift), and analytics tooling that enables reporting, BI, and AI training workflows
Implement security and compliance controls for AI workloads operating in regulated carrier environments — including audit logging, access governance, and configuration management
Drive environment parity across all infrastructure with automated drift detection and remediation
Improve disaster recovery capabilities: documented and rehearsed DR procedures, defined RTO/RPO by service tier, and tested recovery runbooks
Lead architecture reviews for new services, integrations, and AI agent deployments — partnering with engineering, product, and security to ensure infrastructure decisions are sound before they ship
Innovate on developer experience: reduce friction in testing environments, CI/CD pipelines, and local development workflows
Act as a technical anchor for infrastructure decisions across teams — providing clarity when requirements are ambiguous and helping the organization converge on consistent, scalable approaches

Benefits

Equity grants for all employees
A 4% matching 401(k) program
Medical, dental, vision, disability, and life insurance coverage for employees working 30+ hours per week
Monthly wellness stipend
Paid parental leave
A flexible vacation policy
