Software Engineer III (Cloud SRE Engineer)

Cleerly

Remote

About The Position

We are seeking a highly skilled, experienced Site Reliability Engineer (SRE) to join the core technical team of our growing next-generation, enterprise-level imaging platform. In this critical role, you will focus primarily on the health and integrity of our systems, ensuring repeatable deployments and the stability of new product streams within AWS. Your responsibilities will include implementing and maintaining observability, incident readiness, and secure connectivity to third-party applications. You will collaborate closely with Product, Program, Software, and SQA Engineers to build the next-generation product for heart disease diagnosis, and you will provide IT support to Cleerly's development partners and customers.

The ideal candidate has strong problem-solving skills and can troubleshoot and resolve system and network issues promptly. We are seeking an individual who is a team player with an appetite for hands-on work, a highly motivated and detail-oriented self-starter, and someone who demonstrates strong ownership, accountability, and commitment to high-quality deliverables.

Requirements

Experience & Scale: 6–10+ years of professional experience running and managing production services on AWS.
AWS Mastery: Deep understanding of core AWS fundamentals, including VPC networking, IAM, KMS, security groups, and routing.
Infrastructure as Code (IaC): Expertise with Infrastructure-as-Code (Terraform, CDK, or CloudFormation) and reliable environment replication.
Container Platforms: Experience operating and managing container platforms (EKS/ECS) and/or scalable managed services.
CI/CD Automation: Proven ability to design and automate comprehensive CI/CD pipelines (builds, tests, deploys, and rollbacks).
Observability: Deep knowledge of metrics, logs, and traces, along with setting SLOs, configuring robust alerting, and managing structured incident response processes.
HA/DR: Practical High Availability (HA) / Disaster Recovery (DR) thinking, including backup strategies, multi-AZ patterns, and conducting failure drills.
Security Posture: Strong security-by-default posture, including expertise in secrets handling, key rotation, and the principle of least privilege.
Performance & Cost: Acute performance and cost awareness, including effective use of tagging, budgeting, right-sizing, and autoscaling.
Partnership: Proven ability to partner with engineering and security teams to achieve rapid deployment goals without compromising system reliability.
Alerting & Reporting: Prometheus, Grafana, AWS CloudWatch Insights.
Deployment: GitHub, containers (Kubernetes, Docker).
Encryption: Encryption technologies for data at rest and in transit.
IaC: Terraform, CloudFormation.
Integration: VPN, virtualization, edge computing.
Pipeline Orchestration: Apache Airflow.
Scripting: Python, Bash.
Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering, DevOps, or a similar role.

Nice To Haves

SDLC for SaMD: Expertise in the Software Development Life Cycle (SDLC) specifically for software medical devices (SaMD).
Regulated Environment: Deep experience operating in regulated environments, managing audit logs, strict change control, and comprehensive evidence collection.
Medical Imaging Standards: Working knowledge of essential medical imaging standards, including DICOM and HL7.
Cybersecurity & Data Privacy: Proven experience developing comprehensive cybersecurity measures and implementing robust data protection and privacy controls across cloud infrastructure.
Secure Connectivity: Experience designing and implementing secure connectivity patterns for healthcare customers, including PrivateLink, VPN, and Direct Connect.
Container Security: Expertise in container supply-chain security, including SBOM (Software Bill of Materials), signing, scanning, and runtime policy enforcement.
AWS Certification: AWS Certified SysOps Administrator – Associate or Professional.
Kubernetes Certification: Certified Kubernetes Administrator (CKA).

Responsibilities

Cloud Environment Buildout: Stand up and harden the new Hub cloud environment and deployment pipeline, ensuring reliability, security, and repeatability.
Infrastructure Management: Design, develop, and manage cloud infrastructure using AWS services, Terraform (Infrastructure as Code), and Docker containers.
System Integrity: Use strong system administration and network engineering skills to ensure the reliability, scalability, and performance of all platform systems.
Own Observability & Incidents: Own observability and incident readiness end-to-end, including third-party connectivity patterns, runtime guardrails, and defining upgrade strategies (canary/rollback). This ensures the platform can scale safely as new AI integrations are added.
Drive DevOps Automation: Implement DevOps methodologies and tools, facilitating Continuous Integration (CI), Continuous Delivery (CD), and the automation of infrastructure management tasks.
Reduce Toil: Develop and maintain automation tools to proactively reduce manual operational tasks (toil).
Security Maintenance: Maintain system and network security at all times by implementing and enforcing appropriate security measures across the platform.
