Infrastructure / Site Reliability Engineer (SRE)

Solvd

About The Position

Solvd Inc. is a rapidly growing AI-native consulting and technology services firm delivering enterprise transformation across cloud, data, software engineering, and artificial intelligence. We work with industry-leading organizations to design, build, and operationalize technology solutions that drive measurable business outcomes. Following the acquisition of Tooploox, a premier AI and product development company, Solvd now offers true end-to-end delivery, from strategic advisory and solution design to custom AI development and enterprise-scale implementation. Our capability centers combine deep technical expertise, proven delivery methodologies, and sector-specific knowledge to address complex business challenges quickly and effectively.

We are looking for a talented Infrastructure / Site Reliability Engineer (SRE) to join our engineering team. In this role, you will be the driving force behind our cloud infrastructure scalability, reliability, and deployment automation. You are an engineer who views infrastructure as a software problem. Instead of manually configuring servers, you build automated pipelines, treat Infrastructure as Code (IaC) as a religion, and architect self-healing cloud deployments. You will collaborate closely with development teams to bridge the gap between code generation and production stability.

Requirements

3+ years of experience in an SRE, DevOps, or Cloud Infrastructure role.
Deep production experience with at least one major cloud provider (AWS, GCP, or Azure).
Strong proficiency with Terraform and hands-on experience managing production Kubernetes clusters.
Solid understanding of Linux internals, networking, storage, and security fundamentals.

Nice To Haves

Strong coding skills in Go or Python.
Good grasp of VPC architecture, DNS, load balancers (ALB/NLB), and Content Delivery Networks (CDNs).
Familiarity with managing cloud-hosted databases (PostgreSQL, RDS) and caching layers (Redis, Memcached).

Responsibilities

Design, provision, and maintain secure, scalable, and highly available cloud infrastructure (primarily AWS, GCP, or Azure).
Write and maintain clean, modular Terraform or OpenTofu configurations so that all infrastructure is fully auditable and reproducible.
Manage and optimize containerized environments using Docker and Kubernetes (EKS/GKE), focusing on resource allocation and scaling policies.
Build, maintain, and secure robust CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins) to support zero-downtime deployments.
Implement modern GitOps workflows (e.g., Argo CD, Flux) to automate application delivery and configuration management.
Develop custom internal tools and automation scripts using Python, Go, or Bash to eliminate toil and repetitive manual tasks.
Design and implement comprehensive observability stacks using tools like Prometheus, Grafana, Datadog, or New Relic.
Conduct chaos engineering, load testing, and bottleneck analysis to ensure system resilience under heavy traffic.
Participate in an engineering on-call rotation and drive root-cause analysis through blameless post-mortems to prevent incident recurrence.

Benefits

Shape real-world AI-driven projects across key industries, working with clients from startup innovation to enterprise transformation.
Be part of a global team with equal opportunities for collaboration across continents and cultures.
Thrive in an inclusive environment that prioritizes continuous learning, innovation, and ethical AI standards.
