Senior Cloud Site Reliability Engineer

Solace Corporation•Bangalore, IN

20h•Hybrid

About The Position

Enterprise AI is moving from pilots to production, and the constraint is no longer the model: it's the data. Agents are only as good as what they can sense, trust, and act on in the moment, and real-time, event-driven data is becoming the foundation every serious AI system runs on. Solace is the leading platform for the enterprise AI era. Established enterprises worldwide, including RBC Capital Markets, Bosch, Heineken, PSA Singapore, United Airlines, Schwarz Group, and hundreds more, have built their business around Solace to enable intelligent, real-time experiences, modernize their application and integration landscape, and create seamless digital journeys for their customers, partners, and employees. So, the next time you drive a car, order furniture online, fly in a plane, or check your bank balance on your phone, your positive experience could be a direct result of our technology, and your hard work!

About The Role

This position is for a Senior Cloud Site Reliability Engineer. You will be responsible for the daily operations of Solace Cloud, our market-leading SaaS offering, across leading cloud providers and platforms such as Amazon Web Services, Microsoft Azure, Google Cloud Platform, and Kubernetes.

Requirements

Proven expertise with public cloud provider (AWS, Azure, GCP) services and features
Proven expertise with managed Kubernetes platforms such as Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE)
Hands-on experience with monitoring tools such as Datadog, Kibana, and Prometheus
Hands-on experience with infrastructure automation using tools such as Terraform and AWS CloudFormation
Hands-on expertise in debugging production alerts
Expert-level understanding of Linux Operating Systems
Programming experience in languages such as Groovy, Python, and Go
Certified Kubernetes Administrator
Certified Cloud Administrator (AWS, Azure, or GCP)

Nice To Haves

Highly technical, excited by technology, and eager to stay up to date in a rapidly evolving environment.
Expert-level knowledge of cloud networking solutions
Demonstrated ability to debug at a system level and resolve incidents in complex cloud-based environments
Expert in site reliability engineering and incident response
A strong communicator who can articulate complex technical issues clearly and concisely, and who is comfortable getting on the phone with customers.
Experienced in SaaS operations and customer-facing technical support

Responsibilities

Ensure that the Solace Cloud services are healthy and reliable, and that SLAs are being met
Design and implement our infrastructure tooling, observability, and automation
Contribute to making production operations more efficient and less error-prone
Handle production incidents in production-grade multi-cloud environments according to industry-standard incident management processes
Process service requests and provisioning requests from customers.
Manage customer escalations and drive resolution in mission-critical, high-impact production environments
Work directly with customers to identify, troubleshoot, and resolve operational issues.
Apply expert Linux and Kubernetes debugging skills to detect operational issues.
Participate in the on-call rotation and provide 24x7 off-hours support