Site Reliability Engineer (SRE)

Valstro

Remote

About The Position

Valstro is looking for a Site Reliability Engineer (SRE) to join our team! This person will help ensure the reliability, availability, and performance of our cloud-native trading platform. The role entails building and maintaining infrastructure, automating processes, and working closely with the Development and Platform teams to ensure seamless integration and deployment of the service. The successful candidate will serve as an essential link between the wider organization, executive leadership, and external vendors. Their responsibilities will include ensuring system reliability, building and maintaining monitoring solutions for both production and UAT systems, automating operational tasks, responding to incidents, and continuously improving systems and processes. This is a remote position that will report to the Site Reliability Lead.

Requirements

3+ years of experience supporting production-level systems.
Strong experience in site reliability engineering, systems engineering, or a related field.
Proficiency in cloud-based infrastructure (e.g., AWS, Azure, or Google Cloud).
Experience with monitoring and logging tools (e.g., ELK, LGTM, Prometheus, Datadog).
Expertise in automation and scripting (e.g., Golang, Python, Bash, Terraform).
Knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
Ability to communicate effectively and liaise between stakeholders, including internal teams, executive management, and external vendors.
Strong troubleshooting and problem-solving skills.
Experience in establishing and enhancing reliability engineering practices and processes.
Capable of operating effectively in a dynamic organizational environment with high delivery and quality expectations.
A recent bachelor's degree in Computer Science, Software Engineering, or a related field.
Knowledge of SRE practices.
Knowledge of observability and its tooling, particularly the Grafana stack.

Nice To Haves

Fintech experience.

Responsibilities

Act as a key intermediary between engineering, executive leadership, and external vendors.
Ensure the reliability, availability, and performance of our cloud-based trading solutions.
Develop and maintain monitoring solutions to track system performance and reliability.
Automate operational tasks to improve efficiency and reduce manual intervention.
Collaborate with development teams to ensure seamless integration and deployment.
Respond to incidents and troubleshoot issues to minimize downtime.
Continuously improve systems and processes to enhance reliability and performance.
Participate in on-call rotations to provide 24/7 support for critical systems.