Site Reliability Engineer
MoonPay · Posted: August 29, 2023 · Remote
About the position
MoonPay is seeking a Site Reliability Engineer (SRE) to join its team. The SRE will be responsible for providing a resilient and secure platform for deploying applications and services. Near-term work includes improving infrastructure, building monitoring mechanisms, load testing, and maintaining Kubernetes clusters. In the long term, the SRE will implement new technologies, automate processes, track metrics, and collaborate with other engineering functions. Strong systems administration skills and experience in platform engineering/SRE are required for this role.
Responsibilities
- Provide a resilient, secure, production-ready platform for deploying applications and services in a self-serve, repeatable manner.
- Support product delivery and operational teams by surfacing data from the production environment and driving meaningful change based on insights.
- Improve the maintainability of infrastructure as code.
- Build dashboards, monitoring, and alerting mechanisms using Datadog.
- Conduct load testing and performance tuning of production services.
- Manage the lifecycle of Kubernetes clusters and maintain them.
- Implement new technologies on top of Kubernetes to ensure scalability.
- Develop and integrate automation solutions to improve reliability and facilitate recovery.
- Design and track metrics for site uptime and performance.
- Own deployment pipelines and continuously improve monitoring and alerting capabilities.
- Collaborate with other engineering functions to provide timely feedback.
- Support Engineering in delivering better software, faster, and more safely.
Requirements
- Strong systems administration skills
- Knowledge of the difference between a container and a virtual machine
- Familiarity with Linux terminal
- Platform engineering/SRE experience at leading startups or fast-growing tech companies
Benefits
- Opportunity to work with a leading web3 infrastructure company
- End-to-end solutions for payments, smart contract development, and digital asset management
- Opportunity to work with iconic brands
- Increase resiliency and reliability of PaaS solution
- Cross-training and upskilling opportunities
- Experience in regulated industry
- Experience in monitoring and logging of complex systems at scale
- Collaboration with different teams
- Opportunity to forge and own reliability and recovery processes
- Understanding of complex reliability structures, theories, principles, and best practices
- Experience with JavaScript codebases and frameworks (e.g., TypeScript, Node.js, React)
- Emphasis on culture add and diversity experience
- Interview process with multiple stages
- Accommodations available for interview process