About The Position

We are seeking a hands-on Site Reliability Engineer to join our team. You will help architect, implement, and maintain our platform and our pipelines. You’ll partner closely with development, operations, and product teams to maintain and expand our PaaS offering.

Requirements

  • Are an SRE who understands how to operate modern distributed data systems on Kubernetes so that they are extremely reliable and perform predictably.
  • Have strong analytical and problem-solving skills.
  • Have a high degree of autonomy along with the teamwork skills to function in a distributed team environment.
  • Have experience with (multiple) cloud service offerings, specifically from an operational perspective (we operate on GCP, AWS, and Azure today).
  • Have a passion for automating the complexities of orchestrating and running multi-tenant cloud application services.
  • Are accustomed to collaborating with business owners and understanding diverse business requirements.
  • Have five or more years of experience in distributed systems architecture and runtime requirements.
  • Are a voracious learner, ready to take on new technologies and techniques quickly and constantly.
  • Are skilled at working with people, including within a self-organized, lean, and agile team, to mitigate project risks, manage effort, and ensure quality.
  • Are dedicated to best practices such as infrastructure as code, automated testing, code reviews, CI/CD, and GitOps.
  • Are biased towards action on tough problems and issues, and focused on your customer’s success.
  • Are an agent of change, constantly learning and seeking better outcomes.
  • Are familiar with many of the supporting technologies we use, such as Terraform, Crossplane, FluxCD, GitOps, Helm, Prometheus, Grafana, actors, and service mesh frameworks.
  • Are experienced with complex and secure networking environments, including encryption keys and TLS.

Nice To Haves

  • Have knowledge of the Akka libraries for distributed systems, including Akka clustering.
  • Have supported SaaS/PaaS systems.
  • Have excellent written and verbal communication skills in English, at a minimum.
  • Have at least some exposure to policy-as-code and/or admission controllers.

Responsibilities

  • Develop and extend software to monitor and improve end-to-end platform performance, identify runtime deficiencies, find potential failures, and fix production issues in a fully managed multi-cloud environment.
  • Participate in the on-call rotation and incident resolution.
  • Build deep, full-stack knowledge of our platforms and applications.
  • Work to simplify and automate deployment processes and runtime operations, and provide non-disruptive releases.
  • Help create and maintain an environment that provides security and privacy for our customers' data.
  • Maintain application reliability and uptime SLAs throughout the application lifecycle using programmatic self-healing and software automation.
  • Travel occasionally to meet with the rest of Akka's technical team.
  • Create and implement security policies as code to automate and enforce security controls.
  • Create comprehensive documentation for all configurations, processes, and procedures, and provide training and knowledge sharing for other team members.

Benefits

  • Competitive salary with performance-based incentives.
  • Remote-first, flexible work environment.
  • Comprehensive health and wellness benefits.
  • Opportunities for professional development and continuous learning.
  • Collaborative, inclusive, and innovative company culture.
© 2024 Teal Labs, Inc