Kraken, posted 4 months ago
Senior

As a Senior Site Reliability Engineer (SRE) specialized in Data Infrastructure, you will collaborate closely with diverse cross-functional teams to conceive, execute, and oversee the foundational data infrastructure that empowers our array of applications and services. You will play a pivotal role in upholding the reliability, scalability, and efficiency of our robust Data platform.

  • Design the data governance mechanisms that ensure our lakehouse is easy to interact with, secure and in compliance with all applicable regulations.
  • Implement the infrastructure we use to ingest our data, store it, catalog it with the right metadata and capture its lineage.
  • Provide a state-of-the-art suite of BI tools for multiple teams within the company.
  • Guarantee the availability, high performance, scalability and cost efficiency of our data platform.
  • Implement self-service data infrastructure solutions that support the needs of 10+ business units and over 100 engineers and data analysts.
  • Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform.
  • Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments.
  • Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure.
  • Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues.
  • Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments.
  • Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC).
  • Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information.
  • Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration.
  • Implement effective incident response procedures and participate in on-call rotations.
  • Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions.
  • Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement.
  • Support AI/ML teams with their infrastructure requests.
  • Proven experience (5+ years) working as a Site Reliability Engineer, Infrastructure Engineer, Data Infrastructure Engineer, or in a similar role, with a focus on data infrastructure and security.
  • Experience with maintaining real-time data processing technologies, such as Kafka and Flink clusters and Debezium instances.
  • Working experience managing hybrid multi-tenant cloud systems, particularly on AWS.
  • Experience with Infrastructure as Code tools such as Terraform, Terragrunt and Atlantis.
  • Experience with containerization and orchestration tools, particularly Kubernetes, Nomad, and Docker.
  • Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or JVM languages).
  • Experience maintaining data-related technologies: Apache Airflow, Apache Spark, databases, BI tooling.
  • Experience solving data access management issues in a large-scale data lake.
  • Familiarity with CI/CD deployment pipelines and related tools.
  • Strong problem-solving skills and the ability to troubleshoot complex systems.
  • Experience with data-related technologies (databases, data lakes, Airflow, Spark) is a plus.
  • Fully remote work environment.
  • Opportunity to work with a world-class team.
  • Diverse and inclusive company culture.