Member of Technical Staff (SRE)

Cockroach Labs
New York City, NY
1d
Hybrid

About The Position

Category-defining tech. Career-defining work. Lots of tech companies disrupt, but many fail when they try to scale. We're different. CockroachDB makes it easier for companies to build and scale apps. This is how and why we're helping some of the most innovative companies on the planet. We tackle problems head-on and focus on solutions that create lasting impact. Because when our customers win, we all win.

The Role

CockroachDB provides the backbone for storing data on a global scale. Our core mission on the SRE team is to operate a secure and reliable Cockroach Cloud product at scale. We provide consultation, planning, architectural oversight, concrete designs, development, and implementation that improve the resilience, efficiency, performance, and availability of our Cloud Service.

We also take pride in being good on-call engineers. We believe regular reflection on the experience of being on-call can contribute, in the short, medium, and long term, to improvements in the core product, including CRDB itself.

As a Site Reliability Engineer, you’ll help manage and scale our CockroachCloud service, a fully managed global offering of CockroachDB spanning multiple cloud providers. You will oversee our production system, ensuring that we can provide stable and scalable infrastructure as we deliver CockroachDB to our customers.

Requirements

  • Expertise in analyzing, monitoring, and troubleshooting large-scale distributed systems.
  • Experience in software development using one or more of the following: Go, C, C++, Python, Java.
  • Proficiency working with algorithms, data structures, and production troubleshooting.
  • Expertise in working with major cloud providers (AWS, Azure, GCP, etc.) and Cloud APIs.
  • Experience debugging and optimizing code and automating routine tasks.
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • Prior on-call experience, exhibiting a sense of ownership, attention to detail, and urgency.
  • Experience building collaborative relationships with your colleagues.
  • You enjoy being part of the code review process and partnering with your teammates on challenging problems.

Responsibilities

  • Manage the infrastructure for cloud services, including running internal production systems and hosting CockroachDB for our external customers.
  • Design, write, and deliver software and systems to increase product reliability and operational efficiency.
  • Develop custom tools as necessary.
  • Keep a complex system running and solve problems relating to mission-critical services.
  • Design, implement, operate, and troubleshoot the automation and monitoring of production clusters to maximize performance and availability.
  • Drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
  • Participate in an on-call rotation for our production systems and hosted services.

Benefits

  • Stock Options
  • Medical Insurance
  • Vision Insurance
  • Dental Insurance
  • Life and Disability Insurance
  • Professional Development Funds
  • Flexible Time Off
  • Paid Holidays
  • Paid Sick Days
  • Paid Parental Leave
  • Retirement Benefits
  • Mental Wellbeing Benefits

What This Job Offers

Job Type: Full-time
Career Level: Mid Level
Education Level: No Education Listed
Number of Employees: 501-1,000 employees