Engineer II

DigitalOcean•Denver, CO

280d•$89,700 - $112,100•Remote

About The Position

We want people who are passionate about designing and operating secure systems at scale. We are looking for an experienced, motivated, adaptable, and empathetic engineer who is comfortable working remotely and has SRE skills. You will report to the Engineering Manager of the Availability team and contribute to the team's mission: improve customer happiness and retention by driving availability and process improvements across the company.

Requirements

Participate in an on-call rotation to respond to critical incidents and ensure the continuous availability of services.
Deeply invested in maintaining and improving the uptime and overall health of cloud infrastructure and applications.
Identify and automate repetitive tasks, infrastructure provisioning, deployments, and monitoring processes.
Contribute to the design and implementation of scalable and resilient systems.
Experience using or administering Linux systems, primarily Ubuntu.
Experience reading, writing, and debugging code in any language, with a focus on adaptability.
Familiarity with incident management and prior experience at a NOC or doing triage.
Familiarity with shell and git.
Familiarity with continuous integration systems and concepts.
Experience leveraging monitoring systems for data-driven outcomes.
Comfortable executing in an asynchronous remote environment.
Transparency, honesty, and openness to constructive feedback.
A desire to work with a respectful and inclusive team.

Nice To Haves

Familiarity with GitHub Actions or Concourse.
Experience with monitoring systems such as Grafana, VictoriaMetrics, Looker, or Elasticsearch.

Responsibilities

Reduce incident duration and frequency by taking an active role in incident management, trimming down bloated processes, and leading cross-team efforts.
Create meaningful metrics and dashboards by defining and refining relevant metrics, building informative dashboards, and establishing effective alerting thresholds.
Automate repetitive monitoring, reporting, and other tasks the Availability team handles.
Spend 1-2 days a week on call, including shift work during set hours on those days.
Drive the mitigation and resolution of incidents, and handle incident reviews and postmortems.
Improve toilsome availability-related processes by tweaking, rewriting, and introducing processes that have company-wide impact.
Identify opportunities for improvement in monitoring, alerting, incident resolution, and other processes around the organization.
Communicate incident status clearly to customers and write clear, accurate reports intended for public consumption.
Embed directly with service teams and dive into unknown codebases written in languages you're not familiar with.
Communicate internally with engineers and respond to Slack messages while keeping up with various streams of conversation.

Benefits

Competitive array of benefits to support overall well-being.
Reimbursement for relevant conferences, training, and education.
Access to LinkedIn Learning's 10,000+ courses.
Salary range of $89,700.00 to $112,100.00, based on market data, relevant years of experience, and skills.
Potential for a bonus based on company and individual performance.
Equity compensation for eligible employees, including equity grants upon hire and participation in the Employee Stock Purchase Program.

What This Job Offers

Job Type

Full-time

Career Level

Mid Level

Industry

Publishing Industries

Number of Employees

1,001-5,000 employees
