Senior Site Reliability Engineer

LineVision, Inc. | Boston, MA
Hybrid

About The Position

Build the foundation of reliability that powers the electric grid of tomorrow. We are seeking a Senior Site Reliability Engineer to establish LineVision's dedicated SRE practice and ensure our grid intelligence platform delivers the exceptional reliability our utility customers depend on. If you want to own the development of critical systems observability, deployment processes, and incident response protocols that directly impact grid operations, join us at LineVision, named one of Built In Boston's Best Places to Work!

Requirements

  • AWS Expertise: Strong experience with core AWS services including EC2, RDS, Lambda, and networking/VPC configuration for production environments
  • Observability & Monitoring: Hands-on proficiency with tools like Datadog, Prometheus, Grafana, or CloudWatch for instrumenting distributed systems
  • Infrastructure as Code: Experience with Terraform, CloudFormation, or Pulumi for managing and versioning infrastructure
  • Programming: Python and TypeScript experience for automation, tooling, and system instrumentation
  • SLO/SLA Frameworks: Demonstrated experience establishing Service Level Objectives and tracking error budgets
  • Critical Thinking: Lead problem-solving efforts around complex reliability challenges, consistently applying critical thinking to identify root causes and prevent future incidents
  • Taking Ownership: Lead reliability projects with minimal supervision, taking full ownership of SRE practice development and system observability outcomes
  • Stakeholder Management: Manage relationships across engineering, platform, and support teams, providing clear updates on reliability metrics and leveraging influence to align on SRE priorities
  • Delivering Innovative Solutions: Lead implementation of modern SRE practices, inspiring teams to think creatively about reliability challenges in the context of utility infrastructure

Nice To Haves

  • Background in energy, utility, or critical infrastructure sectors where reliability directly impacts public services
  • AWS certifications demonstrating deep platform expertise
  • Experience with security compliance frameworks (NERC CIP, ISO 27001, SOC 2) relevant to utility operations
  • Track record of building SRE practices from the ground up in fast-growing technical organizations

Responsibilities

  • Establish and maintain Service Level Objectives (SLOs) and observability frameworks for critical services supporting utility grid operations
  • Implement CI/CD guardrails including canary deployments, automated rollbacks, and pre-production validation to improve deployment reliability
  • Develop comprehensive incident response procedures with documented runbooks, escalation paths, and blameless post-incident review processes
  • Partner with platform, engineering, and customer support teams to instrument systems and build reliability capabilities where they deliver maximum impact
  • Design and implement monitoring dashboards tracking SLA compliance, reliability metrics, and error budgets
  • Complete a comprehensive assessment of LineVision's current infrastructure, identifying critical services requiring immediate observability improvements
  • Establish baseline SLOs for top-priority services and implement initial monitoring dashboards in partnership with platform and support teams
  • Document current deployment processes and incident response procedures, identifying gaps and quick-win improvements
  • Deploy a production-ready observability framework covering all critical customer-facing services, with alerts configured for key reliability signals
  • Implement CI/CD improvements including automated testing gates, canary deployments, and rollback capabilities for core platform services
  • Lead 3+ blameless post-incident reviews, establishing templates and processes that become standard practice across engineering
  • Achieve measurable improvements in deployment success rates and mean time to recovery (MTTR) through implemented SRE practices
  • Build strong cross-functional partnerships resulting in proactive reliability improvements identified through error budget monitoring
  • Establish LineVision's SRE practice as a recognized capability, with documentation, runbooks, and processes that can scale with company growth

Benefits

  • Impact. Your talent, time, and energy will critically impact our success in accelerating our mission of providing utilities with grid intelligence to enable affordable, reliable power.
  • Ownership. You will hold broad responsibilities with high autonomy and trust in a communicative, collaborative, and fast-paced environment.
  • Flexibility. You will be empowered to maintain work-life balance with trust-based PTO and a flexible work schedule.
  • Real-World Innovation. You will join a motivated and high-performing team working with cutting-edge, patented technology to help overcome key obstacles to meeting the demands of an AI-powered future.