Site Reliability Engineering Lead

LineVision | Boston, MA
$160,000 - $180,000 | Hybrid

About The Position

Hybrid: Boston, MA Headquarters (1-2 days/week in office)

Lead the establishment of LineVision's SRE practice and shape how we deliver grid-grade reliability. We are seeking a Site Reliability Engineering Lead to build our first dedicated SRE function from the ground up, defining the standards, practices, and frameworks that ensure our grid intelligence platform meets the exceptional reliability requirements of utility customers operating mission-critical infrastructure.

This is a high-impact individual contributor role in which you'll be both hands-on in implementing reliability infrastructure and strategic in driving organizational adoption of SRE practices. If you want to combine deep technical expertise with cross-functional influence to establish reliability practices that directly impact grid operations, join us at LineVision, a Built In Boston Best Places to Work company!

Requirements

  • SRE Practice Building: Demonstrated experience establishing SRE practices, including defining SLOs, implementing error budgets, and driving organizational adoption, not just maintaining existing practices
  • Cross-Functional Planning & Influence: Proven ability to plan and sequence complex initiatives across multiple teams, influence without authority, and drive technical standards adoption
  • AWS Expertise: Deep hands-on experience with production AWS services including EC2, RDS, Lambda, VPC configuration, and networking
  • Observability & Monitoring: Expert proficiency with tools like Datadog, Prometheus, Grafana, or CloudWatch for instrumenting distributed systems
  • Infrastructure as Code: Strong experience with Terraform, CloudFormation, or Pulumi
  • Programming & Automation: Python and TypeScript experience for instrumentation, automation, and tooling

Nice To Haves

  • Experience establishing SRE practices at high-scale technology companies and translating them to different organizational contexts
  • Background in energy, utility, or critical infrastructure sectors where reliability directly impacts operations
  • Track record driving technical standards adoption across engineering organizations without direct authority
  • Strategic thinking about balancing quick wins with long-term infrastructure investments
  • Ability to operate at both tactical (hands-on implementation) and strategic (organizational influence) levels

Responsibilities

  • Establish LineVision's SRE practice from the ground up - define Service Level Objectives, implement observability frameworks, and build deployment safety guardrails while driving organizational adoption of SRE methodologies
  • Be hands-on with reliability infrastructure - instrument services, configure monitoring tools, build dashboards, create alerting frameworks, and establish incident response procedures
  • Plan and influence across teams - partner strategically with engineering, platform, product, and customer support to sequence SRE initiatives, balance competing priorities, and drive adoption of reliability standards without direct authority
  • Communicate reliability as business value - translate technical metrics, error budgets, and system health into business impact for both technical teams and executive stakeholders
  • Conduct a comprehensive assessment of current infrastructure and establish baseline reliability metrics for critical services
  • Build relationships with engineering, platform, product, and customer support teams to understand pain points and align on priorities
  • Define initial SLOs for 2-3 highest-priority services and implement foundational monitoring dashboards
  • Establish an incident response framework with escalation paths and a blameless post-incident review process
  • Deploy a production observability framework with instrumented golden signals, actionable alerting, and comprehensive dashboards
  • Implement CI/CD improvements including automated testing gates, canary deployments, and rollback capabilities
  • Partner with engineering teams to operationalize SLOs and use error budgets to inform roadmap decisions
  • Document SRE standards and runbooks that become organizational reference materials
  • Achieve measurable improvements in deployment success rates, MTTR, and system reliability through hands-on implementation and organizational influence
  • Establish an error budget framework that influences product and engineering decision-making across teams
  • Build LineVision's SRE capability as a recognized practice with documented processes and frameworks that scale with company growth

Benefits

  • Impact. Your talent, time, and energy will critically impact our success in accelerating our mission of providing utilities with grid intelligence to enable affordable, reliable power.
  • Ownership. You will hold broad responsibilities with high autonomy and trust in a communicative, collaborative, and fast-paced environment.
  • Flexibility. You will be empowered to maintain work-life balance with trust-based PTO and a flexible work schedule.
  • Real-World Innovation. You will join a motivated, high-performing team working with cutting-edge, patented technology to help overcome key obstacles to meeting the demands of an AI-powered future.