Prove - posted 2 months ago
$165,000 - $180,000/Yr
Full-time • Senior
251-500 employees

We are seeking an experienced Senior Site Reliability Engineer to join our Platform Engineering team. In this role, you will be instrumental in designing, implementing, deploying, and maintaining complex, highly available, scalable, and reliable systems, leveraging automation, effective monitoring, and infrastructure-as-code. You will work closely with our application engineering teams to ensure our services meet the highest standards of reliability, performance, and security.

  • Design and implement comprehensive observability solutions across our infrastructure and within applications
  • Lead the initiative to establish a company-wide instrumentation standard based on OpenTelemetry wide events
  • Build advanced monitoring dashboards that provide real-time visibility into system health and performance
  • Establish metrics, logging, and tracing systems that enable quick identification and resolution of issues
  • Create alerting thresholds and automated responses based on service level objectives (SLOs)
  • Drive a culture of observability throughout the engineering organization
  • Lead Kubernetes cluster management, optimization, and scaling initiatives
  • Design and implement infrastructure-as-code deployments for container-based applications
  • Optimize container resource allocation and utilization
  • Build automated deployment pipelines that ensure consistent, reliable releases
  • Establish best practices for containerization and orchestration across teams
  • Design, build, and maintain scalable cloud infrastructure on AWS
  • Implement infrastructure-as-code using tools such as Terraform
  • Automate routine operational tasks to reduce toil and improve efficiency
  • Ensure infrastructure security compliance and implement least-privilege access controls
  • Optimize cloud resource utilization and costs
  • Integrate observability-driven alerts with our incident management systems
  • Lead incident response efforts during service disruptions
  • Conduct thorough post-incident reviews and implement preventative measures
  • Use observability data to perform root cause analysis and system improvements
  • Document incidents, responses, and lessons learned to build organizational knowledge
  • Identify and resolve performance bottlenecks across the technology stack
  • Conduct capacity planning and scaling exercises to meet future demands
  • Implement auto-scaling solutions based on performance metrics
  • Optimize database performance and query efficiency
  • Design and implement application stress testing methods and systems
  • 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Expert knowledge of observability platforms and practices (OpenTelemetry, Prometheus, Grafana, Jaeger, ELK stack or Splunk, etc.)
  • Experience with Kubernetes and container orchestration
  • Strong experience with infrastructure-as-code tools (Terraform, CloudFormation, Pulumi)
  • Proficiency in at least one programming language (Go, Python, Java)
  • Deep understanding of cloud platforms (AWS, GCP, or Azure)
  • Experience implementing and managing CI/CD pipelines
  • Knowledge of network architecture and security principles
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • Experience with distributed systems and microservice architectures
  • Knowledge of security best practices and compliance requirements
  • Experience with database administration (PostgreSQL, MySQL)
  • Familiarity with service mesh technologies (Istio, Linkerd)
  • Contributions to open-source projects
  • AWS/GCP certifications
  • Experience in the identity verification or financial technology industry
  • Hands-on experience with OpenTelemetry standards and technology
  • Application development experience
  • Competitive salaries & Bonus Plan (for eligible roles) and Equity Plan
  • Modern Health for financial, mental, and physical wellness
  • 401(k) Retirement Plan & Match (US Offices) and Local Country Pension (International Offices)
  • Unlimited Vacation and Flexible hours
  • Comprehensive medical benefits for you and your family
  • Emotional & Physical Wellness – Access to wellness services (EAP & Prove Well-Being Reimbursement)
  • Bottomless snacks & beverages for certain office locations
  • Daily GrubHub stipend for lunch if coming into the office (US Offices)
  • A great place to work and connect with other talented Provers like yourself!