Site Reliability Engineer (SRE)

Bright Vision Technologies
Edison, NJ
Remote

About The Position

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. As we continue to grow, we’re looking for a skilled Site Reliability Engineer (SRE) to join our dynamic team and contribute to our mission of transforming business processes through technology. This is a fantastic opportunity to join an established and well-respected organization offering tremendous career growth potential.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
  • Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
  • Strong programming skills in at least one of Python, Go, or Java, with the ability to build robust automation and tooling.
  • Deep, hands-on experience operating Linux at scale, including networking, performance tuning, and systems-level troubleshooting.
  • Production experience operating Kubernetes and container-based workloads.
  • Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
  • Hands-on experience designing and operating CI/CD pipelines for both infrastructure and applications.
  • Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
  • Demonstrated experience leading incident response and conducting effective post-incident reviews.
  • Excellent communication and documentation skills.

Nice To Haves

  • Experience defining and operationalizing SLOs and error budgets in real production environments.
  • Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
  • Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
  • Background in capacity planning, performance engineering, or large-scale load testing.
  • Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.

Responsibilities

  • Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services, and use those measures to drive concrete engineering and prioritization decisions.
  • Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed, and ensuring high-quality post-incident reviews that drive lasting improvements.
  • Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling so that operators have rich, actionable visibility into system behavior.
  • Build and maintain robust on-call processes, runbooks, and escalation paths that reduce mean time to detect and mean time to resolve while protecting the well-being of the engineers on rotation.
  • Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages, replacing manual workflows with reliable, auditable automation.
  • Architect and operate large-scale Kubernetes clusters and container-based workloads, including autoscaling, capacity planning, network policy, and integration with service meshes.
  • Design CI/CD pipelines that promote safe, frequent, and observable releases, supported by automated testing, canary deployments, feature flags, and progressive rollout strategies.
  • Lead capacity planning and performance engineering activities, building models that predict growth and stress, and validating those models through load testing and chaos experiments.
  • Partner closely with application development teams to embed reliability practices early in design — including failure-mode analyses, graceful degradation patterns, and dependency hardening.
  • Strengthen the platform’s resiliency through chaos engineering, fault injection, dependency isolation, retries, timeouts, circuit breakers, and well-tested failover paths.
  • Drive continuous improvement of security posture in collaboration with security teams, including patch management, vulnerability remediation, and secure-by-default platform configuration.
  • Contribute to the technical roadmap for reliability tooling, observability platforms, and developer-experience improvements that reduce friction and improve outcomes for engineering teams.
  • Mentor engineers across the organization on SRE practices and foster a strong, blameless culture of operational excellence.

Benefits

  • Competitive base salary commensurate with experience, plus benefits.