Site Reliability Engineer

Impact.com
Victoria, BC
Hybrid

About The Position

As a Site Reliability Engineer, you'll be the champion of performance and stability for our core application ecosystem. Working closely with our Java and C# engineering squads, you'll ensure that our high-frequency data ingestion pipelines and customer-facing applications meet strict performance and error rate benchmarks. Your mission is to bridge the gap between code and infrastructure, implementing OpenTelemetry (OTel) standards across the stack to provide deep visibility into how we interact with external social APIs and how our internal services communicate.

Requirements

  • Strong proficiency in Java or C#. You are comfortable reading, debugging, and instrumenting application code.
  • Hands-on experience with OpenTelemetry, including auto-instrumentation, manual spans, and collector configuration.
  • Deep experience with the Grafana ecosystem (Prometheus, Tempo, Loki) or comparable observability and distributed tracing platforms (Jaeger, Honeycomb, Datadog).
  • Experience working with high-volume REST/Graph APIs and an understanding of OAuth flows, rate-limiting, and webhooks.
  • Solid understanding of how Java/C# applications interact with the underlying infrastructure.
  • Ability to prioritize tasks in a high-velocity environment and a focus on building "self-healing" systems rather than manual fixes.
  • B.S. in Computer Science, or equivalent practical experience in a high-scale production environment.

Nice To Haves

  • Affiliate & Partnerships Industry Fundamentals Certification by PXA

Responsibilities

  • Become the architect of our observability pipeline. Implement and maintain OpenTelemetry instrumentation across Java and C# services to ensure high-fidelity traces, metrics, and logs.
  • Build integration tests with third-party social APIs and set up the appropriate monitoring and alerting systems to ensure high availability and reliability.
  • Build and enhance Grafana dashboards and alerting systems that track the "Golden Signals" (Latency, Traffic, Errors, Saturation) specifically tailored for JVM and .NET environments.
  • Drive root-cause analysis (RCA) for complex distributed system failures and contribute to remediations through code optimizations or infrastructure adjustments.
  • Leverage tracing data to identify bottlenecks in cross-service communication and optimize the path of data from social APIs to our internal stores.
  • Debug issues across the entire stack, from containerized application code (Java/C#) down to network calls and cloud resource utilization.
  • Analyze application usage patterns to inform scaling decisions, ensuring we handle social data bursts without compromising stability or overspending on cloud resources.

Benefits

  • Health and prescription coverage
  • Vision and dental care
  • Virtual health care
  • Out-of-country medical coverage
  • Life insurance
  • Short-term disability
  • Long-term disability
  • Health Care Spending Account
  • Two different Employee Assistance Programs
  • Responsible PTO policy
  • Mental health and wellness benefit (up to 12 fully covered therapy/coaching sessions per year, with additional dependent coverage)
  • Monthly gym reimbursement policy
  • Restricted Stock Units (RSUs) with a 3-year vesting schedule
  • Free Coursera subscription
  • PXA courses
  • 26 weeks of fully paid leave for the primary caregiver
  • 13 weeks of fully paid leave for the secondary caregiver
  • Technology stipend
  • Monthly allowance to cover internet expenses