Site Reliability Engineer

Impact.com • Victoria, BC

Hybrid

About The Position

As a Site Reliability Engineer, you'll be the champion of performance and stability for our core application ecosystem. Working closely with our Java and C# engineering squads, you'll ensure that our high-frequency data ingestion pipelines and customer-facing applications meet strict performance and error rate benchmarks. Your mission is to bridge the gap between code and infrastructure, implementing OpenTelemetry (OTel) standards across the stack to provide deep visibility into how we interact with external social APIs and how our internal services communicate.

Requirements

Strong proficiency in Java or C#. You are comfortable reading, debugging, and instrumenting application code.
Hands-on experience with OpenTelemetry, including auto-instrumentation, manual spans, and collector configuration.
Deep experience with the Grafana ecosystem (Prometheus, Tempo, Loki) or similar distributed tracing platforms (Jaeger, Honeycomb, Datadog).
Experience working with high-volume REST/Graph APIs and an understanding of OAuth flows, rate limiting, and webhooks.
Solid understanding of how Java/C# applications interact with the underlying infrastructure.
Ability to prioritize tasks in a high-velocity environment, with a focus on building "self-healing" systems rather than applying manual fixes.
B.S. in Computer Science, or equivalent practical experience in a high-scale production environment.

Nice To Haves

Affiliate & Partnerships Industry Fundamentals Certification by PXA

Responsibilities

Become the architect of our observability pipeline. Implement and maintain OpenTelemetry instrumentation across Java and C# services to ensure high-fidelity traces, metrics, and logs.
Build integration tests with third-party social APIs and set up the appropriate monitoring and alerting systems to ensure high availability and reliability.
Build and enhance Grafana dashboards and alerting systems that track the "Golden Signals" (Latency, Traffic, Errors, Saturation), tailored specifically to JVM and .NET environments.
Drive root-cause analysis (RCA) for complex distributed system failures and contribute to remediations through code optimizations or infrastructure adjustments.
Leverage tracing data to identify bottlenecks in cross-service communication and optimize the path of data from social APIs to our internal stores.
Debug issues across the entire stack, from containerized application code (Java/C#) down to network calls and cloud resource utilization.
Analyze application usage patterns to inform scaling decisions, ensuring we handle social data bursts without compromising stability or overspending on cloud costs.

Benefits

Health and prescription coverage
Vision and dental care
Virtual health care
Out-of-country medical coverage
Life insurance
Short-term disability
Long-term disability
Health Care Spending Account
Two different Employee Assistance Programs
Responsible PTO policy
Mental health and wellness benefit (up to 12 fully covered therapy/coaching sessions per year, with additional dependent coverage)
Monthly gym reimbursement policy
Restricted Stock Units (RSUs) with a 3-year vesting schedule
Free Coursera subscription
PXA courses
26 weeks of fully paid leave for the primary caregiver
13 weeks of fully paid leave for the secondary caregiver
Technology stipend
Monthly allowance to cover internet expenses