As a Site Reliability Engineer, you will be responsible for operational excellence and incident management, maintaining and monitoring production systems for availability, latency, and performance. You will lead incident response, including communication, resolution, and postmortem documentation. You will design and implement health checks, alerting systems, and automated remediation workflows, drive root cause analysis, and implement permanent fixes for recurring issues.
You will set up and maintain full observability stacks (logging, metrics, tracing) using tools such as Prometheus, Grafana, Datadog, OpenTelemetry, or ELK. You will analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement, and conduct post-incident reviews whose insights inform future engineering investments.
You will tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency; work with developers to evolve the architecture and improve system throughput, latency, and stability; and optimize PostgreSQL performance, queries, and maintenance strategies.
Finally, you will design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI; automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency; and standardize infrastructure-as-code practices across environments.
Job Type: Full-time
Career Level: Senior
Education Level: No Education Listed