Site Reliability Engineer

Cogent People Inc. • Columbia, MD

Hybrid

About The Position

Cogent People Inc. is seeking a Site Reliability Engineer to support system reliability, monitoring, and operational stability across environments. The role is responsible for implementing observability and automation practices, supporting production systems, and ensuring system performance and availability. It plays a key role in incident response, root cause analysis, and ongoing system optimization in collaboration with DevOps and development teams. The ideal candidate will bring experience in system monitoring, DevOps practices, and production support, along with the ability to work with cross-functional engineering teams in a fast-paced environment. This position may be contingent upon contract award.

Requirements

Bachelor’s degree in Computer Science, Information Systems, or a related field, or an equivalent combination of education and experience
Experience in system reliability, DevOps, or production support roles
Experience with monitoring, logging, and observability tools
Understanding of incident management and root cause analysis processes
Familiarity with cloud environments and infrastructure concepts
Experience supporting automated deployment or operational workflows
Strong problem-solving and troubleshooting skills
Excellent written and verbal communication skills
Ability to work effectively in fast-paced, production-critical environments
Strong collaboration skills across development and operations teams
Must be a U.S. Citizen, Permanent Resident, or valid EAD holder
Must have lived in the United States for at least 3 of the past 5 years
Must be currently authorized to work in the U.S. without sponsorship

Nice To Haves

Experience with AWS or other cloud platforms
Familiarity with infrastructure-as-code tools (e.g., Terraform or similar)
Experience with tools such as Splunk, Datadog, Prometheus, or similar observability platforms
Experience with CI/CD pipelines and DevOps automation tools
Prior experience supporting enterprise-scale or regulated environments
Knowledge of application performance tuning and distributed systems behavior

Responsibilities

Support system reliability, monitoring, and operational stability across environments
Implement and maintain observability practices, including monitoring, logging, and alerting
Contribute to automation efforts that improve system reliability and operational efficiency
Participate in incident response activities and production support
Perform root cause analysis for system issues and outages
Support performance optimization and tuning of applications and infrastructure
Work with DevOps and development teams to maintain production readiness
Contribute to continuous improvement of deployment and operational processes
Collaborate across engineering teams to support stable and scalable systems