Site Reliability Engineer (SRE) / Operations Engineer

ECS Tech Inc | Arlington, VA
$145,000 - $180,000 | Remote

About The Position

ECS is seeking a Site Reliability Engineer (SRE) / Operations Engineer to work in our Arlington, VA office or remotely. The SRE/Ops Engineer is responsible for ensuring the reliability, availability, performance, and operational efficiency of enterprise applications and supporting infrastructure. This role bridges software engineering and IT operations by applying engineering practices, automation, and monitoring to maintain stable systems and rapidly resolve operational issues. The SRE/Ops Engineer works closely with development, security, and platform teams to support system deployments, manage incidents, improve observability, and implement resilient architectures that support continuous delivery and mission-critical operations.

Requirements

  • U.S. Citizenship
  • Ability to obtain, at minimum, a Public Trust suitability designation
  • Bachelor's degree in Computer Science, Engineering, Information Technology, Information Systems, or a related field
  • Minimum of seven (7) years of related experience

Responsibilities

  • Maintain the reliability, availability, and performance of production systems and cloud-based services.
  • Monitor system health using observability tools (metrics, logs, and tracing) and respond to alerts and incidents.
  • Participate in incident response, troubleshooting, and root cause analysis to restore service and prevent recurrence.
  • Implement automation and infrastructure-as-code to improve operational efficiency and reduce manual intervention.
  • Support deployment pipelines and release management processes to enable reliable and repeatable software delivery.
  • Collaborate with development teams to improve application resiliency, scalability, and operational readiness.
  • Develop and maintain operational runbooks, standard operating procedures, and system documentation.
  • Manage system capacity planning, performance tuning, and scaling strategies.
  • Ensure systems comply with security, compliance, and organizational operational standards.
  • Contribute to continuous improvement initiatives by identifying opportunities to reduce operational risk and technical debt.