SRE Engineer

Spatial Front•Arlington, VA

51d•Hybrid

About The Position

Spatial Front, Inc. (SFI), a two-time USA TODAY Top Workplaces awardee and Washington Top Workplaces honoree, is seeking an SRE Engineer to support our growing team. The SRE Engineer will support the Infrastructure, Production, and Compliance Support (IPCS) team within the enabling rail of a large-scale federal enterprise program. This role is responsible for improving the reliability, availability, performance, observability, and operational resilience of mission-critical systems supporting a complex, multi-environment ecosystem across development, test, training, and production. The SRE Engineer will help standardize and mature reliability engineering practices across a highly integrated environment that includes PeopleSoft-based enterprise applications, Oracle platforms, shared services, DevSecOps pipelines, and reporting/integration services operating in regulated NIPRNET and SIPRNET contexts. This position works closely with platform engineers, DevOps, release management, cybersecurity, test automation, and product teams to reduce operational toil, strengthen production readiness, improve incident response, and support continuous delivery without compromising stability or compliance. As a valued member of the SFI team, you will play a critical role in delivering mission-critical capabilities to our Federal Government customers.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field.
5 years of software engineering experience; 3 years of site reliability engineering, production support engineering, or platform reliability experience for enterprise systems; and 1 year of UNIX/Solaris experience.
Experience supporting enterprise applications in a high-availability, security-conscious, and compliance-driven environment.
Experience creating operational documentation, runbooks, and incident response procedures.
Strong troubleshooting skills across application, middleware, integration, and infrastructure layers.
Strong verbal and written communication skills, including the ability to work across engineering, security, testing, and program stakeholders.
Demonstrated expertise in site reliability engineering, monitoring, automation, incident response, and performance optimization, along with experience with UNIX/Solaris.
Must be a U.S. Citizen.
Must possess an active Secret security clearance or be able to obtain one.

Nice To Haves

DevOps Engineer or equivalent SRE certification.
Experience supporting environments subject to RMF, STIG, audit, ATO, or similar compliance requirements.
Experience with Splunk, enterprise monitoring/observability tooling, or similar operational analytics platforms.
Experience supporting Oracle-based enterprise environments, including Oracle middleware, Oracle Database, or related platform services.
Experience supporting PeopleSoft or similarly complex ERP / HCM / payroll platforms.
Exposure to F5, Oracle Data Guard, Oracle GoldenGate, Kafka, or other enterprise integration / traffic / replication technologies.
Familiarity with scripting and automation using tools such as Shell, Python, or PowerShell.
Knowledge of DevOps, testing, and scanning tools, especially within PeopleSoft environments, such as PHIRE, PFT, Tricentis, Palo Alto, and CAST.
Experience as an SRE supporting DoD or federal agency programs.
Familiarity with UNIX/Solaris administration and systems programming.
Experience with observability platforms such as Prometheus, Grafana, Datadog, or Splunk.

Responsibilities

Define, implement, and maintain site reliability engineering practices for mission-critical applications and shared services, with emphasis on uptime, resiliency, recoverability, and operational excellence.
Establish and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services and environments.
Implement and maintain monitoring, alerting, and observability solutions for production systems.
Support production and pre-production operations across development, test, training, staging, and production environments.
Lead incident response activities, conduct root cause analysis, and implement permanent fixes.
Support capacity planning, performance analysis, trend monitoring, and scalability planning for enterprise platforms and services.
Create and maintain runbooks, standard operating procedures, incident playbooks, operational dashboards, and knowledge articles.
Support high availability, disaster recovery, backup/restore validation, and business continuity activities.
Develop and implement automation to reduce manual operational toil and improve system reliability.
Contribute to post-deployment validation, smoke testing, rollback readiness, and environment health checks during releases and maintenance windows.
Collaborate with teams supporting Oracle/PeopleSoft platforms, integration services, reporting services, and shared enterprise tooling to improve reliability end to end.
Collaborate with development teams to improve system reliability through design reviews and reliability engineering practices.
Perform capacity planning and performance optimization for production systems.
Other duties as assigned.