Lead Site Reliability Engineer

General Dynamics Information Technology
241d · $144,500 - $195,500

About The Position

GDIT is looking to hire a lead Site Reliability Engineer (SRE) to help take a cloud team to the next level. You will work with the government and other team members to identify and help deliver improvements to the reliability of this agency's core cloud infrastructure. As an SRE, you will act as account manager and developer for the core AWS accounts, overseeing the services running in this agency’s infrastructure AWS accounts. You will need to develop a deep understanding of how systems interoperate within the infrastructure, including upstream and downstream dependencies.

You will be responsible for reviewing all AWS infrastructure deployments to identify upstream and downstream impacts and to ensure that test processes fully validate features and integrations. This includes all Change Management activities in the environment.

As an SRE, you will ensure that monitoring, logging, and alerting for services running in the core infrastructure accounts are properly configured and provide actionable information. You will develop new monitoring solutions, based on findings, that help prevent future issues, and develop SRE metrics that show how the overall infrastructure is performing. In collaboration with government stakeholders, you will develop and maintain a logging and monitoring strategy for the infrastructure platform, and write code when required to meet that strategy.

You will participate in emergency responses and provide incident response metrics. In the event of an incident, you will conduct and coordinate 5 Whys and other blameless post-mortem activities, and use the findings to drive out failures and one-time-only occurrences.

You will participate in continuous improvement activities such as technical debt analysis and contribute to the reliability standards and practices of the team. You will work with the team's DevOps engineers to improve the deployment process and introduce automated testing. You will audit resources in the accounts under your responsibility, identify areas for improvement or technical debt, and collaborate with program and government partners to prioritize them. You will assist the cloud infrastructure team and other teams in troubleshooting wide-area integration issues, and commit changes to the infrastructure codebase as necessary.