Observability and Monitoring Engineer

QTC Management, Inc.

22h•Hybrid

About The Position

Leidos QTC Health Services is seeking an Observability and Monitoring Engineer. This role involves designing, implementing, and maintaining systems that provide insight into the performance, availability, and reliability of applications and infrastructure. The engineer will work with monitoring tools, logging systems, distributed tracing, and alerting mechanisms to proactively detect and resolve issues. The position requires a combination of hands-on technical skills and soft skills, including building technical project plans, producing designs, and transitioning services to an operational steady state. The role is considered a driver of technology and forward thinking as LQTC continues to advance its enterprise.

Leidos QTC Health Services collaborates closely with government and non-government customers to address current and future program needs within the health services domain. The company specializes in disability-focused medical examinations, independent medical exams and review services, occupational health services, diagnostic testing, and case management solutions. As innovators, its teams focus on advancing technologies that improve service delivery, with a particular emphasis on enhancing accessibility for examinees in rural communities. With a proven track record of continuous improvement and steady growth, the company now handles over 2 million appointments annually.

Requirements

Bachelor’s degree in computer science, business administration, or a related field, or equivalent work experience in lieu of a degree.
9+ years of experience designing and implementing infrastructure solutions.
9+ years of industry-relevant experience.
Understanding and use of SDLC, Agile, and Six Sigma to drive fit-for-purpose technologies.
Must be able to successfully pass a National Agency Check with Inquiries (NACI) background investigation.

Nice To Haves

Relevant technical certifications a plus.
Ability to work effectively in a team environment.
Ability to utilize discretion and independent judgment to switch between priorities quickly without affecting quality or performance.
Excellent written and verbal communication skills.
Superior customer service skills.
Ability to work with minimal supervision.
Solid organization and planning skills, with strong attention to detail.
Advanced-level knowledge of infrastructure, OS, and database technologies, including but not limited to Windows, Linux, Oracle, SQL, Active Directory, load balancing and firewall technologies, network switching and routing, and core infrastructure such as compute, storage, virtualization, data protection, and business continuity.
Working experience in observability and automation in a hybrid environment (on-premises / AWS cloud).
Must possess the ability and flexibility to work extra hours and weekends.
Ability and desire to take ownership of work assignments and drive tasks to completion.
Engineering mindset when designing and rolling out new services or products and when providing T3 support.
Working proficiency in monitoring and logging tools (e.g., Splunk, Prometheus, SolarWinds, Dynatrace, and open-source tools).
Understanding of business drivers and metrics and the ability to translate them into measurable infrastructure metrics to drive proactive engagement and resolution of issues. The use of AI tools is a plus.
Engineering mindset in an enterprise environment.
Advanced working knowledge of cloud services, including SaaS, PaaS, and IaaS capabilities, across the main hyperscaler providers.
Working familiarity with microservices in a hybrid ecosystem, including Kubernetes, Docker, and other containerization solutions, is desired.
Working knowledge of ITIL and ITSM to define and maintain Service-Level Objectives (SLOs), Service-Level Indicators (SLIs), and Service-Level Agreements (SLAs).

Responsibilities

Partner with leadership and with Infrastructure and Development Service Owners to capture requirements and baselines.
Advise on LQTC’s Observability and Monitoring Strategy.
Engineer and implement Monitoring & Alerting (e.g., Splunk, SolarWinds, Dynatrace, StatusCake) so that configured alerts can be dispatched to the appropriate engineers and consolidated into the current incident management solution.
Engineer and maintain Log Aggregation schemas to collect and analyze system logs to allow for proactive issue resolution.
Conduct performance analysis to identify and troubleshoot performance bottlenecks in infrastructure, applications, and networks.
Partner with other resources to engineer and maintain a Single Pane of Glass solution, with specific underlying dashboarding and reporting, to provide real-time visibility into system health.
Participate in incident response, collaborating with Infrastructure teams, SREs, and DevOps teams to respond to system outages and performance issues.
Proactively address the Command Center’s Capacity Plan and analyze trends in system usage to optimize resources and prevent downtime.
Work with our Cyber teams to ensure that the observability ecosystem is engineered and maintained to meet security and compliance requirements.