About The Position

Overview

Seeking an experienced Observability and Monitoring Engineer to build and mature our enterprise-wide monitoring, logging, alerting, and observability capabilities across our AWS-based technology stack. This role will define the strategy, architecture, implementation standards, and dashboards that enable proactive detection, faster troubleshooting, and data-driven insights across applications, infrastructure, operating systems, databases, file transfers, and batch processes. The ideal candidate has hands-on engineering expertise, strong architecture skills, and the ability to unify multiple monitoring solutions into a cohesive observability framework.

Required Technical Skills

  • AWS Services (EC2, RDS, S3, Lambda, ECS/EKS, etc.)
  • Configuration Management (Ansible, Puppet, Chef)
  • Monitoring Tools (Dynatrace, CloudWatch, Zabbix, SolarWinds, Graylog, etc.)
  • CI/CD Tools (Jenkins, Quickbuild, Bitbucket)
  • Scripting Languages (Python, PowerShell, Bash)
  • Database Management (MS SQL Server, PostgreSQL)
  • Infrastructure as Code (Terraform, CloudFormation)
  • Container Technologies (Docker, Kubernetes)

Compensation

$45.00 per hour

About Us

AHU Technologies INC. is an IT consulting and permanent staffing firm that meets and exceeds the evolving IT service needs of leading corporations within the United States. We have been providing IT solutions to customers from different industry sectors, helping them control costs and release internal resources to focus on strategic issues. AHU Technologies INC. was co-founded by visionary young techno-commercial entrepreneurs who remain our principal consultants. By maintaining working relationships with a cadre of other highly skilled independent consultants, we have a growing number of resources available for development projects. We are currently working on various projects in areas such as media entertainment, ERP solutions, data warehousing, web applications, telecommunications, and medical for clients all over the world.

Requirements

  • Bachelor's degree in Computer Science or related field
  • 5+ years of experience implementing monitoring and observability using Dynatrace
  • Hands-on experience with monitoring/logging tools such as Zabbix, Graylog, Splunk, SolarWinds, or equivalents
  • 5+ years of hands-on experience with AWS services and architecture
  • Deep understanding of metrics, logs, traces, distributed tracing, and event correlation
  • Experience building dashboards and KPIs for application, infrastructure, and database layers
  • Strong scripting/automation skills (Python, Bash, PowerShell) and familiarity with Terraform or CloudFormation
  • Strong understanding of network monitoring, performance tuning, and systems architecture
  • Familiarity with ITIL incident/problem management processes

Nice To Haves

  • Proficiency with AI tools and their responsible use in improving observability
  • Experience with container orchestration and microservices architecture
  • Experience with AWS OpenTelemetry, Prometheus, Grafana, or similar tools

Responsibilities

  • You will establish standards for logs, metrics, traces, event correlation, and alerting across multiple environments.
  • You will build centralized dashboards and alerting policies that provide unified visibility across: applications & services, operating systems, AWS services (EC2, RDS, Lambda, S3, CloudWatch, CloudTrail, etc.), databases (MS SQL Server, PostgreSQL, etc.), file transfer systems (SFTP, managed transfer tools), batch jobs and scheduled processes.
  • You will create actionable and noise-free alerting thresholds, escalation policies, and runbooks.
  • You will integrate existing tools (Dynatrace, Graylog, Splunk, SolarWinds, Zabbix) into a cohesive ecosystem.
  • You will rationalize tool usage and recommend consolidation or modernization where appropriate.
  • You will manage the lifecycle, configuration, tuning, and health of monitoring and logging platforms, automate monitoring deployments using IaC (CloudFormation) and CI/CD pipelines, and develop reusable templates/standards so teams can onboard new applications quickly.
  • You will build self-service dashboards and reporting for technical/business stakeholders, and create documentation for monitoring standards, dashboard naming conventions, logging schemas, and alert configuration guidelines.
  • You will define SLOs/SLIs and reliability KPIs for critical services.
  • You will partner with scrum teams, infrastructure, and security teams to reduce MTTR and improve system reliability, and participate in incident resolution, root cause analysis, and problem management.
  • You will provide technical leadership/mentoring to team members and consult on architecture decisions and best practices.
  • You will develop/maintain system documentation and participate in project planning and technical strategy sessions.