About The Position

We at Confiz are hiring an Application Support Engineer with hands-on experience in monitoring tools, backend API troubleshooting, and incident management. Join our team to ensure seamless application performance and drive operational excellence.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, Information Technology, or a related field
  • 3+ years of experience in Application Support, Production Support, or Site Reliability Engineering roles
  • Must have hands-on experience with the Azure cloud platform
  • Strong knowledge of backend APIs (REST/GraphQL) and the ability to read logs and troubleshoot from them
  • Hands-on experience with Azure App Insights and at least one observability platform such as New Relic, Dynatrace, Grafana, or Splunk
  • Proficiency in SQL, with the ability to write complex queries for data validation and impact analysis
  • Experience in dashboard creation, alert configuration, and monitoring solutions
  • Strong problem-solving skills with the ability to identify patterns across incidents
  • Excellent written skills for ticket documentation and RCA preparation
  • Strong collaboration skills across engineering, QA, and operations teams
  • Proactive attitude with a mindset of continuous improvement and ownership
  • Experience in system performance monitoring and analyzing health metrics
  • Understanding of service SLAs, error budgets, and uptime reporting
  • Experience with incident management and on-call support processes
  • Familiarity with release management and post-release validation

Responsibilities

  • Troubleshoot issues by analyzing backend API logs and Azure App Insights
  • Build and maintain dashboards and alerts in Azure App Insights to monitor APIs, endpoints, and system/app health (spikes, deviations, anomalies)
  • Create monitoring solutions in platforms like New Relic, Dynatrace, Splunk, or Grafana
  • Ensure all support tickets are fully documented with detailed log analysis, journey insights, and resolution updates
  • Perform post-release log analysis to identify early issues or abnormal patterns
  • Ensure service SLAs are consistently met and take proactive measures to prevent breaches
  • Detect and investigate recurring issues/patterns across multiple cases and escalate to leads/ops teams
  • Write clear Root Cause Analysis (RCA) reports for all major (P1/P2) incidents, including identification of impacted customers via logs
  • Use SQL effectively for data validation, impact analysis, and troubleshooting
  • Maintain a proactive approach, continuously improving monitoring and incident management processes