Observability & Operations Engineer

Fullbay • Phoenix, AZ

About The Position

About Us: Fullbay is a leading SaaS organization dedicated to providing exceptional products/services to our clients. We are passionate about growth, innovation, and delivering top-notch customer experiences. Join our dynamic team and be a part of shaping the future.

Position Overview: The Observability & Operations Engineer is a key technical contributor who brings an AI-first mindset to maintaining, monitoring, and operating our AWS cloud environment and internal Developer Platform. In this role, you won’t just react to incidents; you’ll leverage AI-powered tooling, intelligent alerting, and automation to get ahead of problems before they impact users. You’ll work deeply across AWS and its PaaS ecosystem, building repeatable, code-first pipelines that treat infrastructure and observability configuration as first-class software. From using AI coding assistants to accelerate runbook development, to applying ML-based anomaly detection across logs and metrics, you’ll be expected to ask “how can AI help here?” as a first instinct. Working within a dedicated platform team, you’ll build the observability foundations that keep our systems fast, reliable, and self-healing.

Requirements

7–10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering
5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus
Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software
Experience working with polyglot environments including Java, Kotlin, and Node.js
Demonstrated experience using AI tools (coding assistants, AI-powered observability platforms, or similar) in a professional setting — we’re an AI-first company and expect this to be part of how you work, not something you’re just exploring
Deep experience with enterprise observability platforms, including AWS-native tooling such as CloudWatch and X-Ray, open standards such as OpenTelemetry, or comparable platforms such as Datadog, Grafana, or Prometheus
Proficiency with distributed tracing frameworks and log management platforms (e.g. ELK Stack, Splunk, Fluent Bit); experience mapping these patterns to AWS-native tooling is a strong plus
Strong understanding of SRE principles including SLOs, SLAs, error budgets, and chaos engineering
Hands-on FinOps experience — cloud cost allocation, chargeback modeling, rightsizing, and savings plans optimization across AWS
Strong working knowledge of AWS PaaS services including Lambda, API Gateway, ECS, RDS, SQS, SNS, and IAM — and how to leverage them to build scalable operational tooling
Experience instrumenting polyglot applications (Java, Kotlin, Node.js) and cloud-native microservices for observability
Proven ability to build repeatable, code-first pipelines — treating dashboards, alerts, runbooks, and infrastructure configuration as versioned, testable software
Experience with CI/CD tooling, specifically Harness
Solid understanding of Infrastructure as Code using Terraform
Fluency with AI tools in day-to-day work — whether that’s AI coding assistants, AI-powered monitoring features, or using LLMs to accelerate problem solving; you default to asking “can AI help here?” before doing things the hard way
Ability to lead incident response, facilitate blameless post-mortems, and drive long-term reliability improvements
Strong collaboration skills for working across platform and product engineering teams
Knowledge of containerization technologies and microservices architecture

Responsibilities

Design and implement a comprehensive observability strategy (logging, metrics, tracing, alerting) across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically
Build and manage monitoring platforms such as Datadog, Grafana, Prometheus, and AWS CloudWatch — actively exploring AI-native features within these tools to reduce alert fatigue and improve signal quality
Use AI coding assistants (e.g. GitHub Copilot, Claude) to accelerate development of dashboards, runbooks, and automation scripts
Own the incident management lifecycle — on-call rotations, post-mortems, root cause analysis — and apply AI-assisted log analysis to speed up diagnosis and resolution
Instrument Java, Kotlin, and Node.js-based cloud-native applications to emit structured logs, distributed traces, and metrics; identify opportunities to use ML-based anomaly detection in place of static thresholds
Build repeatable, code-first observability pipelines that treat dashboards, alerts, and runbooks as first-class software — versioned, tested, and deployed through Harness
Leverage AWS PaaS services (Lambda, API Gateway, ECS, RDS, SQS, SNS, and others) to build scalable, automated operational tooling
Collaborate with development teams to embed observability and AI-assisted quality checks into CI/CD pipelines via Harness
Own the FinOps function for our AWS environment — tracking cloud spend, building cost dashboards, identifying waste, and using AI-powered cost analysis tools to surface optimization opportunities and drive accountability across engineering teams
Monitor AWS infrastructure for performance, availability, and cost — partnering with finance and engineering to enforce spend governance
Develop and maintain Infrastructure as Code using Terraform, using AI pair programming to improve quality and consistency
Contribute to architectural decisions with a focus on resilience, automation, and reducing toil through intelligent systems
Adhere to all confidentiality and compliance regulations
Perform other duties as assigned