Senior Software Engineer - SRE

OneTrust
Atlanta, GA
$116,475 - $174,713
Hybrid

About The Position

OneTrust's mission is to enable innovation through the responsible use of data and AI. We believe that ensuring data is trusted shouldn’t slow teams down—it should accelerate what’s possible. This led us to develop the first technology platform for responsible data use in 2016. Today, with AI representing the latest and most impactful expansion of data yet, OneTrust is once again redefining what responsible innovation looks like. OneTrust, the AI‑Ready Governance Platform™, unifies regulatory intelligence, automation, and connected governance workflows so businesses can continue to move at the speed of AI while ensuring good governance to prevent data misuse at scale. Trusted by thousands of organizations worldwide, OneTrust is shaping the future where trusted data becomes a transformative force for business and society.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical or business field
  • 4+ years of application development experience with Java or an equivalent language
  • Experience with the Spring ecosystem
  • Experience in cloud-based infrastructure (Azure, AWS, GCP, etc.)
  • Experience with the factors that affect application performance at different levels, including database performance, network performance, CPU utilization, JVM tuning, memory analysis, thread management, and query performance
  • An understanding of the importance of centralized logging, metrics dashboards, and alerting, and the ability to discuss some of the tools used for these tasks
  • A good awareness of databases (ideally SQL/NoSQL)
  • Hands-on experience with observability tools (Datadog, Prometheus, Grafana, etc.)
  • Knowledge of CI/CD pipelines and infrastructure-as-code (Terraform, Helm, Jenkins, GitLab)
  • Experience building and operating AI-assisted incident response systems (root cause analysis, log summarization, anomaly triage)
  • Experience developing or integrating LLM-based tools to reduce MTTR and improve alert quality
  • Experience applying machine learning techniques for anomaly detection, capacity prediction, or failure pattern analysis
  • Experience deploying AI systems in production (not just experimentation)
  • Knowledge of vector databases, embeddings, or RAG architectures for operational intelligence
  • Strong understanding of prompt engineering and evaluation of LLM outputs in reliability workflows
  • Experience with Kubernetes and container orchestration (EKS/AKS/GKE)
  • Experience with distributed systems at scale
  • Familiarity with service meshes and microservices architectures

Nice To Haves

  • Experience with chaos engineering tools (Gremlin, Chaos Monkey)
  • Background in product-facing services with high traffic scale
  • Understanding of incident management platforms, including PagerDuty for alerting and Datadog for monitoring

Responsibilities

  • Engage and partner with various Engineering, Operations, and Product teams to design, deliver, and maintain a highly available and performant application platform.
  • Build and implement application observability and platform monitoring tools to continuously improve the customer experience
  • Eliminate toil by automating processes, tuning alerts, and improving code where it is most needed
  • Frequently evaluate new ideas and trends to identify potentially useful tools and techniques
  • Collaborate with different functional groups to identify gaps, prioritize, and resolve issues
  • Define, implement, and maintain SLIs and SLOs aligned with customer experience
  • Design and instrument SLIs such as latency, error rates, and availability across critical services
  • Manage and enforce error budgets to balance system reliability with product feature velocity.
  • Improve alert quality by reducing noise and focusing on actionable, high-signal alerts
  • Embed with product teams to review architectures and catch reliability risks early
  • Share your knowledge and experience with the Engineering organization
  • Share your findings with technical leadership and senior management
  • Build scripts in Python, Bash, Java, or Ruby for operational automation and incident response

Benefits

  • comprehensive healthcare coverage
  • flexible PTO
  • equity RSUs
  • annual performance bonus opportunities
  • retirement account support
  • 14+ weeks of paid parental leave
  • career development opportunities
  • company-paid privacy certification exam fees