Principal Cloud Platform Engineer

RCH Solutions
Radnor Township, PA
Remote

About The Position

RCH Solutions is seeking a Principal Cloud Platform Engineer with deep expertise in Kubernetes-based infrastructure to join our Cloud Engineering team. This role is ideal for individuals who take pride in designing, operating, and evolving large-scale, multi-tenant AI platforms that enable real-world data and AI applications in the life sciences domain. The role is focused on platform-level engineering: you will own the reliability, scalability, and operational excellence of shared infrastructure supporting RAG-based AI workloads, with a strong emphasis on Kubernetes cluster operations and vector database systems. You'll collaborate closely with Data Engineers and AI Engineers, supporting them with a scalable, cloud-hosted, multi-tenant infrastructure platform.

Requirements

  • 5+ years of hands-on experience in high-scale platform engineering (internal platforms, PaaS, or shared infrastructure)
  • Deep Kubernetes platform expertise
  • Hands-on experience with GKE: cluster upgrades, node pool management, and autoscaling; managing failures, disruptions, and complex maintenance scenarios; RBAC, namespaces, and network policies; GCP IAM, Workload Identity, and Secret Manager; GCP storage (BigQuery, GCS, Firestore)
  • Terraform and IaC experience with GitOps workflows (ArgoCD, Flux, or equivalent)
  • Strong observability practices using Google Cloud Operations Suite (Stackdriver) and Prometheus / Grafana
  • Hands-on experience operating vector databases in production, ideally Weaviate, including query performance tuning and cluster stability and scaling behavior
  • Solid understanding of distributed systems design and failure modes
  • Multi-zone / regional architectures
  • Google Cloud Load Balancing

Nice To Haves

  • Experience with Elasticsearch, OpenSearch, Azure AI Search, or similar distributed search systems
  • Experience with vector databases other than Weaviate, such as Milvus, Pinecone, Qdrant, or pgvector
  • Experience designing, building, and maintaining end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith, covering performance, latency, token usage, and alerting
  • Exposure to GenAI platforms and LLM-based applications
  • Experience in the life sciences domain

Responsibilities

  • Design, operate, and continuously improve production-grade K8s clusters at the platform level.
  • Lead complex cluster lifecycle management, including version upgrades and dependency coordination, failure recovery and incident resolution, and non-trivial maintenance and system evolution.
  • Build and maintain highly reliable, scalable, multi-tenant infrastructure.
  • Build and maintain end-to-end observability for LLM-based systems using Grafana, LangFuse, and LangSmith — covering performance, latency, token usage, and alerting.
  • Architect and operate shared infrastructure across multiple teams and use cases.
  • Implement and enforce RBAC and access control models, tenant isolation and security boundaries, and resource management and fairness at scale.
  • Ensure platform stability under diverse and competing workloads.
  • Operate and optimize vector database systems (Weaviate preferred) in production environments.
  • Support and scale Retrieval-Augmented Generation (RAG) systems.
  • Drive improvements in query performance and latency, cluster tuning and resource efficiency, and operational stability of retrieval pipelines.
  • Take technical ownership of production systems over time.
  • Build and maintain strong practices in observability (metrics, logs, tracing), incident response and root cause analysis, and long-term system health and resilience.
  • Proactively identify and resolve reliability risks.
  • Work closely with backend and GenAI engineers to ensure seamless integration with the platform.
  • Contribute to a balanced team structure, with a strong infrastructure core and targeted application-layer support.

Benefits

  • A competitive salary and bonus package based on experience
  • Comprehensive health and wellness benefits, including Medical, Dental, and Vision Insurance
  • Company-provided Life and Long-Term Disability Insurance
  • Company-sponsored 401(k) Plan
  • Company-provided continuing education benefit
  • Team-focused culture and unlimited opportunity for advancement