Cloud Platform Lead Consultant or Senior Consultant

Allstate • USA - TX (Remote)

5d • $100,000 - $170,500 • Remote

About The Position

Arity is seeking a Cloud Platform Lead Consultant or Senior Consultant to join our Operational Data Management team within Engineering. This is a fully remote position. In this role, you will design, build, deploy, and operate cloud-native data infrastructure across AWS and Google Cloud Platform while bringing deep hands-on expertise in databases, data streaming, and distributed systems. You will ensure the platforms that ingest, store, and serve billions of miles of driving data remain resilient, observable, and cost-efficient—directly enabling Arity's products and the customers who rely on them to make smarter transportation decisions. The ideal candidate combines cloud engineering mastery with strong database and streaming fundamentals, advanced production-grade coding skills in Python, and demonstrated hands-on experience building AI agents and Model Context Protocol (MCP) servers to streamline DevOps workflows. A successful candidate rapidly adopts new technologies and delivers production-ready solutions with them, guides application developers on performance improvements and code-level fixes, and independently leads root cause analysis for complex production incidents.

Requirements

3-5+ years of overall software engineering or infrastructure experience, including at least 2-4 years in site reliability engineering, DevOps, or platform engineering operating production systems at scale.
Demonstrated expertise designing, deploying, and managing cloud infrastructure on AWS and/or Google Cloud Platform, including networking, identity, and security fundamentals.
Strong hands-on experience with relational and NoSQL databases; production experience with PostgreSQL and at least one distributed database such as Apache Cassandra.
Production experience operating data streaming platforms; hands-on experience with Apache Kafka (including Amazon MSK) and a solid understanding of streaming fundamentals (partitions, consumer groups, delivery semantics, backpressure).
Advanced, production-grade Python and shell scripting skills, including writing, reviewing, and debugging application code, building custom automation tooling, and developing operational solutions that go well beyond basic scripting.
Strong experience with infrastructure-as-code (e.g., Terraform, Terraform Enterprise (TFE), Env0), Jenkins, Ansible, Git-based CI/CD pipelines, and container orchestration (e.g., Kubernetes) in production environments.
Experience implementing and automating monitoring, logging, and alerting solutions for distributed systems (e.g., Prometheus, Grafana, CloudWatch, Datadog, or equivalent), including building automated runbooks and self-healing remediation workflows.
Proven track record independently leading root cause analysis for complex production incidents that span infrastructure, databases, streaming pipelines, and application code layers.

Nice To Haves

Production experience operating and troubleshooting Apache NiFi, including flow design, processor-level debugging, backpressure configuration, cluster management, and contributing flow-level fixes and optimizations.
Hands-on operational experience with self-managed Apache Flink, including checkpoint management, state backend configuration, TaskManager memory tuning, job graph analysis, and application-level debugging of streaming jobs under backpressure.
Deep experience with Apache Cassandra, PostgreSQL, DynamoDB, Amazon Redshift, ElastiCache, and/or Google BigQuery in production environments.
Advanced experience with Apache Kafka, Apache Flink, and Google Pub/Sub, and with operating streaming workloads across both AWS and GCP.
Experience administering or optimizing Starburst Galaxy, Trino, or AWS Athena for large-scale analytics workloads.
Experience building AI agents or Model Context Protocol (MCP) servers to automate DevOps, observability, or operational workflows.
Hands-on experience with large language models (LLMs), including fine-tuning, prompt engineering, RAG pipeline development, or training custom models for operational and DevOps use cases.
Familiarity with data pipeline orchestration tools (e.g., Apache Airflow, dbt) and event-driven architectures.
Experience troubleshooting and supporting applications deployed on enterprise PaaS platforms (e.g., Cloud Foundry or equivalent), including understanding platform-level resource constraints, routing, and application lifecycle management.
Working proficiency in Golang sufficient to read production application code, interpret runtime behavior (goroutines, memory, pprof profiling), and contribute targeted performance fixes in collaboration with development teams.
AWS or Google Cloud professional-level certifications.
Experience with performance benchmarking, query plan analysis, and database capacity planning for high-throughput workloads.
Familiarity with application profiling, distributed tracing, and performance diagnostic tooling (e.g., APM, query analyzers, flame graphs) to isolate and resolve end-to-end latency issues.
Contributions to open-source database, streaming, or infrastructure projects.

Responsibilities

Design, deploy, and manage highly available database platforms including Apache Cassandra, PostgreSQL, Redis, Valkey, Amazon Redshift, and Google BigQuery across multi-cloud environments.
Build, operate, and optimize data streaming infrastructure using Amazon MSK (Kafka), Google Pub/Sub, and Apache Flink to support real-time and batch data pipelines.
Develop and maintain infrastructure-as-code, CI/CD pipelines, and cloud automation using Python and industry-standard tooling to enable repeatable, secure deployments.
Implement comprehensive monitoring, alerting, and observability for data platform services to proactively detect and resolve issues before they impact customers.
Partner with application development teams to troubleshoot, tune, and optimize application performance, query patterns, and data access layers backed by team-managed platforms.
Administer and optimize analytics and query engines including Starburst Galaxy and AWS Athena to deliver performant, cost-effective access to large-scale datasets.
Lead incident response, root cause analysis, and post-incident reviews for production database and streaming systems; drive remediation and preventive improvements.
Participate in an on-call rotation to provide 24x7 support for mission-critical data infrastructure.
Evaluate and adopt emerging technologies—including AI agents and MCP servers—to automate operational tasks, improve developer experience, and accelerate DevOps workflows.
Contribute to capacity planning, disaster recovery, security hardening, and cost optimization initiatives across the data platform estate.
Ability to review application source code, identify root causes of performance or reliability issues, and contribute targeted fixes or optimization guidance in collaboration with development teams.
Demonstrated ability to rapidly adopt unfamiliar technologies and deliver production-ready solutions within days to a week; strong self-directed learning with a track record of picking up new platforms, frameworks, and tools independently.
Proven ability to guide and advise software development teams on application-level performance tuning, query optimization, code-level improvements, and production troubleshooting—functioning as a technical authority on data platform usage patterns.
Strong understanding of distributed systems principles including high availability, fault tolerance, consistency models, and disaster recovery.
Excellent problem-solving, communication, and documentation skills with a track record of ownership in on-call and incident management environments.
Ability to read, debug, and analyze Java application code including Spring Boot and microservice frameworks; proficiency in JVM diagnostics including heap dump analysis, GC tuning, thread dump interpretation, and connection pool (e.g., HikariCP) troubleshooting.

Benefits

Comprehensive technology setup, including a laptop, monitors, headset, keyboard, and mouse.
Monthly connectivity reimbursement for employees eligible to work from home.
Opportunity to shape the future of protection.
Support for causes that mean the most to you.
