Earth Systems Modeling Operations Analyst

Science Systems & Applications | Greenbelt, MD
$75,000 - $105,000 | Onsite

About The Position

Science Systems and Applications, Inc. (SSAI) is seeking an Operations Analyst to support the reliable and timely production of near real-time GEOS model products by monitoring operational workflows and executing approved workflow scripts within an HPC environment. This individual will ensure that scheduled run cycles complete successfully and that near real-time outputs are delivered according to operational timelines.

Requirements

  • Bachelor's Degree (B.S.) and a minimum of 2 years related experience and/or training, or equivalent combination of education and experience.
  • Specifically, 1-3 years of Earth System Modeling operations experience in a production environment with scheduled near real-time workloads.
  • Hands-on Linux operations and troubleshooting in production: log review and diagnostics, environment/module awareness, file system/storage space and permissions checks, and comfort using standard admin tools and CLIs for troubleshooting.
  • Experience using a job scheduler (Slurm, PBS, Cylc, or equivalent) for monitoring and operational troubleshooting (job states, dependencies, reruns, resource/time failures).
  • Demonstrated experience supporting shift/on-call responsibilities and responding to time-critical incidents.
  • Basic knowledge of scripting/programming for operations: bash/csh (workflow execution, wrapper scripts, log parsing, operational utilities); Python (simple tool development for status reporting, log parsing/QC automation, incident summaries); Perl (ability to maintain or extend existing operational scripts, at least to the level needed for troubleshooting and minor updates).
  • Workflow operations mindset: ability to follow procedures precisely and to practice safe recovery (e.g., knowing when reruns are permitted and how to avoid data corruption or duplicate outputs).
  • Ability to confirm that outputs are present, inspect their metadata, and perform basic sanity checks (e.g., netCDF/HDF5 familiarity at a practical level).
  • Strong attention to detail and ability to follow procedures under time pressure.
  • Clear communication during escalations (what failed, when, where, which logs/job IDs).
  • Team collaboration during cross-functional troubleshooting (operations ↔ science teams ↔ data providers and users).

Nice To Haves

  • Familiarity with numerical weather/climate operations concepts (cycles, near real-time product timing, typical failure modes).
  • Experience integrating lightweight monitoring/alerting (dashboards, alerts, automated status emails/messages).
  • Prior participation in incident management and structured post-incident review.
  • Experience running jobs in an HPC environment.
  • Experience with Sphinx for generating operational documentation from reStructuredText (rst) files.

Responsibilities

  • Operate during scheduled shifts and participate in on-call rotation to support near real-time product generation.
  • Monitor workflow execution for operational cycles (e.g., data staging/ingest steps, model runs, post-processing, output archive, and product distribution).
  • Execute approved workflow scripts and operational commands according to operational procedures.
  • Monitor job status and system health in scheduling tools such as Slurm, PBS, and Cylc (job states, failures, retries, dependencies) and confirm expected workflow progress.
  • Perform routine operational checks: validate inputs/paths and confirm that required inputs and dependencies exist; inspect key logs for known error signatures; run basic QC “sanity checks” on outputs as defined by operations procedures.
  • Diagnose issues at the workflow level (missing or corrupt inputs, scheduler issues, environment/module mismatches, storage/permission problems) and initiate recovery actions per operational procedures.
  • Escalate to model/system specialists when problems exceed operator scope; provide actionable incident details (error logs, job IDs, timestamps, impacted cycles).
  • Maintain operational documentation: submit and update error tracking tickets, and update web-based documentation of operational procedures.
  • Coordinate with upstream data providers on data outages, file modifications, or network issues.
  • Coordinate with downstream product users/teams to ensure timely near real-time delivery and communicate delays or expected recovery timelines.
  • Note: Operators do not modify or develop the model, but they are responsible for workflow execution, monitoring, and recovery within their authorized procedures.