🗺️ Recipes#

Instead of running long, complex CLI commands every time you want to build a dataset, fetchez allows you to define your entire workflow in a YAML file called a Recipe.

By treating your data pipelines as Infrastructure as Code, you make your data pulls reproducible, auditable, and shareable.

How to Launch a Recipe#

Recipes are written in standard YAML. To execute a recipe and start fetching data, simply pass the YAML file to the fetchez CLI:

fetchez recipes/my_archive_project.yaml

Alternatively, you can load and launch recipes directly within a Python driver script using the fetchez.recipe API:

from fetchez.recipe import Recipe

# Load the engine with your recipe and launch
Recipe.from_file("recipes/my_archive_project.yaml").run()

Anatomy of a Recipe#

A fetchez YAML configuration is divided into operational blocks. The example below describes a project that downloads nautical charts and topography data, unzips the archives, and audits the result:

1. Project & Execution Metadata#

Defines what you are building, how many parallel downloads to run, and the geographic region to cover.

project:
  name: "Miami_Coastal_Data"
  description: "Pulling raw shapefiles and TIFFs for local analysis."

execution:
  threads: 4 # Number of parallel download streams

region: [-80.5, -80.0, 25.5, 26.0] # The bounding box: [West, East, South, North]
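The `[West, East, South, North]` order differs from the `[West, South, East, North]` order used by some other tools (GeoJSON bounding boxes, for example), so it can be worth checking a region before launching a long job. The sketch below is a minimal sanity check that catches inverted or out-of-range values. It assumes coordinates in decimal degrees and a box that does not cross the antimeridian; `validate_region` is a hypothetical helper, not part of the fetchez API.

```python
def validate_region(region):
    """Check a [west, east, south, north] bounding box given in decimal degrees.

    Hypothetical helper for illustration; not part of the fetchez API.
    """
    if len(region) != 4:
        raise ValueError("region needs exactly four values: [west, east, south, north]")
    west, east, south, north = region
    if not (-180.0 <= west < east <= 180.0):
        raise ValueError(f"expected -180 <= west < east <= 180, got west={west}, east={east}")
    if not (-90.0 <= south < north <= 90.0):
        raise ValueError(f"expected -90 <= south < north <= 90, got south={south}, north={north}")
    return west, east, south, north


# The Miami box from the recipe above passes the check.
validate_region([-80.5, -80.0, 25.5, 26.0])
```

A check like this cannot detect every ordering mistake (some swapped boxes are still numerically valid), so it complements rather than replaces a careful read of the recipe.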

2. Modules (The Data Sources)#

The modules block lists the data sources fetchez will query and ingest. Modules are evaluated in order.

modules:
  # Download NOAA Nautical Charts
  - module: charts
    hooks:
      # These hooks ONLY apply to charts data
      - name: unzip
        args:
          remove: true # Delete the .zip after extracting

  # Download Copernicus Topography
  - module: copernicus
    args:
      datatype: "1" # COP-30
    hooks:
      - name: checksum
        args:
          algo: "sha256"

  # Seamlessly include local data in the pipeline!
  - module: local_fs
    args:
      path: "../local_surveys/field_notes/"
      ext: ".csv"

3. Global Hooks (The Assembly Line)#

The global_hooks block defines the processing pipeline. While module hooks touch only the data from their own module, global hooks process the combined pool of data from all modules.

global_hooks:
  # Runs after ALL downloads and unzipping are finished
  - name: audit
    args:
      file: "miami_data_audit.json"
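To see how the blocks nest, it can help to look at the recipe as plain data. The sketch below writes the example recipe as the nested dicts and lists a YAML parser would typically produce, then lists which hooks are attached to which scope. The `describe_hooks` function is hypothetical and only illustrates the structure; how fetchez represents a recipe internally may differ.

```python
# The recipe from the sections above as nested dicts and lists.
# Illustration only; fetchez's internal representation may differ.
recipe = {
    "project": {"name": "Miami_Coastal_Data"},
    "execution": {"threads": 4},
    "region": [-80.5, -80.0, 25.5, 26.0],
    "modules": [
        {"module": "charts",
         "hooks": [{"name": "unzip", "args": {"remove": True}}]},
        {"module": "copernicus",
         "args": {"datatype": "1"},
         "hooks": [{"name": "checksum", "args": {"algo": "sha256"}}]},
        {"module": "local_fs",
         "args": {"path": "../local_surveys/field_notes/", "ext": ".csv"}},
    ],
    "global_hooks": [{"name": "audit", "args": {"file": "miami_data_audit.json"}}],
}


def describe_hooks(recipe):
    """Return (scope, hook_name) pairs in the order they appear in the recipe."""
    pairs = []
    for entry in recipe.get("modules", []):
        # A module without a hooks key (local_fs here) contributes nothing.
        for hook in entry.get("hooks", []):
            pairs.append((entry["module"], hook["name"]))
    for hook in recipe.get("global_hooks", []):
        pairs.append(("global", hook["name"]))
    return pairs


for scope, name in describe_hooks(recipe):
    print(f"{scope}: {name}")
```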

Understanding Hooks and the Lifecycle#

Hooks are the specialized tools that intercept and process your data. It is critical to understand when they run. fetchez processes hooks in three distinct stages:

PRE Stage: Runs before downloads begin.#

*Use case:* Filtering the list of URLs based on regex, limiting the maximum number of files to download, or authenticating tokens.

FILE Stage: Runs during the download loop on each individual file.#

*Use case:* Unzipping archives immediately as they arrive, verifying checksums, or piping the file path to standard output.

POST Stage: Runs after all files have been downloaded and processed.#

*Use case:* Generating a JSON audit log, archiving the final output directory into a tarball, or sending a Slack notification that the job is done.
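The ordering of the three stages can be sketched as a small toy model. This is not fetchez's implementation: `run_lifecycle` and the three hook functions are hypothetical stand-ins, and the "download" is simulated by taking the file name from each URL.

```python
def run_lifecycle(urls, pre_hooks, file_hooks, post_hooks):
    """Toy model of the PRE -> FILE -> POST order. Not fetchez's implementation."""
    # PRE: each hook sees the full URL list and may filter or trim it.
    for hook in pre_hooks:
        urls = hook(urls)

    # FILE: each hook runs once per file, as soon as that file "arrives".
    results = []
    for url in urls:
        path = url.rsplit("/", 1)[-1]  # stand-in for the actual download
        for hook in file_hooks:
            path = hook(path)
        results.append(path)

    # POST: each hook runs once, after every file has been handled.
    for hook in post_hooks:
        hook(results)
    return results


# Hypothetical hooks, one per stage.
def only_zips(urls):
    return [u for u in urls if u.endswith(".zip")]


def fake_unzip(path):
    return path[:-len(".zip")]  # pretend to unzip by dropping the extension


audit_log = []


def audit(paths):
    audit_log.append({"files": len(paths)})


files = run_lifecycle(
    ["https://example.com/a.zip", "https://example.com/b.txt", "https://example.com/c.zip"],
    pre_hooks=[only_zips],
    file_hooks=[fake_unzip],
    post_hooks=[audit],
)
```

In this model the PRE hook removes `b.txt` before anything is fetched, the FILE hook runs separately on each of the two remaining files, and the POST hook runs once and sees both results.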

Global vs. Module Hooks#

Module Hooks (modules.hooks): Execute only on the files fetched by that specific module. For example, you might want to run the unzip hook only on USGS data and leave Copernicus files as tarballs.
Global Hooks (global_hooks): Execute on the combined dataset aggregated from all modules.
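The difference in scope can be illustrated with another toy model. It assumes files have already been fetched and grouped by module; `run_scoped`, the module names used as keys, and the file names are hypothetical, and the real engine may interleave these steps differently.

```python
def run_scoped(fetched, module_hooks, global_hooks):
    """Toy model of hook scope. `fetched` maps a module name to its file names."""
    pool = []
    for module, files in fetched.items():
        for name in files:
            # Module hooks see only the files from their own module.
            for hook in module_hooks.get(module, []):
                name = hook(name)
            pool.append(name)
    # Global hooks receive the combined pool from every module.
    for hook in global_hooks:
        hook(pool)
    return pool


def strip_zip(name):
    # Pretend to unzip by dropping a trailing ".zip".
    return name[:-4] if name.endswith(".zip") else name


seen_by_global = []

pool = run_scoped(
    fetched={"usgs": ["dem.zip"], "copernicus": ["tile.tar"]},
    module_hooks={"usgs": [strip_zip]},  # no module hooks for copernicus
    global_hooks=[seen_by_global.extend],
)
```

Here only the `usgs` file passes through `strip_zip`, while the global hook sees both files, including the untouched Copernicus tarball.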