Intake IV - preprocessing and derived variables


Using DKRZ’s CMIP6 catalog, this part shows you how to

  1. add a preprocessing step to to_dataset_dict()

  2. create a derived variable registry

import intake

Use preprocessing when opening assets and creating datasets

When calling intake-esm’s to_dataset_dict function, we can pass the argument preprocess. Its value should be a function that is applied to the dataset of each asset right after it is opened, before the assets are combined into the returned datasets.


For CMIP6, a preprocessing package has been developed that homogenizes and prepares datasets from different ESMs for a joint analysis. Its features include

  • renaming and setting of coordinates

  • adjusting grid values to fit into a common range (0-360 for lon)
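As a rough illustration of the second point, the following minimal sketch maps a one-dimensional longitude coordinate into the range 0-360. The function name wrap_lon_to_360 and the coordinate name lon are assumptions made for this example; this is not the API of the CMIP6 preprocessing package, and curvilinear grids with two-dimensional longitudes would need different handling.

```python
import numpy as np
import xarray as xr


def wrap_lon_to_360(ds, lon_name="lon"):
    """Return a dataset whose 1D longitude coordinate lies in [0, 360)."""
    # Leave datasets without the expected coordinate untouched.
    if lon_name not in ds.coords:
        return ds
    # Map e.g. -90 to 270, then sort so that longitudes increase again.
    wrapped = ds.assign_coords({lon_name: ds[lon_name] % 360})
    return wrapped.sortby(lon_name)


# Small synthetic dataset on a -180..180 grid (illustrative values only).
lon_demo = xr.Dataset(
    {"tas": (("lon",), np.arange(4.0))},
    coords={"lon": [-180.0, -90.0, 0.0, 90.0]},
)
lon_demo_wrapped = wrap_lon_to_360(lon_demo)
```

A function of this shape takes one xarray dataset and returns one, so it could be passed as preprocess in the same way as the function defined below.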

For example, to set specific variables as coordinates, you can define a function that

  • receives an xarray dataset as an argument

  • returns a new xarray dataset

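def correct_coordinates(ds):
    """converts wrongly assigned data_vars to coordinates"""
    ds = ds.copy()
    # NOTE: the original list of names is missing from this page. The names
    # below are illustrative placeholders; adapt them to your data.
    for co in ["lon", "lat", "lon_bounds", "lat_bounds", "time_bounds"]:
        # only promote variables that actually exist in this dataset
        if co in ds.variables:
            ds = ds.set_coords(co)
    return ds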

Now, when you open the dataset dictionary, you pass this function as the preprocess argument:
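A minimal sketch of such a call is given here. The helper open_with_preprocess is a hypothetical wrapper introduced only for this example, the catalog location in the comment is a placeholder, and the search facets are merely an example query; only search and to_dataset_dict(preprocess=...) are intake-esm calls.

```python
def open_with_preprocess(catalog, preprocess, **query):
    """Search an intake-esm catalog and open the matches with a preprocess hook.

    `catalog` is expected to be an intake-esm datastore; `query` holds the
    search facets (for CMIP6 catalogs e.g. variable_id, experiment_id).
    """
    subset = catalog.search(**query)
    # intake-esm calls `preprocess` on the dataset of every asset it opens.
    return subset.to_dataset_dict(preprocess=preprocess)


# Typical use (not executed here; the catalog location is a placeholder):
#   import intake
#   cat = intake.open_esm_datastore("<path-or-URL-of-the-DKRZ-CMIP6-catalog>")
#   dsets = open_with_preprocess(
#       cat, correct_coordinates, variable_id="tas", experiment_id="historical"
#   )
```

The result is a dictionary that maps dataset keys to xarray datasets, each of which has already passed through the preprocessing function.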
Derived variables

Most of the following is taken from the intake-esm tutorial.

A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset. […] Derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.
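To make the idea concrete, here is a minimal sketch of the kind of function that computes a derived variable. It assumes, purely for illustration, that the two CMIP6 wind components uas and vas are already merged in one dataset on the same grid; the function name add_wind_speed and the output name wind_speed are choices made for this example, not part of intake-esm.

```python
import numpy as np
import xarray as xr


def add_wind_speed(ds):
    """Return a copy of ds with 'wind_speed' derived from 'uas' and 'vas'."""
    ds = ds.copy()
    # Both inputs must share one grid so that they combine element-wise.
    speed = np.sqrt(ds["uas"] ** 2 + ds["vas"] ** 2)
    ds["wind_speed"] = speed.assign_attrs(
        units="m s-1", long_name="near-surface wind speed"
    )
    return ds


# Tiny synthetic dataset standing in for two merged catalog variables.
wind_demo = xr.Dataset(
    {
        "uas": (("time",), [3.0, 0.0]),
        "vas": (("time",), [4.0, 2.0]),
    }
)
wind_result = add_wind_speed(wind_demo)
```

The function receives a dataset that holds the required input variables and returns it with the new variable added, which is why all inputs have to fit into a single dataset.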

The registry of the derived variables can be connected to the catalog. When users open

import intake
import intake_esm
from intake_esm import DerivedVariableRegistry