# Intake IV - preprocessing and derived variables

```{admonition} Overview
:class: dropdown

![Level](https://img.shields.io/badge/Level-expert-red.svg)


🎯 **objectives**: Learn how to integrate `intake-esm` in your workflow

⌛ **time_estimation**: "30min"

☑️ **requirements**: `intake_esm.__version__ == 2023.4.*`, at least 10GB memory.


- intake I

© **contributors**: k204210

⚖ **license**:

```

```{admonition} Agenda
:class: tip

Based on DKRZ's CMIP6 catalog, in this part you learn how to

1. [add a **preprocessing** function to `to_dataset_dict()`](#preprocessing)
1. [create a derived variable registry](#derived)
```

In [None]:
import intake
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#only for generating the web page we need to take the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk

<a class="anchor" id="preprocessing"></a>

## Use Preprocessing when opening assets and creating datasets
 
When calling intake-esm's `to_dataset_dict` function, we can pass a **preprocess** argument. Its value should be a function that intake-esm applies to the dataset of each asset right after opening it and before the assets are combined into the final datasets.

```{note}
For CMIP6, a [preprocessing package](https://github.com/jbusecke/cmip6_preprocessing) has been developed for homogenizing and preparing datasets of different ESMs for a grand analysis featuring

- renaming and setting of coordinates
- adjusting grid values to fit into a common range (0-360 for lon)
```

For example, to set specific variables as coordinates, you can define a [function](https://github.com/jbusecke/cmip6_preprocessing/blob/209041a965984c2dc283dd98188def1dea4c17b3/cmip6_preprocessing/preprocessing.py#L239) which

- receives an xarray dataset as an argument
- returns a new xarray dataset

In [None]:
def correct_coordinates(ds):
    """converts wrongly assigned data_vars to coordinates"""
    ds = ds.copy()
    for co in [
        "x",
        "y",
        "lon",
        "lat",
        "lev",
        "bnds",
        "lev_bounds",
        "lon_bounds",
        "lat_bounds",
        "time_bounds",
        "lat_verticies",
        "lon_verticies",
    ]:
        if co in ds.variables:
            ds = ds.set_coords(co)
    return ds
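
To see what such a function does before handing it to intake-esm, it can be tried on a small synthetic dataset. The following is a minimal sketch: it uses a condensed copy of `correct_coordinates` with a hypothetical `names` parameter and a made-up dataset (no catalog data) in which `lat_bounds` is stored as a data variable.

```python
import numpy as np
import xarray as xr


def correct_coordinates(ds, names=("lat_bounds", "lon_bounds", "time_bounds")):
    """Condensed copy of the function above, limited to a few names."""
    ds = ds.copy()
    for co in names:
        if co in ds.variables:
            ds = ds.set_coords(co)
    return ds


# Synthetic dataset in which lat_bounds was wrongly stored as a data variable
raw = xr.Dataset(
    {
        "tas": (("lat",), np.array([280.0, 290.0])),
        "lat_bounds": (("lat", "bnds"), np.array([[-90.0, 0.0], [0.0, 90.0]])),
    },
    coords={"lat": [-45.0, 45.0]},
)
fixed = correct_coordinates(raw)

print(sorted(raw.data_vars))         # data variables before the fix
print(sorted(fixed.data_vars))       # data variables after the fix
print("lat_bounds" in fixed.coords)  # whether lat_bounds is now a coordinate
```

After the call, `lat_bounds` is listed among the coordinates and only `tas` remains a data variable, while the input dataset `raw` is left unchanged.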

Now, when you open the dataset dictionary, pass the function as the *preprocess* argument:

In [None]:
cat=esm_dkrz.search(variable_id="tas",
                   table_id="Amon",
                   source_id="MPI-ESM1-2-HR",
                   member_id="r1i1p1f1",
                   experiment_id="ssp370"
                  )
test_dsets=cat.to_dataset_dict(
    zarr_kwargs={"consolidated":True},
    cdf_kwargs={"chunks":{"time":1}},
    preprocess=correct_coordinates
)
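
The same hook can carry other homogenization steps, such as the longitude range adjustment mentioned in the note above. The following is a minimal sketch under the assumption that the dataset has a one-dimensional `lon` coordinate in degrees east; the function name `wrap_lon_to_360` and the synthetic dataset are hypothetical and only illustrate the pattern. It is not the implementation used by the preprocessing package.

```python
import numpy as np
import xarray as xr


def wrap_lon_to_360(ds):
    """Map a 1D lon coordinate into the range [0, 360) and sort along it."""
    if "lon" not in ds.coords:
        return ds  # nothing to do for datasets without a lon coordinate
    ds = ds.assign_coords(lon=ds["lon"] % 360)
    return ds.sortby("lon")


# Synthetic stand-in for one asset (not real CMIP6 data)
demo = xr.Dataset(
    {"tas": (("lat", "lon"), np.arange(8.0).reshape(2, 4))},
    coords={"lat": [-45.0, 45.0], "lon": [-180.0, -90.0, 0.0, 90.0]},
)
wrapped = wrap_lon_to_360(demo)
print(wrapped["lon"].values)
```

Since `preprocess` takes a single function, several steps like this one and `correct_coordinates` would have to be called one after another inside one wrapper function. Models with two-dimensional longitude arrays need a different treatment than the `sortby` call used here.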

<a class="anchor" id="derived"></a>

## Derived variables

Most of the following is taken from the [intake-esm tutorial](https://intake-esm.readthedocs.io/en/latest/how-to/define-and-use-derived-variable-registry.html).

> A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset. [...] Derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.

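The registry of the derived variables can be connected to the catalog. When users open datasets from such a catalog, intake-esm computes the registered derived variables from the variables they depend on.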

In [None]:
import intake
import intake_esm

In [None]:
from intake_esm import DerivedVariableRegistry

```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)

- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake-esm documentation.

```