Intake II - modifying intake-esm databases and saving new catalogs#

Agenda

Based on DKRZ’s CMIP6 catalog, in this part you will learn how to

  1. Modify the database of the catalog

    • How to rename values in a column

  2. Make complex searches

    • Which member was produced the most?

  3. Save subset catalogs

The database of intake-esm catalogs is handled as a pandas DataFrame. In sections 1 and 2, you will learn to modify this database and make complex searches with pandas commands. The tutorial covers some examples - if you aim at a deeper understanding of pandas, we recommend the extensive tutorials in the pandas documentation.

Note

Pandas is a powerful data analysis tool; with its DataFrame class, users can process table-like data fast and intuitively.
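As a minimal, self-contained sketch (using a small hand-made table instead of the real catalog), this is the kind of DataFrame that backs an intake-esm catalog, and how boolean indexing selects rows from it:

```python
import pandas as pd

# A toy stand-in for a catalog database (the real one has many more columns)
df = pd.DataFrame({
    "source_id": ["ACCESS-CM2", "MPI-ESM1-2-HR", "MPI-ESM1-2-HR"],
    "variable_id": ["tas", "tas", "pr"],
})

# Boolean indexing: keep only rows where variable_id equals "tas"
tas_rows = df[df["variable_id"] == "tas"]
print(len(tas_rows))  # 2
```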

import intake
#dkrz_cdp=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for generating the web page we need to take the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk

Modify the database#

Assume you want to rename a short name like tas into a long name like Temperature. We define a dictionary for renaming:

rename_dict={"tas":"Temperature",
             "pr":"Precipitation"}

For all items in the rename dictionary, we will reset the value in the underlying DataFrame in place. We iterate over the dictionary with .items(), which returns the key and value separately for each entry.

With the .loc attribute of the underlying DataFrame, we can access a slice inside the DataFrame. The first argument of loc is the row indexer, the second is the column indexer.

  • The row indexer is a condition: all rows where the variable equals the key of our dictionary, e.g. variable_id == "tas". In general terms, that is esm_dkrz.df["variable_id"]==short_name

  • The column indexer is simply variable_id

Therefore, our code looks like:

for short_name,long_name in rename_dict.items():
    esm_dkrz.df.loc[esm_dkrz.df["variable_id"]==short_name, "variable_id"]=long_name
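The same conditional-assignment pattern can be checked on a toy DataFrame, without touching the real catalog. Pandas' replace method (an alternative not used by this tutorial) achieves the same result in one call:

```python
import pandas as pd

rename_dict = {"tas": "Temperature", "pr": "Precipitation"}

# Same pattern as above: conditional assignment with .loc
df = pd.DataFrame({"variable_id": ["tas", "pr", "ua", "tas"]})
for short_name, long_name in rename_dict.items():
    df.loc[df["variable_id"] == short_name, "variable_id"] = long_name

# Equivalent one-liner with replace (values not in the dict stay unchanged)
df2 = pd.DataFrame({"variable_id": ["tas", "pr", "ua", "tas"]})
df2 = df2.replace({"variable_id": rename_dict})

print(df["variable_id"].tolist())  # ['Temperature', 'Precipitation', 'ua', 'Temperature']
print(df.equals(df2))              # True
```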

Now, you can search for “Temperature”:

esm_dkrz.search(variable_id="Temperature")

/work/ik1017/Catalogs/dkrz_cmip6_disk catalog with 2687 dataset(s) from 140620 asset(s):

unique
activity_id 19
institution_id 36
source_id 91
experiment_id 208
member_id 425
table_id 9
variable_id 1
grid_label 7
dcpp_init_year 63
version 544
time_range 15866
project 1
format 1
uri 140620
derived_variable_id 0

Warning

In this example, we changed variable_ids which are predefined by the CMIP6 data standard for good reasons. Do NOT consider renaming as good practice. Do NOT share a catalog with renamed variables.

Complex searches, e.g. for combinations of attributes:#

If you want to know which ensemble member assignment is “preferred” by one specific ESM or by all ESMs, you can find out by grouping the underlying database with groupby. It takes columns as arguments, which are used for grouping, i.e. groupby creates a new combined index with all unique combinations of the specified columns. We have to save the returned object as we continue to work with the grouped DataFrame:

grouped_df=esm_dkrz.df.groupby(["source_id","member_id"])

Note

If you return the underlying df variable, it will no longer be in the context of the catalog; instead, it is only the DataFrame that corresponds to the database of the catalog.

The returned grouped_df is only the start of an operation. Pandas calls this workflow split-apply-combine. We have started by splitting the DataFrame into groups.

  • apply: We now calculate the number of entries for each group, which can easily be done with size(). It returns one count per group.

  • combine: Afterwards, we reset the index with reset_index(name='counts'). The columns which were used for grouping, and which were indexes of the grouped DataFrame, become regular columns again, and the group sizes are stored in a new column named “counts”. The return value of reset_index is a regular DataFrame.

grouped_df_size=grouped_df.size().reset_index(name='counts')
grouped_df_size
source_id member_id counts
0 ACCESS-CM2 r1i1p1f1 9742
1 ACCESS-CM2 r2i1p1f1 1339
2 ACCESS-CM2 r3i1p1f1 966
3 ACCESS-CM2 r4i1p1f1 21
4 ACCESS-ESM1-5 r10i1p1f1 670
... ... ... ...
1637 UKESM1-0-LL r6i1p1f3 603
1638 UKESM1-0-LL r7i1p1f2 148
1639 UKESM1-0-LL r7i1p1f3 1305
1640 UKESM1-0-LL r8i1p1f2 8789
1641 UKESM1-0-LL r9i1p1f2 1587

1642 rows × 3 columns
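The size()/reset_index step can be verified on a toy table (illustrative values, not catalog data); counting entries per (source_id, member_id) group works the same way:

```python
import pandas as pd

df = pd.DataFrame({
    "source_id": ["A", "A", "A", "B"],
    "member_id": ["r1", "r1", "r2", "r1"],
})

# split: group by the two columns; apply: count rows per group;
# combine: reset_index turns the group keys back into ordinary columns
counts = df.groupby(["source_id", "member_id"]).size().reset_index(name="counts")
print(counts["counts"].tolist())  # [2, 1, 1]
```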

With that DataFrame, we can already display statistics of ensemble members for specific sources. If we want to know the occurrences of the ensemble members of the source MPI-ESM1-2-HR only, we can subselect this source from the data very easily. Our condition grouped_df_size["source_id"]=="MPI-ESM1-2-HR" is just put in brackets of the DataFrame:

grouped_df_size_mpi=grouped_df_size[grouped_df_size["source_id"]=="MPI-ESM1-2-HR"]

The DataFrame has plot options which allow us to directly create a figure from our data. One nice way to plot the counts of ensemble members for MPI-ESM1-2-HR is to use the bar() plot:

Note

We will use hvplot because it can create interactive figures. The same plot can be created without the .hvplot part of the command. You can also check for other plot plugins of pandas.

import hvplot.pandas
grouped_df_size_mpi.hvplot.bar(x="member_id")

But how is it in general for all sources? For that, we will do another split-apply-combine to get our target measure:

  • split: for all unique member (groupby("member_id")),…

  • apply: … we select the counts column and calculate its mean for each group ([["counts"]].mean())

  • combine: and sort the values with sort_values("counts", ascending=False)

We can do that in one line:

grouped_df_member=grouped_df_size.groupby("member_id")[["counts"]].mean().sort_values("counts", ascending=False)
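This one-liner can be checked on a toy counts table (illustrative values); selecting the counts column before aggregating avoids averaging over unrelated columns:

```python
import pandas as pd

sizes = pd.DataFrame({
    "source_id": ["A", "B", "A", "B"],
    "member_id": ["r1", "r1", "r2", "r2"],
    "counts": [10, 20, 4, 2],
})

# split by member_id, apply the mean to the counts column, sort descending
member_mean = sizes.groupby("member_id")[["counts"]].mean().sort_values("counts", ascending=False)
print(member_mean["counts"].tolist())  # [15.0, 3.0]
```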

As there are hundreds of unique ensemble member values, we should only plot some, e.g. the top 10:

import hvplot.pandas
grouped_df_member.iloc[0:10,:].hvplot.bar()

Save a catalog subset as a new catalog#

Tip

We highly recommend that you save the subset of the catalog which you use in your analysis. Catalogs are often not as stable as they should be. With a local copy, you can check whether the original source has changed.

If we want to save our subset catalog, which only contains metadata, we save the search result:

esm_subset=esm_dkrz.search(variable_id="Temperature")

Secondly, intake-esm provides the serialize() function. Its only argument is the name of the catalog, which is used as the filename. By default, it writes both parts of the catalog together into a single .json file:

esm_subset.serialize("esm_subset")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json

Or into two separate files if we provide catalog_type="file" as a second argument. The single .json file is over 100 MB; if we save the database in a separate .csv.gz file, we reduce that to 2 MB:

esm_subset.serialize("esm_subset", catalog_type="file")
Successfully wrote ESM catalog json file to: file:///builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/esm_subset.json
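The size reduction comes from storing the assets table as a gzip-compressed CSV, which pandas handles transparently. A minimal sketch of the underlying mechanism, with an illustrative repetitive toy table standing in for the catalog database:

```python
import os
import tempfile

import pandas as pd

# A repetitive toy table, standing in for a catalog's asset database
df = pd.DataFrame({
    "variable_id": ["Temperature"] * 1000,
    "uri": [f"/work/data/file_{i}.nc" for i in range(1000)],
})

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "catalog.csv")
packed = os.path.join(tmp, "catalog.csv.gz")

df.to_csv(plain, index=False)
df.to_csv(packed, index=False)  # gzip compression inferred from the .gz suffix

smaller = os.path.getsize(packed) < os.path.getsize(plain)
roundtrip_ok = pd.read_csv(packed).equals(df)  # reading back decompresses transparently
print(smaller, roundtrip_ok)  # True True
```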

Now, we can open the catalog from disk:

intake.open_esm_datastore("esm_subset.json")

esm_subset catalog with 2687 dataset(s) from 140620 asset(s):

unique
activity_id 19
institution_id 36
source_id 91
experiment_id 208
member_id 425
table_id 9
variable_id 1
grid_label 7
dcpp_init_year 63
version 544
time_range 15866
project 1
format 1
uri 140620
derived_variable_id 0