ATMODAT Standard Compliance Checker#

This notebook introduces you to the atmodat checker which contains checks to ensure compliance with the ATMODAT Standard.

Its core functionality is based on the IOOS compliance checker. The ATMODAT Standard Compliance Checker library makes use of cc-yaml, which provides a plugin for the IOOS compliance checker that generates check suites from YAML descriptions. Furthermore, the Compliance Check Library is used as the basis to define generic, reusable compliance checks.

In addition, compliance with the CF Conventions 1.4 or higher is verified with the CF checker.

In this notebook, you will learn

  • how to use an environment on DKRZ HPC mistral or levante

  • how to run checks with the atmodat data checker

  • how to understand the results of the checker and analyse them further with pandas

  • how you could proceed to cure the data with xarray if it does not pass the QC

Preparation#

On DKRZ’s high-performance computer (HPC), we provide a conda environment which is useful for working with data in DKRZ’s CMIP Data Pool.

Option 1: Activate checker libraries for working with a command-line shell

If you like to work with shell commands, you can simply activate the environment. Prior to this, you may have to load a module with a recent python interpreter:

module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance

Option 2: Create a kernel with checker libraries to work with jupyter notebooks

With ipykernel you can install a kernel which can be used within a jupyter server like jupyterhub. ipykernel creates the kernel based on the activated environment.

module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
python -m ipykernel install --user --name qualitychecker --display-name="qualitychecker"

If you run this command from within a jupyter server, you have to restart the jupyter server afterwards to be able to select the new quality checker kernel.

Expert mode: Running the jupyter server from a different environment than the environment in which atmodat is installed

Make sure that you:

  1. Install the cfunits package to the jupyter environment via conda install cfunits -c conda-forge -p $jupyterenv and restart the kernel.

  2. Add the atmodat environment to the PATH environment variable inside the notebook. Otherwise, the notebook’s shell does not find the application run_checks. You can modify environment variables with the os package and its mapping os.environ. The location of the kernel’s python interpreter can be found with sys.executable. The following block sets the environment variable PATH correctly:

import sys
import os
os.environ["PATH"]=os.environ["PATH"]+":"+os.path.sep.join(sys.executable.split('/')[:-1])
#As long as there is the installation bug, we have to manually get the Atmodat CVs:
if not "AtMoDat_CVs" in [dirpath.split(os.path.sep)[-1]
                         for (dirpath, dirs, files) in os.walk(os.path.sep.join(sys.executable.split('/')[:-2]))] :
    !git clone https://github.com/AtMoDat/AtMoDat_CVs.git {os.path.sep.join(sys.executable.split('/')[:-2])}/lib/python3.9/site-packages/atmodat_checklib/AtMoDat_CVs
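
The PATH modification above can also be written with os.path.dirname and os.pathsep, which avoids hard-coding "/" and ":". The following is only a minimal sketch: the helper name add_kernel_bin_to_path is hypothetical and not part of the atmodat checker, and it assumes that run_checks is installed next to the kernel’s python interpreter.

```python
import os
import shutil
import sys


def add_kernel_bin_to_path(environ=os.environ, executable=sys.executable):
    """Append the directory of the python interpreter to PATH (hypothetical helper)."""
    bin_dir = os.path.dirname(executable)
    # Split PATH into its entries and drop empty ones
    entries = [entry for entry in environ.get("PATH", "").split(os.pathsep) if entry]
    if bin_dir not in entries:
        entries.append(bin_dir)
        environ["PATH"] = os.pathsep.join(entries)
    return bin_dir


bin_dir = add_kernel_bin_to_path()
# shutil.which returns None if run_checks is still not found on PATH
print("run_checks found at:", shutil.which("run_checks"))
```

Calling the helper twice does not add the directory a second time, so the cell can be re-executed safely.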

Data to be checked#

In this tutorial, we will check a small subset of CMIP6 data which we obtain via intake:

import intake
# Path to master catalog on the DKRZ server
col_url = "https://dkrz.de/s/intake"
col_url = "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"
parent_col=intake.open_catalog([col_url])
list(parent_col)

# Open the catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]
# We just use the first file from the CMIP6 catalog for our experiments
exp_file=col.df["uri"].values[0]
exp_file
'/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc'

Application#

The command run_checks can be executed from any directory from within the atmodat conda environment.

The atmodat checker contains two modules:

  • one that checks the global attributes for compliance with the ATMODAT standard

  • another that performs a standard CF check (building upon the cfchecks library).

Show usage instructions of run_checks

!run_checks -h
usage: run_checks [-h] [-v] [-op OPATH] [-cfv CFVERSION] [-check WHATCHECKS]
                  [-s] [-V] [-f FILE | -p PATH | -pnr PATH_NO_RECURSIVE]

Run the AtMoDat checks suits.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Print output of checkers (longer runtime due to double
                        call of checkers)
  -op OPATH, --opath OPATH
                        Define custom path where checker output shall be
                        written
  -cfv CFVERSION, --cfversion CFVERSION
                        Define custom CF table version against which the file
                        shall be checked. Valid are versions from 1.3 to 1.8.
                        Example: "-cfv 1.6". Default is 'auto'
  -check WHATCHECKS, --whatchecks WHATCHECKS
                        Define if AtMoDat or CF check or both shall be
                        executed. Valid options: AT, CF, both. Example:
                        "-check CF". Default is 'both'
  -s, --summary         Create summary of checker output
  -V, --version         show program's version number and exit
  -f FILE, --file FILE  Processes the given file
  -p PATH, --path PATH  Processes all files in a given path and subdirectories
                        (recursive file search)
  -pnr PATH_NO_RECURSIVE, --path_no_recursive PATH_NO_RECURSIVE
                        Processes all files in a given directory

The results of the performed checks are provided in the atmodat_checker_output directory. By default, run_checks assumes write permission in the path where the atmodat checker is installed. If this is not the case, you must specify an output directory where you have write permission with the option -op output_path.

In the following block, we set the output path to the current working directory, which we get via the bash command pwd. We apply run_checks to the exp_file which we selected in the previous section.

cwd=!pwd
cwd=cwd[0]
!run_checks -f {exp_file} -op {cwd} -s
Running Compliance Checker on the datasets from: ['/work/ik1017/CMIP6/data/CMIP6/AerChemMIP/BCC/BCC-ESM1/hist-piAer/r1i1p1f1/AERmon/c2h6/gn/v20200511/c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc']
2023-07-05 14:33:44.523616 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:33:44.702117 [INFO] :: PYESSV :: ... loaded: atmodat
--- 13.0180 seconds for checking 1 files---
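
Outside of a notebook, where the ! shell escape is not available, the same call can be assembled in Python. The sketch below only uses flags listed in the usage message above (-f, -op, -check, -cfv, -s); the helper names build_run_checks_cmd and run_checks_on_file are hypothetical, and the file name example.nc is a placeholder.

```python
import shutil
import subprocess

VALID_CHECKS = ("AT", "CF", "both")  # values listed in the run_checks help text


def build_run_checks_cmd(file, opath, summary=True, whatchecks=None, cfversion=None):
    """Assemble the argument list for run_checks (hypothetical helper)."""
    cmd = ["run_checks", "-f", str(file), "-op", str(opath)]
    if whatchecks is not None:
        if whatchecks not in VALID_CHECKS:
            raise ValueError(f"whatchecks must be one of {VALID_CHECKS}")
        cmd += ["-check", whatchecks]
    if cfversion is not None:
        cmd += ["-cfv", str(cfversion)]
    if summary:
        cmd.append("-s")
    return cmd


def run_checks_on_file(file, opath, **options):
    """Run the checker and return the CompletedProcess object."""
    if shutil.which("run_checks") is None:
        raise FileNotFoundError("run_checks is not on PATH")
    cmd = build_run_checks_cmd(file, opath, **options)
    return subprocess.run(cmd, capture_output=True, text=True)


# Only print the command here; example.nc is a placeholder file name
print(" ".join(build_run_checks_cmd("example.nc", ".", whatchecks="CF", cfversion="1.6")))
```

Passing the arguments as a list avoids shell quoting problems with long CMIP6 file paths.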

Now, we have a directory atmodat_checker_output in the output path. For each run of run_checks, a new directory named after the timestamp of the run is created inside atmodat_checker_output. Additionally, a directory latest always shows the output of the most recent run.

!ls {os.path.sep.join([cwd, "atmodat_checker_output"])}
20230705_1433  latest
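
If several runs have accumulated, the timestamped directories can be listed in chronological order. This is a minimal sketch that assumes the directory names always follow the pattern seen above (for example 20230705_1433, that is, year, month, day, hour and minute); the helper name list_checker_runs is hypothetical, and the demonstration uses a temporary directory instead of real checker output.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def list_checker_runs(output_root):
    """Return the timestamped run directories, oldest first (hypothetical helper)."""
    runs = []
    for entry in Path(output_root).iterdir():
        if not entry.is_dir() or entry.name == "latest":
            continue
        try:
            stamp = datetime.strptime(entry.name, "%Y%m%d_%H%M")
        except ValueError:
            continue  # skip directories that do not look like a timestamp
        runs.append((stamp, entry))
    return [path for _, path in sorted(runs, key=lambda item: item[0])]


# Demonstration with an artificial directory layout
with tempfile.TemporaryDirectory() as tmp:
    for name in ("20230705_1434", "latest", "20230705_1433"):
        (Path(tmp) / name).mkdir()
    run_names = [path.name for path in list_checker_runs(tmp)]

print(run_names)  # ['20230705_1433', '20230705_1434']
```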

As we ran run_checks with the option -s, one output is the short_summary.txt file which we cat in the following:

output_dir_string=os.path.sep.join(["atmodat_checker_output","latest"])
output_path=os.path.sep.join([cwd, output_dir_string])
!cat {os.path.sep.join([output_path, "short_summary.txt"])}
=== Short summary === 
 
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, CF Version 1.7
Checked at: 2023-07-05T14:33:55
 
Number of checked netCDF files: 1

Mandatory ATMODAT Standard checks passed: 4/4 (0 missing, 0 error(s))
Recommended ATMODAT Standard checks passed: 9/20 (11 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 3/9 (6 missing, 0 error(s))

CF checker errors: 0 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 2

Results#

The short summary contains information about versions, the timestamp of execution, the ratio of passed checks on attributes and errors written by the CF checker.

  • The cfchecks routine only issues a warning/information message if variable metadata are completely missing.

  • Zero errors in the cfchecks routine does not necessarily mean that a data file is CF compliant!

We can also have a look at the detailed output, including the exact error messages, in the long_summary_ files, which are subdivided by check level (mandatory, recommended, optional).
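
The pass ratios of the short summary can also be extracted programmatically, for example to compare several runs. The following sketch assumes that the summary lines keep exactly the wording shown above; the helper name parse_short_summary is hypothetical, and the sample text is copied from the output of the first run.

```python
import re

SAMPLE = """\
Mandatory ATMODAT Standard checks passed: 4/4 (0 missing, 0 error(s))
Recommended ATMODAT Standard checks passed: 9/20 (11 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 3/9 (6 missing, 0 error(s))
"""

PATTERN = re.compile(
    r"^(Mandatory|Recommended|Optional) ATMODAT Standard checks passed: "
    r"(\d+)/(\d+) \((\d+) missing, (\d+) error\(s\)\)",
    re.MULTILINE,
)


def parse_short_summary(text):
    """Map each check level to its counts (hypothetical helper)."""
    result = {}
    for level, passed, total, missing, errors in PATTERN.findall(text):
        result[level] = {
            "passed": int(passed),
            "total": int(total),
            "missing": int(missing),
            "errors": int(errors),
        }
    return result


summary = parse_short_summary(SAMPLE)
print(summary["Recommended"])
# {'passed': 9, 'total': 20, 'missing': 11, 'errors': 0}
```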

!cat {os.path.sep.join([output_path,"long_summary_recommended.csv"])}
File,Check level,Global Attribute,Error Message

,,,

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,Conventions,ATMODAT Standard information not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,creator,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,crs,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lat_resolution,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_lon_resolution,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,geospatial_vertical_resolution,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,keywords,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,product_version,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,source_type,global attribute value is invalid

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,standard_name_vocabulary,global attribute is not present

c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_185001-201412.nc,recommended,summary,global attribute is not present
!cat {os.path.sep.join([output_path,"long_summary_mandatory.csv"])}
File,Check level,Global Attribute,Error Message

,,,

We can open the .csv files with pandas to further analyse the output.

import pandas as pd
recommend_df=pd.read_csv(os.path.sep.join([output_path,"long_summary_recommended.csv"]))
recommend_df
File Check level Global Attribute Error Message
0 NaN NaN NaN NaN
1 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended Conventions ATMODAT Standard information not present
2 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended creator global attribute is not present
3 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended crs global attribute is not present
4 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended geospatial_lat_resolution global attribute is not present
5 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended geospatial_lon_resolution global attribute is not present
6 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended geospatial_vertical_resolution global attribute is not present
7 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended keywords global attribute is not present
8 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended product_version global attribute is not present
9 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended source_type global attribute value is invalid
10 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended standard_name_vocabulary global attribute is not present
11 c2h6_AERmon_BCC-ESM1_hist-piAer_r1i1p1f1_gn_18... recommended summary global attribute is not present

There may be missing global attributes which are recommended by the ATMODAT Standard. We can find them with pandas:

missing_recommend_atts=list(
    recommend_df.loc[recommend_df["Error Message"]=="global attribute is not present"]["Global Attribute"]
)
missing_recommend_atts
['creator',
 'crs',
 'geospatial_lat_resolution',
 'geospatial_lon_resolution',
 'geospatial_vertical_resolution',
 'keywords',
 'product_version',
 'standard_name_vocabulary',
 'summary']
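
In a script where pandas is not available, the same selection can be sketched with the csv module of the standard library. The helper name read_long_summary is hypothetical; the sample text reuses three rows of the output shown above, including the empty ",,," row that shows up as a NaN row in pandas.

```python
import csv
import io
from collections import Counter

SAMPLE_CSV = """\
File,Check level,Global Attribute,Error Message
,,,
c2h6.nc,recommended,Conventions,ATMODAT Standard information not present
c2h6.nc,recommended,creator,global attribute is not present
c2h6.nc,recommended,source_type,global attribute value is invalid
"""
# Note: the file name is shortened to c2h6.nc for readability


def read_long_summary(text):
    """Read a long_summary CSV and drop rows without any content (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if any(row.values())]


rows = read_long_summary(SAMPLE_CSV)
missing = [
    row["Global Attribute"]
    for row in rows
    if row["Error Message"] == "global attribute is not present"
]
message_counts = Counter(row["Error Message"] for row in rows)

print(missing)  # ['creator']
print(message_counts["global attribute value is invalid"])  # 1
```

Counting the messages makes it visible that not every failed check is a missing attribute: source_type is present but has an invalid value, so adding attributes alone will not fix it.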

Curation#

Let’s take first steps to cure the file by adding a missing attribute with xarray. We can open the file as an xarray dataset with:

import xarray as xr
exp_file_ds=xr.open_dataset(exp_file)
exp_file_ds
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lev: 26, lat: 64, lon: 128)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lev        (lev) float64 0.9926 0.9706 0.9296 ... 0.01397 0.007389 0.003545
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lev_bnds   (lev, bnds) float64 ...
    p0         float64 ...
    a          (lev) float64 ...
    b          (lev) float64 ...
    ps         (time, lat, lon) float32 ...
    a_bnds     (lev, bnds) float64 ...
    b_bnds     (lev, bnds) float64 ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    c2h6       (time, lev, lat, lon) float32 ...
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            AerChemMIP
    branch_method:          Standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  2110.0
    comment:                The experiments parallel historical from 1850 to ...
    ...                     ...
    title:                  BCC-ESM1 output prepared for CMIP6
    tracking_id:            hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c
    variable_id:            c2h6
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by BCC is licensed unde...
    cmor_version:           3.3.2

We can handle and add attributes via the dict-type attribute .attrs. Applied on the dataset, it shows all global attributes of the file:

exp_file_ds.attrs
{'Conventions': 'CF-1.7 CMIP-6.2',
 'activity_id': 'AerChemMIP',
 'branch_method': 'Standard',
 'branch_time_in_child': 0.0,
 'branch_time_in_parent': 2110.0,
 'comment': 'The experiments parallel historical from 1850 to 2014 with all forcing applied, but fix the anthropogenic emissions of Aerosol precursors to the 1850 value that is used in piControl. The same initial conditions as r1i1p1f1 of historical, branched from year 2110 in piControl.',
 'contact': 'Dr. Tongwen Wu(twwu@cma.gov.cn)',
 'creation_date': '2020-05-11T06:54:48Z',
 'data_specs_version': '01.00.27',
 'description': 'AerChemMIP:hist-piAer',
 'experiment': 'historical forcing, but with pre-industrial aerosol emissions',
 'experiment_id': 'hist-piAer',
 'external_variables': 'areacella',
 'forcing_index': 1,
 'frequency': 'mon',
 'further_info_url': 'https://furtherinfo.es-doc.org/CMIP6.BCC.BCC-ESM1.hist-piAer.none.r1i1p1f1',
 'grid': 'T42',
 'grid_label': 'gn',
 'history': '2020-05-11T06:54:48Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.',
 'initialization_index': 1,
 'institution': 'Beijing Climate Center, Beijing 100081, China',
 'institution_id': 'BCC',
 'mip_era': 'CMIP6',
 'nominal_resolution': '250 km',
 'parent_activity_id': 'CMIP',
 'parent_experiment_id': 'piControl',
 'parent_mip_era': 'CMIP6',
 'parent_source_id': 'BCC-ESM1',
 'parent_time_units': 'days since 1850-01-01',
 'parent_variant_label': 'r1i1p1f1',
 'physics_index': 1,
 'product': 'model-output',
 'realization_index': 1,
 'realm': 'aerosol',
 'references': 'Model described by Tongwen Wu et al. (JGR 2013; JMR 2014; GMD,2019). Also see http://forecast.bcccsm.ncc-cma.net/htm',
 'run_variant': 'forcing: greenhouse gases,aerosol emission,solar constant,volcano mass,land use',
 'source': 'BCC-ESM 1 (2017):   aerosol: none  atmos: BCC_AGCM3_LR (T42; 128 x 64 longitude/latitude; 26 levels; top level 2.19 hPa)  atmosChem: BCC-AGCM3-Chem  land: BCC_AVIM2  landIce: none  ocean: MOM4 (1/3 deg 10S-10N, 1/3-1 deg 10-30 N/S, and 1 deg in high latitudes; 360 x 232 longitude/latitude; 40 levels; top grid cell 0-10 m)  ocnBgchem: none  seaIce: SIS2',
 'source_id': 'BCC-ESM1',
 'source_type': 'AER AOGCM CHEM',
 'sub_experiment': 'none',
 'sub_experiment_id': 'none',
 'table_id': 'AERmon',
 'table_info': 'Creation Date:(30 July 2018) MD5:e53ff52009d0b97d9d867dc12b6096c7',
 'title': 'BCC-ESM1 output prepared for CMIP6',
 'tracking_id': 'hdl:21.14100/7be29ebc-8b8a-4fda-95e9-ac1dc8b3da8c',
 'variable_id': 'c2h6',
 'variant_label': 'r1i1p1f1',
 'license': 'CMIP6 model data produced by BCC is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law.',
 'cmor_version': '3.3.2'}

We add all missing attributes and set a dummy value for them:

for att in missing_recommend_atts:
    exp_file_ds.attrs[att]="Dummy"
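
For a real curation, the dummy values would be replaced by meaningful ones. The sketch below uses a plain dict as a stand-in for exp_file_ds.attrs, which behaves like a dict; the helper name fill_missing_attrs and all attribute values are hypothetical and only illustrate the mechanism.

```python
def fill_missing_attrs(attrs, missing, values, placeholder="Dummy"):
    """Return a copy of attrs with the missing attributes added (hypothetical helper).

    Attributes without a known value get the placeholder and are reported back,
    so that they can be filled in by hand later.
    """
    updated = dict(attrs)
    still_placeholder = []
    for name in missing:
        if name in values:
            updated[name] = values[name]
        else:
            updated[name] = placeholder
            still_placeholder.append(name)
    return updated, still_placeholder


# Stand-in for exp_file_ds.attrs and missing_recommend_atts
attrs = {"Conventions": "CF-1.7 CMIP-6.2"}
missing = ["keywords", "summary", "crs"]
# Hypothetical values, for illustration only
values = {"keywords": "ethane, CMIP6", "summary": "Monthly c2h6 from hist-piAer"}

new_attrs, todo = fill_missing_attrs(attrs, missing, values)
print(todo)  # ['crs']
```

With an xarray dataset, the result could be assigned back with exp_file_ds.attrs = new_attrs before saving.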

We save the modified dataset with the to_netcdf function:

exp_file_ds.to_netcdf("testfile-modified.nc")

Now, let’s run run_checks again.

We can also provide a directory instead of a file as an argument with the option -p. The checker will find all .nc files inside that directory and its subdirectories.

!run_checks -p {cwd} -op {cwd} -s
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/testfile-modified.nc
2023-07-05 14:34:52.550854 [INFO] :: PYESSV :: Loading vocabularies from /envs/lib/python3.11/site-packages/atmodat_checklib/AtMoDat_CVs/pyessv-archive:
2023-07-05 14:34:52.557105 [INFO] :: PYESSV :: ... loaded: atmodat
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_CCLM.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_2M_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_T_3M.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_celsius.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_gridinfo_CCLM4-8-17.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_interface.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_day.nc
Running Compliance Checker on the dataset from: /builds/data-infrastructure-services/tutorials-and-use-cases/docs/source/cdo-incl-cmor/application/handson/example_temp_mon.nc
--- 7.3080 seconds for checking 10 files---

Using the latest directory, here is the new summary:

!cat {os.path.sep.join([output_path,"short_summary.txt"])}
=== Short summary === 
 
ATMODAT Standard Compliance Checker Version: 1.3.2
Checking against: ATMODAT Standard 3.0, multiple CF versions (CF-1.7, CF-1.6, CF-1.0, CF-1.4)
Checked at: 2023-07-05T14:34:59
 
Number of checked netCDF files: 10

Mandatory ATMODAT Standard checks passed: 26/40 (13 missing, 1 error(s))
Recommended ATMODAT Standard checks passed: 33/200 (167 missing, 0 error(s))
Optional ATMODAT Standard checks passed: 6/90 (84 missing, 0 error(s))

CF checker errors: 23 (Ignoring errors related to formula_terms in boundary variables. See Known Issues section https://github.com/AtMoDat/atmodat_data_checker#known-issues )
CF checker warnings: 14

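Because this summary aggregates all 10 files, the result for the modified file cannot be read off directly. Subtracting the earlier failures from the sum of newly passed checks shows that the checks no longer fail for the modified file.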
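
To look at single files within such an aggregated run, the long_summary_ files can be grouped by the File column. This is a minimal sketch: the helper name issues_per_file is hypothetical, and the rows in the sample text are made up for illustration only; they are not real checker output.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only, not real checker output
SAMPLE_CSV = """\
File,Check level,Global Attribute,Error Message
,,,
example_T_2M.nc,recommended,creator,global attribute is not present
example_T_2M.nc,recommended,keywords,global attribute is not present
testfile-modified.nc,recommended,source_type,global attribute value is invalid
"""


def issues_per_file(csv_text):
    """Count the reported issues for each file (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Rows with an empty File column, such as the ',,,' row, are skipped
    return Counter(row["File"] for row in reader if row.get("File"))


counts = issues_per_file(SAMPLE_CSV)
print(counts["example_T_2M.nc"])  # 2
print(counts["testfile-modified.nc"])  # 1
```

A file that does not appear in the counter has no entry in that long_summary_ file.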