Prepare modeling data for GIS analyses#

The most common way to work with climate modeling data sets is storing them in NetCDF or GRIB format and processing them using Xarray. Raster data sets used in GIS applications are usually stored in raster formats such as GeoTIFF, BIL, or IMG and being processed mainly with the help of GDAL-based python libraries. While modeling data sets are normally georeferenced using grids explicitly described by coordinates and conforming to e.g., CF sandard, geographic reference of GIS data sets is described by a coordinate reference system (CRS). This diferences imply some steps that need to be undertaken in order to convert data sets from one format to another.

In this notebook we create an aggregated map of pre-industrial Vertically Averaged Sea Water Potential Temperature (thetaot) from CMIP6 and save it as GeoTIFF.

First, we import all necessary libraries.

%matplotlib inline
import intake, requests, aiohttp
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import numpy as np
import xarray as xr
import cf_xarray as cfxr
import xesmf as xe
import scipy.sparse as sps

# Rioxarray is an experimental interface between Xarray and Rasterio libraries
# made for loading GDAL-readable rasters as xarray datasets and saving xarrays as raster files.
import rioxarray
#print("Using roocs/clisops in version %s" % cl.__version__)
print("Using xESMF in version %s" % xe.__version__)

xr.set_options(display_style = "html");

import warnings
warnings.simplefilter("ignore") 
ERROR 1: PROJ: proj_create_from_database: Open of /envs/share/proj failed
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /envs/lib/python3.11/site-packages/esmpy/interface/loadESMF.py:26
     25 try:
---> 26     esmfmk = os.environ["ESMFMKFILE"]
     27 except:

File <frozen os>:679, in __getitem__(self, key)

KeyError: 'ESMFMKFILE'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
File /envs/lib/python3.11/site-packages/xesmf/backend.py:22
     21 try:
---> 22     import esmpy as ESMF
     23 except ImportError:

File /envs/lib/python3.11/site-packages/esmpy/__init__.py:112
    110 #### IMPORT LIBRARIES #########################################################
--> 112 from esmpy.api.esmpymanager import *
    113 from esmpy.api.grid import *

File /envs/lib/python3.11/site-packages/esmpy/api/esmpymanager.py:9
      7 #### IMPORT LIBRARIES #########################################################
----> 9 from esmpy.interface.cbindings import *
     11 from esmpy.api.constants import *

File /envs/lib/python3.11/site-packages/esmpy/interface/cbindings.py:13
     12 from esmpy.util.decorators import *
---> 13 from esmpy.interface.loadESMF import _ESMF
     16 def copy_struct(src):

File /envs/lib/python3.11/site-packages/esmpy/interface/loadESMF.py:28
     27 except:
---> 28     raise ImportError('The ESMFMKFILE environment variable is not available.')
     30 #### INVESTIGATE esmf.mk ######################################################
     31 
     32 # TODO: look for various dependecies in the ESMF build log
   (...)
     37 #       use this information to set variables that can be checked at beginning
     38 #       of the routines that require an ESMF build with these dependencies

ImportError: The ESMFMKFILE environment variable is not available.

During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 8
      6 import xarray as xr
      7 import cf_xarray as cfxr
----> 8 import xesmf as xe
      9 import scipy.sparse as sps
     11 # Rioxarray is an experimental interface between Xarray and Rasterio libraries
     12 # made for loading GDAL-readable rasters as xarray datasets and saving xarrays as raster files.

File /envs/lib/python3.11/site-packages/xesmf/__init__.py:4
      1 # flake8: noqa
      3 from . import data, util
----> 4 from .frontend import Regridder, SpatialAverager
      6 try:
      7     from ._version import __version__

File /envs/lib/python3.11/site-packages/xesmf/frontend.py:12
      9 import xarray as xr
     10 from xarray import DataArray, Dataset
---> 12 from .backend import Grid, LocStream, Mesh, add_corner, esmf_regrid_build, esmf_regrid_finalize
     13 from .smm import (
     14     _combine_weight_multipoly,
     15     _parse_coords_and_values,
   (...)
     19     read_weights,
     20 )
     21 from .util import LAT_CF_ATTRS, LON_CF_ATTRS, split_polygons_and_holes

File /envs/lib/python3.11/site-packages/xesmf/backend.py:24
     22     import esmpy as ESMF
     23 except ImportError:
---> 24     import ESMF
     25 import numpy as np
     26 import numpy.lib.recfunctions as nprec

ModuleNotFoundError: No module named 'ESMF'

1. Find and load the data set#

Then we find the data set we need in the DKRZ open catalogue and print its metadata.

# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
# Print DKRZ open catalogues
list(dkrz_catalog)
['dkrz_cmip5_archive',
 'dkrz_cmip5_disk',
 'dkrz_cmip6_cloud',
 'dkrz_cmip6_disk',
 'dkrz_cordex_disk',
 'dkrz_dyamond-winter_disk',
 'dkrz_era5_disk',
 'dkrz_monsoon_disk',
 'dkrz_mpige_disk',
 'dkrz_nextgems_disk',
 'dkrz_palmod2_disk']
# Open the catalog with the intake package and name it "col" as short for "collection"
cols=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["read_csv_kwargs"]["usecols"]+["opendap_url"]
col=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cols))
cat = col.search(variable_id = "thetaot",
                 experiment_id = "historical",
                time_range = "185001-194912",
                member_id = "r1i1p1f3"
                )
cat.df
activity_id institution_id source_id experiment_id member_id table_id variable_id grid_label dcpp_init_year version time_range opendap_url project format uri
0 CMIP MOHC HadGEM3-GC31-LL historical r1i1p1f3 Emon thetaot gn NaN v20200210 185001-194912 http://esgf3.dkrz.de/thredds/dodsC/cmip6/CMIP/... CMIP6 netcdf /work/ik1017/CMIP6/data/CMIP6/CMIP/MOHC/HadGEM...
dset = xr.open_dataset(cat.df.uri.values[0], chunks = {"time": 6})
dset["thetaot"]
<xarray.DataArray 'thetaot' (time: 1200, j: 330, i: 360)>
dask.array<open_dataset-thetaot, shape=(1200, 330, 360), dtype=float32, chunksize=(6, 330, 360), chunktype=numpy.ndarray>
Coordinates:
  * time       (time) object 1850-01-16 00:00:00 ... 1949-12-16 00:00:00
  * j          (j) int32 0 1 2 3 4 5 6 7 8 ... 322 323 324 325 326 327 328 329
  * i          (i) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
    latitude   (j, i) float32 dask.array<chunksize=(330, 360), meta=np.ndarray>
    longitude  (j, i) float32 dask.array<chunksize=(330, 360), meta=np.ndarray>
Attributes:
    standard_name:  sea_water_potential_temperature
    long_name:      Vertically Averaged Sea Water Potential Temperature
    comment:        Vertical average of the sea water potential temperature t...
    units:          degC
    original_name:  mo: level_mean((variable_name: thetao), (variable_name: t...
    cell_methods:   area: depth: time: mean
    cell_measures:  area: areacello

As we can see, coordinates are stored as arrays of grid cell indices, which is not a common format in the GIS community.

2. Calculate pre-industrail mean#

We make a subset limiting the data set to pre-industrial era, calculate the mean values per grid cell, and plot the result.

dset["time"] = dset["time"].dt.strftime("%Y%m%d")
pre_industrial = dset["thetaot"].sel(time = slice("1850", "1880"))
pre_industrial_mean = pre_industrial.mean(dim = "time", keep_attrs = True)
pre_industrial_mean.plot() 
<matplotlib.collections.QuadMesh at 0x7f4933724cd0>
_images/cc84557df5952ec2861d423f0a925f69bf16136225f1007f137ef06240afa108.png

If we plot the coastlines from cartopy over the dataset, we will see that the coordinate systems of these datasets are… a bit different.

fig = plt.figure(1, figsize = [30, 13])

# Set the projection to use for plotting
ax = plt.subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.coastlines()

# Pass ax as an argument when plotting. Here we assume data is in the same coordinate reference system than the projection chosen for plotting
# isel allows to select by indices instead of the time values
pre_industrial_mean.plot.pcolormesh(ax = ax, cmap = "coolwarm")
<cartopy.mpl.geocollection.GeoQuadMesh at 0x7f49305aa990>
_images/3a11bfb2aa916d87ab69d5217cfcf342418d274db320caca9952da9a317c6d31.png

If we visualize our data set’s grid, it becomes obvious that it is far from a standard grid.

import textwrap
def plot_curv_grid(ds, var = "tos"):    
        lat = cfxr.accessor._get_with_standard_name(ds, "latitude")[0]
        lon = cfxr.accessor._get_with_standard_name(ds, "longitude")[0]        
        if any([i == None for i in [lat, lon]]): 
            print(ds.attrs["source_id"], ": Cannot identify latitude/longitude.")
            return
        plt.figure(figsize = (16, 9), dpi=120)
        plt.scatter(x = ds[lon], y = ds[lat], s = 0.01)
        # x, y = np.meshgrid(ds[lon], ds[lat])
        # plt.scatter(x = x, y = y, s = 0.01)
        try:
            plt.title("\n".join(textwrap.wrap(ds.attrs["source_id"] + "(" + ds.attrs["source"].split("ocean:")[-1].split("\n")[0] + ")", 120)))
        except (KeyError, IndexError):
            plt.title(ds.attrs["source_id"])  
plot_curv_grid(dset)
_images/11ecabf34aae4ab506fb92de57b0b7a9046e14f2e9d69bb4db0b4e0e81efbf06.png

3. Reformat the grid and regridd the data set.#

As we mentioned before, the coordinates need to be re-defined.

# Later one can just install and use clisops, once functions like this are merged into the master branch
def grid_reformat(ds):

    # Define lat, lon, lat_bnds, lon_bnds
    lat = ds.lat[:, 0]
    lon = ds.lon[0, :]
    lat_bnds = np.zeros((lat.shape[0], 2), dtype = "double")
    lon_bnds = np.zeros((lon.shape[0], 2), dtype = "double")

    # From (N + 1, M + 1) shaped bounds to (N, M, 4) shaped vertices
    lat_vertices = cfxr.vertices_to_bounds(
        ds.lat_b, ("bnds", "lat", "lon")
    ).values
    lon_vertices = cfxr.vertices_to_bounds(
        ds.lon_b, ("bnds", "lat", "lon")
    ).values

    lat_vertices = np.moveaxis(lat_vertices, 0, -1)
    lon_vertices = np.moveaxis(lon_vertices, 0, -1)

    # From (N, M, 4) shaped vertices to (N, 2)  and (M, 2) shaped bounds
    lat_bnds[:, 0] = np.min(lat_vertices[:, 0, :], axis = 1)
    lat_bnds[:, 1] = np.max(lat_vertices[:, 0, :], axis = 1)
    lon_bnds[:, 0] = np.min(lon_vertices[0, :, :], axis = 1)
    lon_bnds[:, 1] = np.max(lon_vertices[0, :, :], axis = 1)

    # Create dataset
    ds_ref = xr.Dataset(
        coords = {
            "lat": (["lat"], lat.data),
            "lon": (["lon"], lon.data),
            "lat_bnds": (["lat", "bnds"], lat_bnds.data),
            "lon_bnds": (["lon", "bnds"], lon_bnds.data),
        },
    )


    # Add variable attributes to the coordinate variables
    ds_ref["lat"].attrs = {
        "bounds": "lat_bnds",
        "units": "degrees_north",
        "long_name": "latitude",
        "standard_name": "latitude",
        "axis": "Y",
    }
    ds_ref["lon"].attrs = {
        "bounds": "lon_bnds",
        "units": "degrees_east",
        "long_name": "longitude",
        "standard_name": "longitude",
        "axis": "X",
    }
    ds_ref["lat_bnds"].attrs = {
        "long_name": "latitude_bounds",
        "units": "degrees_north",
    }
    ds_ref["lon_bnds"].attrs = {
        "long_name": "longitude_bounds",
        "units": "degrees_east",
    }
    
    return ds_ref
# Specify a global 1deg target grid

# With clisops
#ds_out = clore.Grid(grid_instructor=(1., )).ds

# Without clisops
ds_out = grid_reformat(xe.util.grid_global(1., 1.))

ds_out
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 7
      1 # Specify a global 1deg target grid
      2 
      3 # With clisops
      4 #ds_out = clore.Grid(grid_instructor=(1., )).ds
      5 
      6 # Without clisops
----> 7 ds_out = grid_reformat(xe.util.grid_global(1., 1.))
      9 ds_out

NameError: name 'xe' is not defined

And now we regrid the data set to a normal 1-degree global grid.

# In case of problems, activate ESMF verbose mode
import ESMF
ESMF.Manager(debug = True)

# Regridding methods
#method_list = ["bilinear", "nearest_s2d", "patch", "conservative", "conservative_normed"]
method_list = ["bilinear", "nearest_s2d", "patch"]

# Function to generate the weights
#   If grids have problems of degenerated cells near the poles there is the ignore_degenerate option
def regrid(ds_in, ds_out, method, periodic, unmapped_to_nan = True, ignore_degenerate = None):
    """Convenience function for calculating regridding weights"""
    ds_in["lat"] = ds_in["latitude"]
    ds_in["lon"] = ds_in["longitude"]
    return xe.Regridder(ds_in, ds_out, method, periodic = periodic, 
                        ignore_degenerate = ignore_degenerate) 
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[10], line 2
      1 # In case of problems, activate ESMF verbose mode
----> 2 import ESMF
      3 ESMF.Manager(debug = True)
      5 # Regridding methods
      6 #method_list = ["bilinear", "nearest_s2d", "patch", "conservative", "conservative_normed"]

ModuleNotFoundError: No module named 'ESMF'
%time regridder = regrid(pre_industrial_mean, ds_out, "bilinear", periodic = True, unmapped_to_nan = True, ignore_degenerate = None) 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:1

NameError: name 'regrid' is not defined
regridder
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 regridder

NameError: name 'regridder' is not defined
regridded_ds = regridder(pre_industrial_mean, keep_attrs = True)
regridded_ds
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 regridded_ds = regridder(pre_industrial_mean, keep_attrs = True)
      2 regridded_ds

NameError: name 'regridder' is not defined
fig = plt.figure(1, figsize = [30, 13])

# Set the projection to use for plotting
ax = plt.subplot(1, 1, 1, projection = ccrs.PlateCarree())
ax.coastlines()

# Pass ax as an argument when plotting. Here we assume data is in the same coordinate reference system than the projection chosen for plotting
# isel allows to select by indices instead of the time values
regridded_ds.plot.pcolormesh(ax = ax, cmap = "coolwarm")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 9
      5 ax.coastlines()
      7 # Pass ax as an argument when plotting. Here we assume data is in the same coordinate reference system than the projection chosen for plotting
      8 # isel allows to select by indices instead of the time values
----> 9 regridded_ds.plot.pcolormesh(ax = ax, cmap = "coolwarm")

NameError: name 'regridded_ds' is not defined
_images/17a193bc6bbfbdd24bc60f8fa2ab1e69cd3445274ad3d682558984a325aa1b27.png

As we can see, the data has a correct CRS now.

4. Write a CRS and save the data set as a GeoTIFF#

It is pretty straitforward to write a CRS to an xarray with rioxarray. Here we used the World Geodetic System 1984 CRS which is quite common and used by default, for example in QGIS.

#You might need to rename the axes before
transformed = regridded_ds.rename({"lon":"x", "lat":"y"})

transformed = transformed.rio.write_crs("EPSG:4326")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 2
      1 #You might need to rename the axes before
----> 2 transformed = regridded_ds.rename({"lon":"x", "lat":"y"})
      4 transformed = transformed.rio.write_crs("EPSG:4326")

NameError: name 'regridded_ds' is not defined

If we want to re-project the data set to e.g., Web Mercator, wich is used by default in Google Maps and ArcGIS, we do it as follows:

transformed = transformed.rio.reproject("EPSG:3857")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 transformed = transformed.rio.reproject("EPSG:3857")

NameError: name 'transformed' is not defined

And finally, saving our dataset as a GeoTIFF is as easy as this:

transformed.rio.to_raster("regridded_3857.tif")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 transformed.rio.to_raster("regridded_3857.tif")

NameError: name 'transformed' is not defined