Creating Datasets Using Templates#

Basic template definition#

obsarray can create xarray.Dataset’s to a particular template, defined by a dict (referred to hereafter as template dictionaries).

Every key in the template dictionary is the name of a variable, with the corresponding entry a further variable specification dictionary (referred to hereafter as variable dictionaries). Each variable dictionary defines the following entries:

  • dim - list of variable dimension names.

  • dtype - variable data type, generally a numpy.dtype, though for some special variables particular values may be required.

  • attributes - dictionary of variable metadata, for some special variables particular entries may be required.

  • encoding - (optional) variable encoding.

Altogether then, we can define the following basic template:

In [1]: import numpy as np

In [2]: temp_var_dict = {
   ...:     "dim": ["lon", "lat", "time"],
   ...:     "dtype": np.float32,
   ...:     "attributes": {"units": "K"}
   ...: }
   ...: 

In [3]: template = {
   ...:     "temp": temp_var_dict,
   ...: }
   ...: 

Special variable types#

obsarray’s special variables allow the quick definition of variables in a set of standardised templates. The following section describes the types of special variable available and how to define them in a template.

Flags#

Setting the dtype as "flag" in the variable dictionary builds a variable in the cf conventions flag format. Each datum bit corresponds to boolean condition flag with a given meaning.

The variable must be defined with a "flag_meanings" attribute that lists the per bit flag meanings as follows:

In [4]: variables = {
   ...:     "quality_flag": {
   ...:         "dim": ["x", "y"],
   ...:         "dtype": "flag",
   ...:         "attributes": {
   ...:             "flag_meanings": ["good_data", "bad_data"]
   ...:         }
   ...:     }
   ...: }
   ...: 

The smallest necessary integer is used as the flag variable numpy.dtype, given the number of flag meanings defined (i.e., defining 7 flag meanings results in an 8-bit integer variable).

Once built, flag variables can be interfaced with via the obsarray’s flag accessor (extension to xarray.Dataset) - see the section on interfacing with flags for more.

Uncertainties#

Recent work in the Earth Observation metrology domain is working towards the standardisation of the representation of measurement uncertainty information in data, with a particular focus on capturing the error-covariance associated with the uncertainty. Although it can be the case that for large, multi-dimensional arrays of measurements storing a full error-covariance matrix would be impractical, often the error-covariance between measurements may be efficiently parameterised. Work to standardise such parameterisations is on-going (see for example the EU H2020 FIDUCEO project defintions list in Appendix A of this project report).

obsarray enables the specification of such error-correlation parameterisations for uncertainty variables through the variable attributes. This is achieved by including an "err_corr" entry in the variable dictionary attributes. This "err_corr" entry is a list of dictionaries defining the error-correlation along one or more dimensions, which should include the following entries:

  • dim (str/list) - name of the dimension(s) as a str or list of str’s (i.e. from dim_names)

  • form (str) - error-correlation form, defines functional form of error-correlation structure along dimension. Suggested error-correlation forms are defined in a table below.

  • params (list) - (optional) parameters of the error-correlation structure defining function for dimension if required. The number of parameters required depends on the particular form.

  • units (list) - (optional) units of the error-correlation function parameters for dimension (ordered as the parameters)

Measurement variables with uncertainties should include a list of unc_comps variable names in their attributes.

Suggested error correlation parameterisations (to be extended in future)#

Form Name

Parameters

Description

"random"

None required

Errors uncorrelated along dimension(s)

"systematic"

None required

Errors fully correlated along dimension(s)

"custom"

Error-correlation matrix variable name

Error-correlation for dimension(s) not parameterised, defined as a full matrix in another named variable in dataset.

Updating the above example to include an uncertainty component, we can therefore define:

In [5]: import numpy as np

In [6]: temp_var_dict = {
   ...:     "dim": ["lon", "lat", "time"],
   ...:     "dtype": np.float32,
   ...:     "attributes": {"units": "K"}
   ...: }
   ...: 

In [7]: u_temp_var_dict = {
   ...:     "dim": ["lon", "lat", "time"],
   ...:     "dtype": np.float16,
   ...:     "attributes": {
   ...:         "units": "K",
   ...:         "err_corr": [{"dim": ["lat", "lon"], "form": "systematic"}]
   ...:     }
   ...: }
   ...: 

In [8]: template = {
   ...:     "temp": temp_var_dict,
   ...:     "u_temp": u_temp_var_dict,
   ...: }
   ...: 

If the error-correlation structure is not defined along a particular dimension (i.e. it is not included in err_corr), the error-correlation is assumed random in this dimension. So, in the above example, the u_temp uncertainty is defined to be systematic between all spatial points (i.e., across the lat and lon dimensions) at each time step, but random between time steps (i.e, along the time dimension) as this is not explicitly defined.

Once built, uncertainty variables can be interfaced with via the obsarray’s unc accessor (extension to xarray.Dataset) - see the section on interfacing with data uncertainty for more.

Creating a template dataset#

With the template dictionary prepared, only two more specifications are required to build a template dataset. First a dictionary that defines the sizes of all the dimensions used in the template dictionary, e.g.:

In [9]: dim_sizes = {"lat": 20, "lon": 10, "time": 5}

Secondly, a dictionary of dataset global metadata, e.g.:

In [10]: metadata = {"dataset_name": "temperature dataset"}

Combining the above together a template dataset can be created as follows:

In [11]: import obsarray

In [12]: ds = obsarray.create_ds(
   ....:     template,
   ....:     dim_sizes,
   ....:     metadata
   ....: )
   ....: 

In [13]: print(ds)
<xarray.Dataset> Size: 6kB
Dimensions:  (lon: 10, lat: 20, time: 5)
Dimensions without coordinates: lon, lat, time
Data variables:
    temp     (lon, lat, time) float32 4kB 9.969e+36 9.969e+36 ... 9.969e+36
    u_temp   (lon, lat, time) float16 2kB nan nan nan nan ... nan nan nan nan
Attributes:
    dataset_name:  temperature dataset

Where ds is an empty xarray dataset with variables defined by the template definition. Fill values for the empty arrays are chosen using the cf convention values.

Populating and writing the dataset can then be achieved using xarray’s builtin functionality.