Interfacing with Dataset Uncertainty#

obsarray enables users to define, store and interface with uncertainty information in xarray.Dataset’s.

To access the uncertainty information for a particular measured variable, use the dataset’s unc accessor.

You can see which dataset variables have uncertainty variables associated with them by looking at unc accessor keys.

In [1]: import obsarray

In [2]: ds.unc.keys()
Out[2]: ['temperature', 'precipitation']

This means that the dataset variables temperature and precipitation have uncertainty data defined for them. We refer to these variables as “observation variables”.

Interfacing with Variable Uncertainty#

To inspect the uncertainty data defined for a particular observation variable, index the unc accessor with it’s name.

In [3]: ds.unc["temperature"]
Out[3]: 
<VariableUncertainty>
Variable Uncertainties: 'temperature'
Data variables:
    u_ran_temperature  (x, y, time) float64 96B 3.578 3.835 ... 2.993 5.463
    u_sys_temperature  (x, y, time) float64 96B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0

This returns a VariableUncertainty object, which provides an interface to an observation variable’s uncertainty information. This shows that temperature has two uncertainty variables - u_ran_temperature and u_sys_temperature.

To evaluate the total uncertainty for the observation variable, run:

In [4]: ds.unc["temperature"].total_unc()
Out[4]: 
<xarray.DataArray (x: 2, y: 2, time: 3)> Size: 96B
array([[[6.14855263, 6.30121643, 7.43347842],
        [6.29813622, 6.04956665, 5.61044954]],

       [[5.08962405, 6.06621457, 6.6745898 ],
        [6.37961975, 5.8272454 , 7.40572536]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y

which combines all the uncertainty components by sum of squares. Similarly, you can see the combined random or systematic uncertainty components (where more than one component of either is defined), as follows,

In [5]: ds.unc["temperature"].random_unc()
Out[5]: 
<xarray.DataArray (x: 2, y: 2, time: 3)> Size: 96B
array([[[3.57836547, 3.83475273, 5.50060009],
        [3.82968926, 3.40547452, 2.5450234 ]],

       [[0.95093267, 3.43496131, 4.42155504],
        [3.96226553, 2.9927895 , 5.46303652]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y

In [6]: ds.unc["temperature"].systematic_unc()
Out[6]: 
<xarray.DataArray (x: 2, y: 2, time: 3)> Size: 96B
array([[[5., 5., 5.],
        [5., 5., 5.]],

       [[5., 5., 5.],
        [5., 5., 5.]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y

You can also see the combined error-correlation matrix,

In [7]: ds.unc["temperature"].total_err_corr_matrix()
Out[7]: 
<xarray.DataArray (x.y.time: 12)> Size: 1kB
array([[1.        , 0.64527185, 0.54698451, ..., 0.63734168, 0.69775636,
        0.54903434],
       [0.64527185, 1.        , 0.53373235, ..., 0.62190037, 0.68085135,
        0.53573252],
       [0.54698451, 0.53373235, 1.        , ..., 0.52717296, 0.57714457,
        0.45413014],
       ...,
       [0.63734168, 0.62190037, 0.52717296, ..., 1.        , 0.67248392,
        0.52914855],
       [0.69775636, 0.68085135, 0.57714457, ..., 0.67248392, 1.        ,
        0.57930743],
       [0.54903434, 0.53573252, 0.45413014, ..., 0.52914855, 0.57930743,
        1.        ]])
Dimensions without coordinates: x.y.time, x.y.time

This gives the cross-element error-correlation between each element in the temperature array. The order of the observation elements along both dimensions of the error-correlation matrix is defined by the order flatten() method produces.

Similar, the error-covariance matrix,

In [8]: ds.unc["temperature"].total_err_cov_matrix()
Out[8]: 
<xarray.DataArray (x.y.time: 12)> Size: 1kB
array([[37.80469946, 25.        , 25.        , ..., 25.        ,
        25.        , 25.        ],
       [25.        , 39.70532854, 25.        , ..., 25.        ,
        25.        , 25.        ],
       [25.        , 25.        , 55.25660138, ..., 25.        ,
        25.        , 25.        ],
       ...,
       [25.        , 25.        , 25.        , ..., 40.69954815,
        25.        , 25.        ],
       [25.        , 25.        , 25.        , ..., 25.        ,
        33.95678899, 25.        ],
       [25.        , 25.        , 25.        , ..., 25.        ,
        25.        , 54.84476806]])
Dimensions without coordinates: x.y.time, x.y.time

You can also do this to access a subset of the total error-covariance matrix by indexing with the slice of interest (this can avoid building the whole error-covariance matrix in memory).

# error-covariance matrix for measurements at one time step
In [9]: ds.unc["temperature"][:,:,1].total_err_cov_matrix()
Out[9]: 
<xarray.DataArray (x.y: 4)> Size: 128B
array([[39.70532854, 25.        , 25.        , 25.        ],
       [25.        , 36.5972567 , 25.        , 25.        ],
       [25.        , 25.        , 36.79895923, 25.        ],
       [25.        , 25.        , 25.        , 33.95678899]])
Dimensions without coordinates: x.y, x.y

Interfacing with Uncertainty Components#

To inspect a specific uncertainty component of an observation variable, index the variable uncertainty with its name.

In [10]: ds.unc["temperature"]["u_ran_temperature"]
Out[10]: 
<Uncertainty> 
'u_ran_temperature' (x: 2, y: 2, time: 3)> Size: 96B
array([[[3.57836547, 3.83475273, 5.50060009],
        [3.82968926, 3.40547452, 2.5450234 ]],

       [[0.95093267, 3.43496131, 4.42155504],
        [3.96226553, 2.9927895 , 5.46303652]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Attributes: (12/14)
    _FillValue:         9.969209968386869e+36
    err_corr_1_dim:     x
    err_corr_1_form:    random
    err_corr_1_units:   []
    err_corr_1_params:  []
    err_corr_2_dim:     y
    ...                 ...
    err_corr_2_params:  []
    err_corr_3_dim:     time
    err_corr_3_form:    random
    err_corr_3_units:   []
    err_corr_3_params:  []
    pdf_shape:          gaussian

This returns a Uncertainty object, which provides an interface to a specific uncertainty variable.

The error correlation structure of the uncertainty variable can be inspected as follows:

In [11]: ds.unc["temperature"]["u_ran_temperature"].err_corr
Out[11]: [('time', random []), ('x', random []), ('y', random [])]

In [12]: ds.unc["temperature"]["u_ran_temperature"].err_corr_dict()
Out[12]: {'time': 'random', 'x': 'random', 'y': 'random'}

In [13]: ds.unc["temperature"]["u_ran_temperature"].err_corr_matrix()
Out[13]: 
<xarray.DataArray (x.y.time: 12)> Size: 1kB
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
Dimensions without coordinates: x.y.time, x.y.time

Adding/Removing Uncertainty Components#

The same interface can be used to add/remove uncertainty components from the dataset, safely handling the uncertainty metadata. This is achieved following a similar syntax to the xarray convention, as ds.unc["var"]["u_var"] = (dims, values, attributes).

To define the error-correlation structure, the attributes must contain an entry called err_corr with a list that defines the error-correlation structure per data dimension (if omitted the error-correlation is assumed random). How to define these is defined in detail in the dataset templating section. See below for an example:

# Define error-correlation structure
In [14]: err_corr_def = [
   ....:     {
   ....:         "dim": ["x", "y"],
   ....:         "form": "systematic",
   ....:         "params": [],
   ....:         "units": []
   ....:     },
   ....:     {
   ....:         "dim": ["time"],
   ....:         "form": "random",
   ....:         "params": [],
   ....:         "units": []
   ....:     }
   ....: ]
   ....: 

# Define uncertainty values at 5%
In [15]: unc_values = ds["temperature"] * 0.05

# Add uncertainty variable
In [16]: ds.unc["temperature"]["u_str_temperature"] = (
   ....:     ["x", "y", "time"],
   ....:     unc_values,
   ....:     {"err_corr": err_corr_def, "pdf_shape": "gaussian"}
   ....: )
   ....: 

# Check uncertainties
In [17]: ds.unc["temperature"].keys()
Out[17]: ['u_ran_temperature', 'u_sys_temperature', 'u_str_temperature']

A component of uncertainty can be simply be deleted as,

In [18]: del ds.unc["temperature"]["u_str_temperature"]

# Check uncertainties
In [19]: ds.unc["temperature"].keys()
Out[19]: ['u_ran_temperature', 'u_sys_temperature']

Renaming Variables#

The storage of uncertainty information is underpinned by variable attributes, which include referencing other variables (for example, which variables are the uncertainties associated with a particular observation variable). Because of this it is important, if renaming uncertainty variables, to use obsarray’s renaming functionality. This renames the uncertainty variable and safely updates attribute variable references. This is done as follows:

In [20]: print(ds.unc["temperature"])
<VariableUncertainty>
Variable Uncertainties: 'temperature'
Data variables:
    u_ran_temperature  (x, y, time) float64 96B 3.578 3.835 ... 2.993 5.463
    u_sys_temperature  (x, y, time) float64 96B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0

In [21]: ds = ds.unc["temperature"]["u_ran_temperature"].rename("u_noise")

In [22]: print(ds.unc["temperature"])
<VariableUncertainty>
Variable Uncertainties: 'temperature'
Data variables:
    u_noise            (x, y, time) float64 96B 3.578 3.835 ... 2.993 5.463
    u_sys_temperature  (x, y, time) float64 96B 5.0 5.0 5.0 5.0 ... 5.0 5.0 5.0