Data validation

Validation is triggered automatically whenever header and data objects are created by reading data, and whenever header or data tables are (re)set, for example when making selections.

Validation checks that the minimum required columns are present (see Data structures) and that they contain data of the correct datatype (e.g. strings, floats, integers). In addition, it tests some basic logic that the table data must adhere to. For example: GeoST defines layer tops and bottoms as positive downward starting from 0, so the ‘top’ of a layer must not be negative and must be smaller than the ‘bottom’ of that layer. These checks ensure that GeoST functions always work on strictly defined, valid data, which leads to reproducible results.
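To make the layer logic concrete, here is a minimal sketch of such a check written in plain pandas. It is not GeoST's implementation: the function name `check_layer_logic` is hypothetical, and the two conditions are assumptions based on the positive-downward convention and on the examples on this page (a top of 0 is accepted, a top of 1.1 with a bottom of 1 is rejected).

```python
import pandas as pd


def check_layer_logic(df: pd.DataFrame) -> pd.Series:
    """Return a boolean Series that is True where a layer row passes both checks.

    Hypothetical helper for illustration only, not part of GeoST.
    """
    top_not_negative = df["top"] >= 0          # depths start at 0 and increase downward
    top_above_bottom = df["top"] < df["bottom"]  # the top is shallower than the bottom
    return top_not_negative & top_above_bottom


layers = pd.DataFrame({"top": [0.0, 1.1, -0.5], "bottom": [0.5, 1.0, 0.2]})
passed = check_layer_logic(layers)
print(passed.tolist())  # [True, False, False]
```

The second row fails because its top lies below its bottom, and the third fails because its top is negative.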

Validation settings

A number of global settings control the validation behaviour:

| Setting | Description | Default |
| --- | --- | --- |
| VERBOSE | If True, the details of validation errors will be printed to the console | True |
| DROP_INVALID | If True, invalid rows will automatically be dropped from (Geo)DataFrames | True |
| FLAG_INVALID | If True, invalid rows will be flagged in (Geo)DataFrames. Only works if DROP_INVALID is False | False |
| AUTO_ALIGN | If True, collection headers and data tables will automatically be aligned | True |

You can access and manipulate these settings through the geost config module:

from geost import config

# E.g. turning off verbose validation warnings
config.validation.VERBOSE = False
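Because these settings are global, a change stays in effect until you set it back. One way to limit a change to a single block of code is a small context manager, sketched below. `temporary_settings` is a hypothetical helper and not part of GeoST. To keep the sketch runnable on its own, it operates on a `SimpleNamespace` stand-in rather than on `config.validation`, on the assumption that the real settings are plain attributes, as the assignment above suggests.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_settings(settings, **overrides):
    """Set attributes on `settings` inside a with-block and restore them afterwards."""
    previous = {name: getattr(settings, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(settings, name, value)
        yield settings
    finally:
        # Restore the old values even if the block raised an exception
        for name, value in previous.items():
            setattr(settings, name, value)


# Stand-in for geost.config.validation, holding the documented defaults
validation = SimpleNamespace(
    VERBOSE=True, DROP_INVALID=True, FLAG_INVALID=False, AUTO_ALIGN=True
)

with temporary_settings(validation, DROP_INVALID=False, FLAG_INVALID=True):
    inside = (validation.DROP_INVALID, validation.FLAG_INVALID)

after = (validation.DROP_INVALID, validation.FLAG_INVALID)
print(inside, after)  # (False, True) (True, False)
```

With GeoST installed, you would presumably pass `config.validation` in place of the stand-in.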

Examples

The examples below create a dataframe with some layer data and deliberately introduce problems, to show what to expect from validation under the different settings.

import pandas as pd

from geost.base import LayeredData

# A dataframe that describes two layers correctly according to GeoST standards
df_correct = pd.DataFrame(
    {
        "nr": ["B-01", "B-01"],
        "x": [100, 100],
        "y": [200, 200],
        "surface": [0, 0],
        "end": [-1, -1],
        "top": [0, 0.5],
        "bottom": [0.5, 1],
        "lithoclass": ["K", "Z"],
    }
)

# Creating a LayeredData object from this table triggers validation.
# -> All good, no warnings!
layered_data = LayeredData(df_correct)

print(layered_data)
LayeredData instance:
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
1  B-01  100.0  200.0      0.0 -1.0  0.5     1.0          Z

Now we change the top of the second layer to 1.1. This is invalid, because the bottom of this layer is 1. We therefore expect a ValidationWarning.

# Create an invalid layer in the example dataframe
df_invalid = df_correct.copy()
df_invalid.loc[1, "top"] = 1.1

# Validation setting VERBOSE on to show warning details
config.validation.VERBOSE = True

# Creating a LayeredData object from the invalid dataframe
# -> triggers a ValidationWarning
layered_data = LayeredData(df_invalid)

# Because the setting DROP_INVALID is turned on, the invalid layer is dropped
print(layered_data)
LayeredData instance:
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
/home/runner/work/geost/geost/geost/validation/validate.py:46: ValidationWarning: 
Validation dropped 1 row(s) for schema 'Layer data non-inclined'.
Dropped indices: [1]

  warnings.warn(

In the above example a ValidationWarning reports the validation failure. Because the setting DROP_INVALID is turned on, the message states that one row was dropped from the table. As you can see, layered_data now includes only the valid row.
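Since validation problems arrive as warnings rather than exceptions, a script can also inspect them programmatically with Python's standard `warnings` module. The sketch below uses stand-ins so that it runs without GeoST: the local `ValidationWarning` class and the `load_layers` function are hypothetical placeholders for GeoST's own warning class (whose import path is not shown on this page) and for a call such as `LayeredData(df_invalid)`.

```python
import warnings


class ValidationWarning(UserWarning):
    """Stand-in for GeoST's ValidationWarning (hypothetical, for illustration)."""


def load_layers():
    """Stand-in for a call such as LayeredData(df_invalid) that emits a warning."""
    warnings.warn("Validation dropped 1 row(s).", ValidationWarning)
    return "layered data"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    result = load_layers()

# Keep only the messages that come from validation
validation_messages = [
    str(w.message) for w in caught if issubclass(w.category, ValidationWarning)
]
print(validation_messages)  # ['Validation dropped 1 row(s).']
```

The same module can turn a warning category into an exception with `warnings.simplefilter("error", ValidationWarning)`, which may be useful in automated pipelines where invalid input should stop processing.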

If you don’t want to drop rows automatically, turn DROP_INVALID off. Do so at your own risk, as GeoST functions may break on invalid data. With DROP_INVALID off, you can also turn on FLAG_INVALID to add a column that indicates whether each row passed validation.

config.validation.DROP_INVALID = False

layered_data = LayeredData(df_invalid)

# Layered data retains invalid layers, use at your own risk!
print(layered_data)
LayeredData instance:
     nr    x    y  surface  end  top  bottom lithoclass
0  B-01  100  200        0   -1  0.0     0.5          K
1  B-01  100  200        0   -1  1.1     1.0          Z
/home/runner/work/geost/geost/geost/validation/validate.py:54: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z

  warnings.warn(

config.validation.FLAG_INVALID = True

layered_data = LayeredData(df_invalid)

# Layered data retains invalid layers, but a column 'is_valid' is added to indicate validity
print(layered_data)
LayeredData instance:
     nr    x    y  surface  end  top  bottom lithoclass  is_valid
0  B-01  100  200        0   -1  0.0     0.5          K      True
1  B-01  100  200        0   -1  1.1     1.0          Z     False
/home/runner/work/geost/geost/geost/validation/validate.py:54: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z

  warnings.warn(
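
Once rows are flagged, the `is_valid` column can be used like any other boolean column to inspect or separate the problem rows. The sketch below works on a plain pandas DataFrame built by hand with values from the example above, because this page does not show how to retrieve the table from a LayeredData object.

```python
import pandas as pd

# Table as it looks after validation with FLAG_INVALID on (subset of columns)
flagged = pd.DataFrame(
    {
        "nr": ["B-01", "B-01"],
        "top": [0.0, 1.1],
        "bottom": [0.5, 1.0],
        "lithoclass": ["K", "Z"],
        "is_valid": [True, False],
    }
)

# Rows that failed validation, for inspection or repair
invalid_rows = flagged[~flagged["is_valid"]]

# Rows that passed, with the helper column removed again
valid_rows = flagged[flagged["is_valid"]].drop(columns="is_valid")

print(invalid_rows.index.tolist())  # [1]
print(valid_rows.columns.tolist())  # ['nr', 'top', 'bottom', 'lithoclass']
```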

Advanced: manually validating data

In principle, there is no need to worry about validating data, as validation is called automatically when reading, parsing and manipulating data. Should the need arise, however, you can apply validation to a dataframe manually. This can be useful, for example, after a custom operation on a dataframe, when you are unsure whether the result is still compatible with GeoST.

We recommend using the geost.validation.safe_validate function, as it is designed to raise warnings (instead of errors) and takes the global validation settings into account. You must also choose which pre-defined data schema to validate against. The available schemas are in the module geost.validation.schemas. Alternatively, you can use the Pandera DataFrameSchemas found in this module directly.

from geost.validation import safe_validate, schemas

# Validate the df_correct dataframe using the layer dataschema
# -> no warning!
df_valid = safe_validate(schemas.layerdata, df_correct)
print(df_valid)

# Validate the df_invalid dataframe using the layer dataschema
# -> warning, is_valid column added because FLAG_INVALID setting is on.
df_invalid_flagged = safe_validate(schemas.layerdata, df_invalid)
print(df_invalid_flagged)
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
1  B-01  100.0  200.0      0.0 -1.0  0.5     1.0          Z
     nr    x    y  surface  end  top  bottom lithoclass  is_valid
0  B-01  100  200        0   -1  0.0     0.5          K      True
1  B-01  100  200        0   -1  1.1     1.0          Z     False
/home/runner/work/geost/geost/geost/validation/validate.py:54: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z, False

  warnings.warn(