Data validation#

Validation is triggered automatically when header and data objects are created by reading data, or when header and data tables are (re)set (which happens, for example, when making selections).

Validation checks that the minimum required columns are present (see Data structures) and that they contain data of the correct datatype (e.g. strings, floats, integers). In addition, it tests some basic logic that the table data must adhere to. For example, the ‘top’ of a layer in layered data must be smaller than its ‘bottom’ and may not be negative, because GeoST defines layer tops and bottoms as depths that are positive downward, starting from 0. This ensures that GeoST functions always work on strictly defined, valid data, which leads to reproducible results.
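The depth logic for layers can be sketched in plain Python. This is only an illustration of the rule under the positive-downward convention, not GeoST's implementation: the function name `layer_is_valid` is hypothetical, and the sketch assumes that a top of exactly 0 is allowed, as in the examples further down this page.

```python
def layer_is_valid(top: float, bottom: float) -> bool:
    """Hypothetical helper: check the depth logic of a single layer.

    Depths are positive downward from 0, so the top may not be negative
    and the bottom must lie deeper (have a larger value) than the top.
    """
    return 0 <= top < bottom


print(layer_is_valid(0.0, 0.5))  # True: the bottom lies deeper than the top
print(layer_is_valid(1.1, 1.0))  # False: the top lies deeper than the bottom
```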

Validation settings#

A number of global settings control the validation behaviour:

| Setting | Description | Default |
| --- | --- | --- |
| SKIP | If True, validation will be skipped entirely | False |
| VERBOSE | If True, the details of validation errors will be printed to the console | True |
| DROP_INVALID | If True, invalid rows will automatically be dropped from (Geo)DataFrames | True |
| FLAG_INVALID | If True, invalid rows will be flagged in (Geo)DataFrames. Only works if DROP_INVALID is False | False |
| AUTO_ALIGN | If True, collection headers and data tables will automatically be aligned | True |
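How SKIP, DROP_INVALID and FLAG_INVALID relate to each other can be modelled with a small sketch. It is a simplified illustration of the behaviour described for the settings listed above, not GeoST's code: rows are plain dictionaries, the function `apply_validation_settings` and the `check` callback are hypothetical, and the `is_valid` key mirrors the column name that GeoST adds when flagging.

```python
def apply_validation_settings(
    rows, is_row_valid, skip=False, drop_invalid=True, flag_invalid=False
):
    """Hypothetical model of how the settings affect a table of rows."""
    if skip:
        # SKIP: validation does not run, rows are returned untouched
        return rows
    if drop_invalid:
        # DROP_INVALID: keep only the rows that pass the check
        return [row for row in rows if is_row_valid(row)]
    if flag_invalid:
        # FLAG_INVALID (only when not dropping): keep all rows, add a flag
        return [{**row, "is_valid": is_row_valid(row)} for row in rows]
    # Neither dropping nor flagging: invalid rows are kept as they are
    return rows


def check(row):
    """Hypothetical row check: top may not be negative and must be above bottom."""
    return 0 <= row["top"] < row["bottom"]


layers = [{"top": 0.0, "bottom": 0.5}, {"top": 1.1, "bottom": 1.0}]

dropped = apply_validation_settings(layers, check)
flagged = apply_validation_settings(
    layers, check, drop_invalid=False, flag_invalid=True
)
print(len(dropped))  # 1
print([row["is_valid"] for row in flagged])  # [True, False]
```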

You can access and manipulate these settings through the geost config module:

import geost

# E.g. turning off verbose validation warnings
geost.config.validation.VERBOSE = False
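Because these settings are global, a change made this way also applies to code that runs later. One possible way to limit a change to a block of code is a small context manager. The sketch below is not part of GeoST: `temporary_setting` is a hypothetical helper, and a `SimpleNamespace` stands in for `geost.config.validation` so that the example runs without GeoST installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Hypothetical helper: set an attribute and restore it afterwards."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Restore the old value, even if the block raised an exception
        setattr(settings, name, previous)


# Stand-in for geost.config.validation (assumption for illustration)
validation = SimpleNamespace(VERBOSE=True, DROP_INVALID=True)

with temporary_setting(validation, "DROP_INVALID", False):
    print(validation.DROP_INVALID)  # False inside the block

print(validation.DROP_INVALID)  # True again afterwards
```

With GeoST, the first argument would presumably be `geost.config.validation`, assuming its settings behave as plain attributes as in the example above.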

Examples#

In the examples below we create a dataframe with some layer data and intentionally introduce some problems to show what to expect from validation under the different settings.

import pandas as pd

# A dataframe that describes two layers correctly according to GeoST standards
df_correct = pd.DataFrame(
    {
        "nr": ["B-01", "B-01"],
        "x": [100, 100],
        "y": [200, 200],
        "surface": [0, 0],
        "end": [-1, -1],
        "top": [0, 0.5],
        "bottom": [0.5, 1],
        "lithoclass": ["K", "Z"],
    }
)

# Creating a Collection object from this table triggers validation.
# -> All good, no warnings!
collection = df_correct.gstda.to_collection()
print(collection)
BoreholeCollection:
# header = 1

Now we change the top of the second layer to 1.1. This is invalid because the bottom of this layer is 1, so the top would lie below the bottom. We therefore expect a ValidationWarning.

# Validation setting VERBOSE on to show warning details
geost.config.validation.VERBOSE = True

# Create an invalid layer in the example dataframe
df_invalid = df_correct.copy()
df_invalid.loc[1, "top"] = 1.1


# Creating a Collection object from the invalid dataframe
# -> triggers a ValidationWarning
collection = df_invalid.gstda.to_collection()

# Because the setting DROP_INVALID is turned on, the invalid layer is dropped
print(collection.data)
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
/home/runner/work/geost/geost/geost/validation/validate.py:47: ValidationWarning: 
Validation dropped 1 row(s) for schema 'Layer data non-inclined'.
Dropped indices: [1]

  warnings.warn(

In the above example you are warned about the validation failure (a ValidationWarning) because the VERBOSE setting is True. Also, because the DROP_INVALID setting is turned on, the warning reports that one row was dropped from the table. As you can see, collection.data now includes only the valid row.
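Since validation problems are reported as warnings rather than exceptions, they can be inspected programmatically with Python's standard `warnings` module. The sketch below shows the general pattern only: `ValidationWarning` is a stand-in class defined here, and `build_collection` is a hypothetical function that takes the place of a call such as `df_invalid.gstda.to_collection()`.

```python
import warnings


class ValidationWarning(UserWarning):
    """Stand-in for GeoST's warning class (assumption for illustration)."""


def build_collection():
    """Hypothetical stand-in for a call that triggers validation."""
    warnings.warn("Validation dropped 1 row(s).", ValidationWarning)
    return "collection"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, no deduplication
    result = build_collection()

# Keep only the warnings that stem from validation
validation_issues = [
    w for w in caught if issubclass(w.category, ValidationWarning)
]
if validation_issues:
    print(f"{len(validation_issues)} validation warning(s) raised")
```

The same pattern makes it possible to turn such a warning into a hard failure in a script or test, for instance by raising an exception when `validation_issues` is not empty.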

If you don’t want to drop rows automatically, turn ‘DROP_INVALID’ off. Do so at your own risk: GeoST functions may break when they receive invalid data.

geost.config.validation.DROP_INVALID = False

collection = df_invalid.gstda.to_collection()

# Layered data retains invalid layers, use at your own risk!
print(collection.data)
     nr    x    y  surface  end  top  bottom lithoclass
0  B-01  100  200        0   -1  0.0     0.5          K
1  B-01  100  200        0   -1  1.1     1.0          Z
/home/runner/work/geost/geost/geost/validation/validate.py:55: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z

  warnings.warn(

You may also turn on the setting ‘FLAG_INVALID’ to add a column that indicates whether a row passed validation. This setting only has an effect while ‘DROP_INVALID’ is False, as it is here.

geost.config.validation.FLAG_INVALID = True

collection = df_invalid.gstda.to_collection()

# Layered data retains invalid layers, but a column 'is_valid' is added to indicate validity
print(collection.data)
     nr    x    y  surface  end  top  bottom lithoclass  is_valid
0  B-01  100  200        0   -1  0.0     0.5          K      True
1  B-01  100  200        0   -1  1.1     1.0          Z     False
/home/runner/work/geost/geost/geost/validation/validate.py:55: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z

  warnings.warn(

In some cases you may want to skip validation entirely, for example when you would like to use functionality associated with a Collection but you know that validation will fail anyway and you do not want to update the data to make it pass. Validation can then be turned off by setting the ‘SKIP’ flag in the configuration to True, as shown below.

geost.config.validation.SKIP = True

# No warning is raised because validation is skipped
collection = df_invalid.gstda.to_collection()

# The data in the result is unchanged; the 'is_valid' column was already
# added to df_invalid by the previous example
collection.data
     nr    x    y  surface  end  top  bottom lithoclass  is_valid
0  B-01  100  200        0   -1  0.0     0.5          K      True
1  B-01  100  200        0   -1  1.1     1.0          Z     False

Advanced: manually validating data#

Normally there is no need to validate data yourself, because validation is called automatically when reading, parsing and manipulating data. Should the need arise, however, you can apply validation to a dataframe manually. This can be useful, for example, after a custom operation on a dataframe when you are unsure whether the result is still compatible with GeoST.

We recommend using the geost.validation.safe_validate function, as it is designed to raise warnings (instead of errors) and takes the global validation settings into account. You must also choose which predefined data schema to use; the available schemas are in the module geost.validation.schemas. Alternatively, you can directly use the Pandera DataFrameSchemas found in this module.

from geost.validation import safe_validate, schemas

# Validate the df_correct dataframe using the layer data schema
# -> no warning!
df_valid = safe_validate(df_correct, schemas.layerdata)
print(df_valid)

# Validate the df_invalid dataframe using the layer data schema
# -> warning, is_valid column added because FLAG_INVALID setting is on.
df_invalid_flagged = safe_validate(df_invalid, schemas.layerdata)
print(df_invalid_flagged)
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
1  B-01  100.0  200.0      0.0 -1.0  0.5     1.0          Z
     nr    x    y  surface  end  top  bottom lithoclass  is_valid
0  B-01  100  200        0   -1  0.0     0.5          K      True
1  B-01  100  200        0   -1  1.1     1.0          Z     False
/home/runner/work/geost/geost/geost/validation/validate.py:55: ValidationWarning: 
Validation failed for schema 'Layer data non-inclined'.
Details:
DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: <Check <lambda>> failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z, False

  warnings.warn(

We can also easily reset the config settings to their defaults. With the defaults restored, validation of the invalid DataFrame drops the invalid row again and does not add the “is_valid” column.

geost.config.validation.reset_settings()

# Create a new invalid DataFrame because the previous one was already modified
df_invalid = df_correct.copy()
df_invalid.loc[1, "top"] = 1.1

df_invalid = safe_validate(df_invalid, schemas.layerdata)
print(df_invalid)
     nr      x      y  surface  end  top  bottom lithoclass
0  B-01  100.0  200.0      0.0 -1.0  0.0     0.5          K
/home/runner/work/geost/geost/geost/validation/validate.py:47: ValidationWarning: 
Validation dropped 1 row(s) for schema 'Layer data non-inclined'.
Dropped indices: [1]

  warnings.warn(