Data validation#
Validation is triggered automatically when you read data with the built-in readers
and whenever a Collection is created or its header/data is modified. In addition, you can manually invoke
validation on a pandas.DataFrame through the gst accessor.
Validation checks that the minimum required columns are present (see Positional columns)
and that they contain data of the correct datatype (e.g. strings, floats, integers).
It also tests some basic logic that the table data must adhere to. For example, the ‘top’
of a layer in layered data must be less than the ‘bottom’ of that layer and may not be negative, because GeoST defines layer tops
and bottoms as positive downward starting from 0. This ensures that GeoST functions
always work on strictly defined, valid data, which leads to reproducible results.
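To make the layer rule concrete, here is a minimal plain-Python sketch of the check described above. It is not GeoST's implementation: the function name `find_invalid_layers` and the list-of-dicts layout are assumptions chosen for illustration only.

```python
# Illustrative sketch of the layer-order rule; not GeoST's actual implementation.
def find_invalid_layers(layers):
    """Return the positions of layers that break the rule 0 <= top < bottom.

    Depths are positive downward, so the top of a layer must be a smaller
    number than its bottom.
    """
    invalid = []
    for position, layer in enumerate(layers):
        top, bottom = layer["top"], layer["bottom"]
        if top < 0 or top >= bottom:
            invalid.append(position)
    return invalid


layers = [
    {"nr": "A", "top": 0.0, "bottom": 0.5},
    {"nr": "A", "top": 1.1, "bottom": 1.0},  # top lies below bottom: invalid
    {"nr": "B", "top": 0.0, "bottom": 0.5},
]
invalid_positions = find_invalid_layers(layers)
print(invalid_positions)
```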
Validation settings#
There are a number of global settings that can be set to control the behaviour:
| Setting | Description | Default |
|---|---|---|
| SKIP | If True, validation will be skipped entirely | False |
| VERBOSE | If True, the details of validation errors will be printed to the console | True |
| DROP_INVALID | If True, invalid rows will automatically be dropped from (Geo)DataFrames | True |
| FLAG_INVALID | If True, invalid rows will be flagged in (Geo)DataFrames. Only works if DROP_INVALID is False | False |
| AUTO_ALIGN | If True, collection headers and data tables will automatically be aligned | True |
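The way SKIP, DROP_INVALID and FLAG_INVALID interact can be summarised as a small decision function. This is a sketch derived from the table above, not code taken from GeoST; the function name `resolve_action` and the returned labels are hypothetical.

```python
# Sketch of the precedence implied by the settings table; names are hypothetical.
def resolve_action(skip=False, drop_invalid=True, flag_invalid=False):
    """Return which action the combination of settings leads to."""
    if skip:
        return "skip"      # validation does not run at all
    if drop_invalid:
        return "drop"      # takes precedence over FLAG_INVALID
    if flag_invalid:
        return "flag"      # only effective when DROP_INVALID is False
    return "retain"        # invalid rows are kept without any marker


default_action = resolve_action()
print(default_action)
```

With the default values this yields `"drop"`, which matches the defaults listed in the table.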
You can access and manipulate these settings through the geost config module:
import geost
# E.g. turning off verbose validation warnings
geost.config.validation.VERBOSE = False
# Reset validation settings to their default values
geost.config.validation.reset_settings()
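Because these settings are global, a change stays in effect until it is reverted. One possible pattern is a small context manager that restores the previous value afterwards. The helper `temporary_setting` below is hypothetical and not part of GeoST. The sketch uses a `SimpleNamespace` stand-in so that it runs without GeoST installed, and it assumes the real settings object supports plain attribute access, as the assignment above suggests.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set settings.<name> to value and restore the old value on exit."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Restore the old value even if the body raised an exception
        setattr(settings, name, previous)


# Stand-in for geost.config.validation (assumption, for illustration only)
demo_settings = SimpleNamespace(VERBOSE=True, DROP_INVALID=True)

with temporary_setting(demo_settings, "VERBOSE", False):
    inside = demo_settings.VERBOSE  # overridden while the block runs
after = demo_settings.VERBOSE       # restored once the block has finished
```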
Examples#
In the example below we create a DataFrame with two boreholes (A and B) that each have two layers.
import pandas as pd
# A dataframe that describes two boreholes correctly according to GeoST standards
df_correct = pd.DataFrame(
    {
        "nr": ["A", "A", "B", "B"],
        "x": [100, 100, 150, 150],
        "y": [200, 200, 250, 250],
        "surface": [1, 1, 0, 0],
        "end": [0, 0, -1, -1],
        "top": [0, 0.5, 0, 0.5],
        "bottom": [0.5, 1, 0.5, 1],
        "lithoclass": ["K", "Z", "L", "V"],
    }
)
Validation can be invoked in the following ways:
- Calling the gst accessor’s validate method (GeostFrame.validate).
- Creating a collection (using e.g. GeostFrame.to_collection).
- (Re)assigning the header and data attributes of a Collection.
# Invoking validation using GeostFrame.validate
validated_df = df_correct.gst.validate()
# Invoking validation by creating a Collection object from a DataFrame using the gst accessor
collection = df_correct.gst.to_collection()
# Assigning the data or header attrs of a Collection object also triggers validation
collection.data = df_correct
Now we change the top of the second layer of borehole A to 1.1. This is invalid, because the bottom
of this layer is 1 and the top must be less than the bottom. We therefore expect a ValidationWarning.
# Create an invalid layer in the example dataframe
df_invalid = df_correct.copy()
df_invalid.loc[1, "top"] = 1.1
# Creating a Collection object from the invalid dataframe triggers validation.
# -> A warning is raised
collection = df_invalid.gst.to_collection()
✅ Invalid surveys were dropped from the DataFrame because geost.config.validation.DROP_INVALID=True
📖 See the user guide section on validation for advanced handling of validation issues: https://deltares-research.github.io/geost/user_guide/validation.html
NOTE: Header has been reset to align with data because AUTO_ALIGN is enabled in the GeoST configuration.
/home/runner/work/geost/geost/geost/validation/validate.py:77: ValidationWarning:
============================================================
⚠️ VALIDATION ISSUE (1/1)
============================================================
Column : 'top, bottom'
Message : Column 'top' must be less than 'bottom', but some rows violate this condition.
# surveys : 1
# rows : 1
============================================================
warnings.warn(
/home/runner/work/geost/geost/geost/base.py:473: AlignmentWarning: Header covers more/other objects than present in the data table. consider running the method 'reset_header' to update the header.
warnings.warn(
In this example we receive a validation warning for the ‘top’ and ‘bottom’ columns, together with the action that was taken based on the validation config settings. In this case, survey A was dropped because it contains an invalid layer, and the header was reset to represent the remaining surveys.
print(f"Header:\n{collection.header}\n")
print(f"Data:\n{collection.data}\n")
Header:
nr x y surface geometry
0 B 150 250 0 POINT (150 250)
Data:
nr x y surface end top bottom lithoclass
2 B 150 250 0 -1 0.0 0.5 L
3 B 150 250 0 -1 0.5 1.0 V
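Note that the whole survey A disappears, not only its invalid layer. A plain-Python sketch of that "drop the entire survey" behaviour could look as follows. The helper `drop_invalid_surveys` is hypothetical and only mirrors the outcome shown above; it does not reproduce how GeoST implements it.

```python
# Sketch of survey-level dropping; not GeoST's actual implementation.
def drop_invalid_surveys(rows):
    """Remove every row of a survey that has at least one invalid layer."""
    invalid_nrs = {row["nr"] for row in rows if not row["top"] < row["bottom"]}
    kept = [row for row in rows if row["nr"] not in invalid_nrs]
    return kept, sorted(invalid_nrs)


rows = [
    {"nr": "A", "top": 0.0, "bottom": 0.5},
    {"nr": "A", "top": 1.1, "bottom": 1.0},  # invalid layer
    {"nr": "B", "top": 0.0, "bottom": 0.5},
    {"nr": "B", "top": 0.5, "bottom": 1.0},
]
kept_rows, dropped_nrs = drop_invalid_surveys(rows)
print(dropped_nrs)
```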
If you don’t want to drop an entire survey when one or several of its data rows do not validate, you can choose to turn off the DROP_INVALID option at your own risk. Some collection methods may yield unexpected results if validation problems are ignored.
geost.config.validation.DROP_INVALID = False
validated_df = df_invalid.gst.validate()
# Layered data retains invalid layers, use at your own risk!
validated_df
❌ Invalid surveys were retained in the DataFrame because geost.config.validation.FLAG_INVALID and DROP_INVALID are False
📖 See the user guide section on validation for advanced handling of validation issues: https://deltares-research.github.io/geost/user_guide/validation.html
/home/runner/work/geost/geost/geost/validation/validate.py:77: ValidationWarning:
============================================================
⚠️ VALIDATION ISSUE (1/1)
============================================================
Column : 'top, bottom'
Message : Column 'top' must be less than 'bottom', but some rows violate this condition.
# surveys : 1
# rows : 1
============================================================
warnings.warn(
|  | nr | x | y | surface | end | top | bottom | lithoclass |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 200 | 1 | 0 | 0.0 | 0.5 | K |
| 1 | A | 100 | 200 | 1 | 0 | 1.1 | 1.0 | Z |
| 2 | B | 150 | 250 | 0 | -1 | 0.0 | 0.5 | L |
| 3 | B | 150 | 250 | 0 | -1 | 0.5 | 1.0 | V |
You may also turn on the setting ‘FLAG_INVALID’ to add a column that indicates whether a
row passed validation or not. This only has an effect while DROP_INVALID is False, as set above. It adds the boolean is_valid column to the DataFrame:
geost.config.validation.FLAG_INVALID = True
flagged_df = df_invalid.gst.validate()
# Layered data retains invalid layers, but a column 'is_valid' is added to indicate validity
flagged_df
✅ Invalid rows were flagged with an 'is_valid' column because geost.config.validation.FLAG_INVALID=True and geost.config.validation.DROP_INVALID=False
📖 See the user guide section on validation for advanced handling of validation issues: https://deltares-research.github.io/geost/user_guide/validation.html
/home/runner/work/geost/geost/geost/validation/validate.py:77: ValidationWarning:
============================================================
⚠️ VALIDATION ISSUE (1/1)
============================================================
Column : 'top, bottom'
Message : Column 'top' must be less than 'bottom', but some rows violate this condition.
# surveys : 1
# rows : 1
============================================================
warnings.warn(
|  | nr | x | y | surface | end | top | bottom | lithoclass | is_valid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 200 | 1 | 0 | 0.0 | 0.5 | K | True |
| 1 | A | 100 | 200 | 1 | 0 | 1.1 | 1.0 | Z | False |
| 2 | B | 150 | 250 | 0 | -1 | 0.0 | 0.5 | L | True |
| 3 | B | 150 | 250 | 0 | -1 | 0.5 | 1.0 | V | True |
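In contrast to dropping, flagging keeps every row and only marks the ones that fail. The sketch below shows the idea in plain Python, under the assumption that the top/bottom rule is the only check. The helper `flag_invalid_rows` is hypothetical and is not how GeoST computes the column.

```python
# Sketch of row-level flagging; not GeoST's actual implementation.
def flag_invalid_rows(rows):
    """Return copies of the rows with an added boolean 'is_valid' key."""
    return [{**row, "is_valid": row["top"] < row["bottom"]} for row in rows]


rows = [
    {"nr": "A", "top": 0.0, "bottom": 0.5},
    {"nr": "A", "top": 1.1, "bottom": 1.0},  # invalid layer
    {"nr": "B", "top": 0.0, "bottom": 0.5},
    {"nr": "B", "top": 0.5, "bottom": 1.0},
]
flagged_rows = flag_invalid_rows(rows)

# The flag makes it easy to select only the problematic rows
invalid_only = [row for row in flagged_rows if not row["is_valid"]]
```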
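Flagging invalid rows can be especially useful if you want to handle validation problems with custom logic. Note that the is_valid column reflects the state at the time of validation, so it still reads False after the manual fix below.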
# Set top of the invalid layer back to 0.5, such that the layer becomes valid again.
flagged_df.loc[~flagged_df["is_valid"], "top"] = 0.5
flagged_df
|  | nr | x | y | surface | end | top | bottom | lithoclass | is_valid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 200 | 1 | 0 | 0.0 | 0.5 | K | True |
| 1 | A | 100 | 200 | 1 | 0 | 0.5 | 1.0 | Z | False |
| 2 | B | 150 | 250 | 0 | -1 | 0.0 | 0.5 | L | True |
| 3 | B | 150 | 250 | 0 | -1 | 0.5 | 1.0 | V | True |
In some cases you may want to skip validation entirely, for example when you want to use functionality associated with a Collection but you know that validation will fail and you do not want to update the data to make it pass. Validation can then be turned off by setting the ‘SKIP’ flag in the configuration to True, as shown below. Use at your own risk!
geost.config.validation.SKIP = True
# Will not raise any warning, because validation is skipped
collection = df_invalid.gst.to_collection()
collection.data  # The data in the result will be unchanged
|  | nr | x | y | surface | end | top | bottom | lithoclass |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 200 | 1 | 0 | 0.0 | 0.5 | K |
| 1 | A | 100 | 200 | 1 | 0 | 1.1 | 1.0 | Z |
| 2 | B | 150 | 250 | 0 | -1 | 0.0 | 0.5 | L |
| 3 | B | 150 | 250 | 0 | -1 | 0.5 | 1.0 | V |
Advanced: The ValidationResult object#
The ValidationResult object contains information on validation problems of rows and surveys and has methods to display or handle validation problems. You can get access to a ValidationResult as follows:
# Validate df_invalid and also return the validation result details
validated_df, validation_result = df_invalid.gst.validate(return_result=True)
✅ Invalid rows were flagged with an 'is_valid' column because geost.config.validation.FLAG_INVALID=True and geost.config.validation.DROP_INVALID=False
📖 See the user guide section on validation for advanced handling of validation issues: https://deltares-research.github.io/geost/user_guide/validation.html
/home/runner/work/geost/geost/geost/validation/validate.py:77: ValidationWarning:
============================================================
⚠️ VALIDATION ISSUE (1/1)
============================================================
Column : 'top, bottom'
Message : Column 'top' must be less than 'bottom', but some rows violate this condition.
# surveys : 1
# rows : 1
============================================================
warnings.warn(
The validation result looks like this:
validation_result
ValidationResult(num_issues=1)
The validation result has some useful properties and methods that can be used to further troubleshoot and fix validation problems:
# Check if there are any validation errors
validation_result.has_errors
True
# Check which surveys are affected by validation errors
validation_result.error_nrs
['A']
# The Pandas Index of the invalid layers...
validation_result.error_indices
# ... can be used to get only the invalid layers
invalid_layers = validated_df.loc[validation_result.error_indices]
invalid_layers
|  | nr | x | y | surface | end | top | bottom | lithoclass | is_valid |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 100 | 200 | 1 | 0 | 1.1 | 1.0 | Z | False |
# Display warnings
validation_result.display_warnings()
/home/runner/work/geost/geost/geost/validation/validate.py:77: ValidationWarning:
============================================================
⚠️ VALIDATION ISSUE (1/1)
============================================================
Column : 'top, bottom'
Message : Column 'top' must be less than 'bottom', but some rows violate this condition.
# surveys : 1
# rows : 1
============================================================
warnings.warn(
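The output above shows that validation issues are emitted through Python's standard `warnings` machinery. In scripts it can therefore be useful to capture such warnings, or to escalate them to exceptions, using the standard library. The sketch below uses a stand-in class `DemoValidationWarning` and a stand-in function `validate_demo`, because the import path of GeoST's ValidationWarning is not shown here. Both names are hypothetical.

```python
import warnings


class DemoValidationWarning(UserWarning):
    """Stand-in for a validation warning class (hypothetical)."""


def validate_demo():
    """Stand-in for a call that emits a validation warning."""
    warnings.warn("Column 'top' must be less than 'bottom'", DemoValidationWarning)


# Record warnings in a list instead of printing them to the console
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_demo()
messages = [
    str(w.message) for w in caught if issubclass(w.category, DemoValidationWarning)
]

# Escalate the warning to an exception, e.g. in an automated pipeline
with warnings.catch_warnings():
    warnings.simplefilter("error", DemoValidationWarning)
    try:
        validate_demo()
        escalated = False
    except DemoValidationWarning:
        escalated = True
```

Both filters are scoped to their `with` block, so the global warning configuration is left untouched.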
# Use validation result to fix problems in a dataframe
validation_result.handle_errors(df_invalid)
✅ Invalid rows were flagged with an 'is_valid' column because geost.config.validation.FLAG_INVALID=True and geost.config.validation.DROP_INVALID=False
📖 See the user guide section on validation for advanced handling of validation issues: https://deltares-research.github.io/geost/user_guide/validation.html
|  | nr | x | y | surface | end | top | bottom | lithoclass | is_valid |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 100 | 200 | 1 | 0 | 0.0 | 0.5 | K | True |
| 1 | A | 100 | 200 | 1 | 0 | 1.1 | 1.0 | Z | False |
| 2 | B | 150 | 250 | 0 | -1 | 0.0 | 0.5 | L | True |
| 3 | B | 150 | 250 | 0 | -1 | 0.5 | 1.0 | V | True |