{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data validation\n", "Validation is triggered automatically when header and data objects are created by reading\n", "data, or when header and data tables are (re)set (e.g. when making selections).\n", "\n", "Validation checks that the minimum required columns are present (see [Data structures](./data_structures.ipynb))\n", "and that they contain data of the correct datatype (e.g. strings, floats, integers).\n", "In addition, it tests some basic logic that the table data must adhere to. For example: the 'top'\n", "in layered data must be smaller than the 'bottom' of the layer and cannot be negative, as we define layer tops\n", "and bottoms as positive downward starting from 0 in GeoST. This ensures that GeoST functions\n", "always work on strictly defined, valid data, which leads to reproducible results.\n", "\n", "## Validation settings\n", "Several global settings control the validation behaviour:\n", "\n", "| Setting | Description | Default |\n", "| --------| ----------- | ------- |\n", "| VERBOSE | If True, the details of validation errors will be printed to the console | True |\n", "| DROP_INVALID | If True, invalid rows will automatically be dropped from (Geo)DataFrames | True |\n", "| FLAG_INVALID | If True, invalid rows will be flagged in (Geo)DataFrames. Only works if DROP_INVALID is False | False |\n", "| AUTO_ALIGN | If True, collection headers and data tables will automatically be aligned | True |\n", "\n", "\n", "You can access and change these settings through the geost config module:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from geost import config\n", "\n", "# E.g. 
turning off verbose validation warnings\n", "config.validation.VERBOSE = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "In the examples below we create a dataframe with some layer data and intentionally\n", "introduce some problems to show what to expect from the validation and the different settings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LayeredData instance:\n", " nr x y surface end top bottom lithoclass\n", "0 B-01 100.0 200.0 0.0 -1.0 0.0 0.5 K\n", "1 B-01 100.0 200.0 0.0 -1.0 0.5 1.0 Z\n" ] } ], "source": [ "import pandas as pd\n", "\n", "from geost.base import LayeredData\n", "\n", "# A dataframe that describes two layers correctly according to GeoST standards\n", "df_correct = pd.DataFrame(\n", "    {\n", "        \"nr\": [\"B-01\", \"B-01\"],\n", "        \"x\": [100, 100],\n", "        \"y\": [200, 200],\n", "        \"surface\": [0, 0],\n", "        \"end\": [-1, -1],\n", "        \"top\": [0, 0.5],\n", "        \"bottom\": [0.5, 1],\n", "        \"lithoclass\": [\"K\", \"Z\"],\n", "    }\n", ")\n", "\n", "# Creating a LayeredData object from this table triggers validation.\n", "# -> All good, no warnings!\n", "layered_data = LayeredData(df_correct)\n", "\n", "print(layered_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we change the top of the second layer to 1.1. This is invalid, because the bottom\n", "of this layer is 1. 
We therefore expect a `ValidationWarning`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LayeredData instance:\n", " nr x y surface end top bottom lithoclass\n", "0 B-01 100.0 200.0 0.0 -1.0 0.0 0.5 K\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\onselen\\Development\\geost\\geost\\validation\\validate.py:46: ValidationWarning: \n", "Validation dropped 1 row(s) for schema 'Layer data non-inclined'.\n", "Dropped indices: [1]\n", "\n", " warnings.warn(\n" ] } ], "source": [ "# Create an invalid layer in the example dataframe\n", "df_invalid = df_correct.copy()\n", "df_invalid.loc[1, \"top\"] = 1.1\n", "\n", "# Turn the VERBOSE setting on to show warning details\n", "config.validation.VERBOSE = True\n", "\n", "# Creating a LayeredData object from the invalid dataframe\n", "# -> triggers a ValidationWarning\n", "layered_data = LayeredData(df_invalid)\n", "\n", "# Because the setting DROP_INVALID is turned on, the invalid layer is dropped\n", "print(layered_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above example, a `ValidationWarning` tells you that validation failed. Because the setting DROP_INVALID\n", "is turned on, the message reports that one row was dropped from the table. As you can see,\n", "`layered_data` now contains only the valid row.\n", "\n", "If you don't want to drop rows automatically, turn 'DROP_INVALID' off. Use this at your\n", "own risk, as GeoST functions may break on invalid data. You may also turn on the setting\n", "'FLAG_INVALID' to add an 'is_valid' column that indicates whether each row passed validation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LayeredData instance:\n", " nr x y surface end top bottom lithoclass\n", "0 B-01 100 200 0 -1 0.0 0.5 K\n", "1 B-01 100 200 0 -1 1.1 1.0 Z\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\onselen\\Development\\geost\\geost\\validation\\validate.py:54: ValidationWarning: \n", "Validation failed for schema 'Layer data non-inclined'.\n", "Details:\n", "DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: > failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z\n", "\n", " warnings.warn(\n" ] } ], "source": [ "config.validation.DROP_INVALID = False\n", "\n", "layered_data = LayeredData(df_invalid)\n", "\n", "# Layered data retains invalid layers, use at your own risk!\n", "print(layered_data)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LayeredData instance:\n", " nr x y surface end top bottom lithoclass is_valid\n", "0 B-01 100 200 0 -1 0.0 0.5 K True\n", "1 B-01 100 200 0 -1 1.1 1.0 Z False\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\onselen\\Development\\geost\\geost\\validation\\validate.py:54: ValidationWarning: \n", "Validation failed for schema 'Layer data non-inclined'.\n", "Details:\n", "DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: > failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z\n", "\n", " warnings.warn(\n" ] } ], "source": [ "config.validation.FLAG_INVALID = True\n", "\n", "layered_data = LayeredData(df_invalid)\n", "\n", "# Layered data retains invalid layers, but a column 'is_valid' is added to indicate validity\n", "print(layered_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced: manually validating data\n", "In principle, there is no need to worry about validating 
data, as validation is called\n", "automatically when reading, parsing and manipulating data. However, should the need arise,\n", "you can apply validation to a dataframe manually.\n", "This can be useful, for example, after a custom operation on a dataframe when you\n", "are unsure whether the manipulated dataframe is still compatible with GeoST.\n", "\n", "We recommend using the [`geost.validation.safe_validate`](../api_reference/generated/geost.validation.safe_validate.rst) function, as it is\n", "designed to raise warnings (instead of errors) and takes the global validation\n", "settings into account. You must also choose which predefined data schema to use. You can find\n", "the available schemas in the module `geost.validation.schemas`. Alternatively, you can\n", "directly use the [`Pandera DataFrameSchemas`](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html)\n", "found in this module.\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " nr x y surface end top bottom lithoclass\n", "0 B-01 100.0 200.0 0.0 -1.0 0.0 0.5 K\n", "1 B-01 100.0 200.0 0.0 -1.0 0.5 1.0 Z\n", " nr x y surface end top bottom lithoclass is_valid\n", "0 B-01 100 200 0 -1 0.0 0.5 K True\n", "1 B-01 100 200 0 -1 1.1 1.0 Z False\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\onselen\\Development\\geost\\geost\\validation\\validate.py:54: ValidationWarning: \n", "Validation failed for schema 'Layer data non-inclined'.\n", "Details:\n", "DataFrameSchema 'Layer data non-inclined' failed element-wise validator number 0: > failure cases: B-01, 100.0, 200.0, 0.0, -1.0, 1.1, 1.0, Z, False\n", "\n", " warnings.warn(\n" ] } ], "source": [ "from geost.validation import safe_validate, schemas\n", "\n", "# Validate the df_correct dataframe using the layer data schema\n", "# -> no warning!\n", "df_valid = 
safe_validate(schemas.layerdata, df_correct)\n", "print(df_valid)\n", "\n", "# Validate the df_invalid dataframe using the layer data schema\n", "# -> warning; the is_valid column is added because DROP_INVALID is off and FLAG_INVALID is on.\n", "df_invalid_flagged = safe_validate(schemas.layerdata, df_invalid)\n", "print(df_invalid_flagged)" ] } ], "metadata": { "kernelspec": { "display_name": "default", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 2 }