{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data structures\n", "GeoST uses standardized internal data structures and data validation to ensure that the\n", "functionality that GeoST offers can always reliably be applied. This user guide section \n", "dives deeper into GeoST data structures.\n", "\n", "## Collection objects\n", "As shown in the first [introduction](../getting_started/introduction.ipynb#concept) to GeoST,\n", "data is held in so-called `Collection` objects, the core objects of GeoST, which contain header\n", "and data tables. Basically, the two can be described as:\n", "\n", "* the *header table* describes metadata and spatial information.\n", "* the *data table* contains the logged data.\n", "\n", "The header and data tables have a one-to-many relationship: one survey (e.g. borehole) is\n", "one row in the header and multiple rows in the data. \n", "\n", "Typically available types of subsurface data comprise point-like data such as boreholes,\n", "CPTs and well logs, and line-like data such as seismics, GPR and EM. Each data source is\n", "held in its own type of Collection object. For example, borehole data is held in a\n", "[`BoreholeCollection`](../api_reference/borehole_collection.rst) and CPT data in a\n", "[`CptCollection`](../api_reference/cpt_collection.rst) (see figure below). \n", "\n", "

[Figure: GeoST Collection objects]

\n", "\n", "When working with a Collection, selections may alter the header and data tables, but\n", "Collections automatically maintain alignment between the two. Therefore, users can safely\n", "make selections and analyse the data while being sure of consistency. It is recommended to\n", "work with collections by default, unless you specifically only need to work with the header\n", "or data table. By default, read functions for different types of data return a collection\n", "(see: [Reading data](./reading_data.ipynb)). For example, reading the sample borehole data\n", "available in GeoST shows that the resulting object is a BoreholeCollection. Additionally, we\n", "show that a Collection also contains horizontal and vertical spatial references." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geost\n", "\n", "# Load the Utrecht Science Park example borehole data\n", "boreholes_collection = geost.data.boreholes_usp()\n", "\n", "# boreholes_collection is an instance of BoreholeCollection and contains 67 boreholes\n", "print(boreholes_collection)\n", "\n", "# Print data types of header and data attributes\n", "print(f\"Data type header: {type(boreholes_collection.header)}\")\n", "print(f\"Data type data: {type(boreholes_collection.data)}\")\n", "\n", "# Print the horizontal and vertical reference systems\n", "print(boreholes_collection.horizontal_reference)\n", "print(boreholes_collection.vertical_reference)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Header table\n", "A header table is a GeoPandas [`GeoDataFrame`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html) instance and holds spatial information, in a \"geometry\" column, and\n", "metadata such as the surface level and end depth. The geometry column of the header\n", "contains point geometries in case of boreholes and CPTs, and linestring geometries for, for instance,\n", "seismic data. 
Each entry (row in the GeoDataFrame) corresponds to one specific survey,\n", "e.g. one borehole or one seismic line. \n", "\n", "A header table requires a bare minimum of data columns to be present to ensure that all\n", "built-in methods of a Collection can be used:\n", "\n", "| Column name | Validation criteria | Description |\n", "| ----------- | ------------------- | ----------- |\n", "| nr | Must be interpretable as string | Identification name/number/code of the point survey |\n", "| x | Must be of numeric type (int or float) | X-coordinate |\n", "| y | Must be of numeric type (int or float) | Y-coordinate |\n", "| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m +NAP |\n", "| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m +NAP |\n", "| geometry | `shapely.geometry.Point` in case of point data | Geometry object of the survey location |\n", "\n", "The header is not limited to just these columns. Any number of columns can be added to give\n", "additional information on surveys. Some analysis methods may add information to the header. For instance, the method [`BoreholeCollection.get_area_labels`](../api_reference/generated/geost.base.BoreholeCollection.get_area_labels.rst) has an argument `include_in_header` which, if set\n", "to `True`, adds a column with results to the header GeoDataFrame. Otherwise, it returns a separate DataFrame.\n", "\n", "If you're only interested in survey locations and/or metadata, it is advised to work\n", "directly with the header object to avoid the additional overhead of a parent collection \n", "object (the overhead is caused by checks of the header against the data after every operation to \n", "ensure header/data alignment). 
Read functions for point and line data (see: [Reading data](./reading_data.ipynb)) return a corresponding collection object by default, but you can assign\n", "only the header to a variable in order to continue with just the header data. See the example below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the Utrecht Science Park example borehole data and only assign the header data.\n", "boreholes_header = geost.data.boreholes_usp().header\n", "\n", "# Print the first rows of the header data.\n", "boreholes_header.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data table\n", "A data table is a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) instance and holds all the logged data of any survey. In GeoST we mainly distinguish between **\"layered\"** and **\"discrete\"** data:\n", "\n", "* *Layered* data contains data that is logged in terms of layers (i.e. depth intervals over which properties are the same) with **\"top\"** and **\"bottom\"** information for each layer.\n", "* *Discrete* data contains data that is logged over discrete intervals (e.g. every 20 cm) with **\"depth\"** information for each measurement.\n", "\n", "One point or line survey (i.e. one row in the header) can be associated with multiple rows of data. For example, a single borehole with ten described layers is represented by one row in the header GeoDataFrame and ten rows in the data DataFrame. \n", "\n", "Just like the header, a data table also requires a bare minimum of columns to be present to ensure\n", "that all built-in methods of a Collection can be applied. 
In case of \"layered\" data:\n", "\n", "| Column name | Validation criteria | Description |\n", "| ----------- | ------------------- | ----------- |\n", "| nr | Must be interpretable as string | Identification name/number/code of the point survey |\n", "| x | Must be of numeric type (int or float) | X-coordinate |\n", "| y | Must be of numeric type (int or float) | Y-coordinate |\n", "| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |\n", "| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |\n", "| top | Must be of numeric type (int or float); starts at 0; is increasing | Depth of layer top. The first layer always starts at 0 and depth increases downwards |\n", "| bottom | Must be of numeric type (int or float); is larger than top; is increasing | Depth of layer bottom |\n", "\n", "If the table contains inclined data, such as boreholes taken at an angle, the x,y-coordinates of the top of a layer are not exactly the same as those of its bottom. In that case, the columns below must additionally be present:\n", "\n", "| Column name | Validation criteria | Description |\n", "| ----------- | ------------------- | ----------- |\n", "| x_bot | Must be of numeric type (int or float) | X-coordinate of layer bottom (only required if survey does not point straight down) |\n", "| y_bot | Must be of numeric type (int or float) | Y-coordinate of layer bottom (only required if survey does not point straight down) |\n", "\n", "In case the data table holds \"discrete\" data the columns below must be present to ensure that all built-in methods work. 
Note that the only difference is the \"depth\" column instead of the \"top\" and \"bottom\" columns.\n", "\n", "| Column name | Validation criteria | Description |\n", "| ----------- | ------------------- | ----------- |\n", "| nr | Must be interpretable as string | Identification name/number/code of the point survey |\n", "| x | Must be of numeric type (int or float) | X-coordinate |\n", "| y | Must be of numeric type (int or float) | Y-coordinate |\n", "| surface | Must be of numeric type (int or float) and higher than end depth | Surface elevation of the point survey in m |\n", "| end | Must be of numeric type (int or float) and lower than surface elevation | End depth of the point survey in m |\n", "| depth | Must be of numeric type (int or float); is increasing | Depth where the measurement was taken |\n", "\n", "The data table is also not limited to the columns above: all additional columns contain the actual data with measurements for each layer or at each depth.\n", "\n", "If you're only interested in the measurements and don't need to work with geometries or\n", "any other additional header data, it is advised to work directly with the data table to \n", "avoid the additional overhead of a Collection object (the overhead is caused by \n", "checks of the header against the data after every operation to ensure header/data alignment). \n", "The different read functions for data (see: [Reading data](./reading_data.ipynb))\n", "return a corresponding collection object by default, but you can assign only the data table, a Pandas `DataFrame`, to a variable to continue with just the data. See the example below. Some\n", "read functions, such as [`read_borehole_table`](../api_reference/generated/geost.read_borehole_table.rst), provide the argument `as_collection`, which defaults to True but can be set to False to\n", "return only the data table."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the Utrecht Science Park example borehole data and only assign the data.\n", "boreholes_data = geost.data.boreholes_usp().data\n", "\n", "# Print the first few rows of boreholes data.\n", "boreholes_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GeoST Accessors\n", "When you only need to work with one of the header or data tables, all the methods\n", "available in Collections are also available to the header [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) and data [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) tables. This is achieved through so-called \"accessors\". Under the hood, every Collection method also uses these accessors. Therefore, some methods specifically operate on the header table and others on the data table. The Collection then resolves the alignment between the two afterwards.\n", "\n", "For the header table and associated header methods, the [`.gsthd`](../api_reference/header_accessors.rst) accessor is available and for the data table, the [`.gstda`](../api_reference/data_accessors.rst) accessor is available. Below we briefly demonstrate how these work by comparing the usage of Collection methods with those from the accessors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create separate `collection`, `header` and `data` variables for the demonstration.\n", "collection = geost.data.boreholes_usp()\n", "header = collection.header\n", "data = collection.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first compare the usage of [`select_within_bbox`](../api_reference/generated/geost.base.BoreholeCollection.select_within_bbox.rst), a method that operates on the header table. As the name suggests, it selects the surveys that are located within a specific bounding box extent. 
All we need to do to call this method on the \"header\" GeoDataFrame is to put the [`.gsthd`](../api_reference/header_accessors.rst) accessor in between, as shown below. After selecting from the header table, the [`.gsthd`](../api_reference/header_accessors.rst) accessor remains available on the selection result, for example for making further selections or chaining selections." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collection_select = collection.select_within_bbox(139_500, 455_000, 140_000, 455_500)\n", "header_select = header.gsthd.select_within_bbox(139_500, 455_000, 140_000, 455_500)\n", "\n", "print(collection_select) # Selection result is a BoreholeCollection\n", "print(type(header_select)) # Selection result is a GeoDataFrame\n", "\n", "header_select.gsthd # Selection result also has the gsthd accessor and methods available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data accessor works in exactly the same way. We demonstrate this by comparing the [`slice_by_values`](../api_reference/generated/geost.base.BoreholeCollection.slice_by_values.rst) method, which operates on the data table. Just like with the header, all we need to do to call this method on the \"data\" DataFrame is to put the [`.gstda`](../api_reference/data_accessors.rst) accessor in between, as shown below. After selecting from the data table, the [`.gstda`](../api_reference/data_accessors.rst) accessor remains available, for example for making further selections or chaining selections."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select boreholes which contain sand anywhere as the main lithology.\n", "collection_select = collection.slice_by_values(\"lith\", \"Z\")\n", "data_select = data.gstda.slice_by_values(\"lith\", \"Z\")\n", "\n", "print(collection_select) # Selection result is a BoreholeCollection\n", "print(type(data_select)) # Selection result is a DataFrame\n", "\n", "data_select.gstda # Selection result also has the gstda accessor and methods available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every header and data method can be accessed through these accessors, as shown in the\n", "examples above. Please see the API Reference for the methods available through the [`.gsthd`](../api_reference/header_accessors.rst) and [`.gstda`](../api_reference/data_accessors.rst) accessors. For more detailed information on how the GeoST accessors work and how to use them, please see\n", "the [GeoST accessors](./accessors.ipynb) page in this User guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model data\n", "GeoST also supports working with model data and offers methods to combine these data with\n", "point and line data. Model data does not follow the same header/data approach as point\n", "and line data. Instead there are generic model classes, of which some have an\n", "implementation that adds specific functionality for that model. An example of this is\n", "the [`VoxelModel`](../api_reference/voxelmodel.rst) as a generic model class and [`GeoTOP`](../api_reference/bro_geotop.rst)\n", "being a specific implementation of a voxel model. 
GeoST currently supports the following \n", "generic models and implementations:\n", "\n", "**Generic models and implementations**\n", "* *[`VoxelModel`](../api_reference/voxelmodel.rst)*: Class for voxel models, with data \n", "stored in the `ds` attribute, an [`Xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).\n", " * Implementations: [`GeoTOP`](../api_reference/bro_geotop.rst)\n", "* *`LayerModel`*: Class for layer models, not yet implemented\n", " * Implementations: None\n", "\n", "

[Figure: GeoST model classes]

\n", "\n", "### Voxel models\n", "The [`VoxelModel`](../api_reference/voxelmodel.rst) class stores data in the `ds` \n", "attribute, which is an [`Xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).\n", "A custom voxel model can be instantiated from a NetCDF file. For this, see the documentation of the \n", "[`VoxelModel.from_netcdf`](../api_reference/generated/geost.models.VoxelModel.from_netcdf.rst) class constructor.\n", "An instance of [`VoxelModel`](../api_reference/voxelmodel.rst) offers basic methods for \n", "selecting, slicing and exporting models.\n", "\n", "For more guidance on using a Voxel model within GeoST, see the [BRO GeoTOP](../user_guide/bro_geotop.ipynb)\n", "section in the user guide.\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "default", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 2 }