{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data structures\n", "GeoST uses standardized internal data structures and data validation to ensure that the\n", "functionality GeoST offers can always be applied reliably. Before we go into the relevant data structures we introduce some terminology. Subsurface data comes in many different forms (e.g. boreholes, CPTs or 3D models) but we distinguish between two main types:\n", "\n", "- **Survey**: the actual measurements (i.e. raw data) of the subsurface. These can comprise\n", "boreholes, CPTs, well logs, seismic or EM lines and others.\n", "- **Model**: for example a 3D voxel or layer model or a geological map. These are the result\n", "of analyses using **survey** data (e.g. Kriging interpolation) and as such are considered to\n", "be an interpretation of the subsurface.\n", "\n", "GeoST provides different data structures to handle these different kinds of data. These will\n", "be shown in this user guide section.\n", "\n", "## Collection\n", "For **survey** data the core data structure is a [`geost.Collection`](../api_reference/collection.rst) which holds the data together in a header and data table. Basically, these tables contain:\n", "\n", "* **header**: metadata and spatial information per survey.\n", "* **data**: contains the logged data of surveys.\n", "\n", "The header and data tables have a one-to-many relationship: one survey (e.g. borehole) is\n", "one row in the header and multiple rows in the data. \n", "\n", "Typically available types of subsurface data comprise point-like data such as boreholes,\n", "CPTs, well logs or line-like data such as seismic, GPR, EM. These can all be held in a\n", "Collection. 
By default, the available read functions for different types of survey data\n", "return a Collection (see: [Survey data](./survey_data.ipynb)).\n", "\n", "While working with a Collection, selections change the contents of the header and data tables\n", "but the Collection automatically maintains alignment between the two. Therefore, users can safely\n", "select and analyse the data while being sure of consistency. It is recommended to\n", "work with Collections by default, unless you specifically only need to work with either the\n", "header or data table. Let's first check out a `Collection` containing borehole data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geost\n", "\n", "# Load the Utrecht Science Park example borehole data\n", "collection = geost.data.boreholes_usp()\n", "collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of attributes and methods\n", "The [`Collection`](../api_reference/collection.rst) class implements many different attributes\n", "as well as analysis and selection methods. A short summary of relevant attributes and methods\n", "is presented here. A full overview can be found in the [Collection API reference](../api_reference/collection.rst).\n", "\n", "**Attributes**\n", "- `data`: Data table of the Collection\n", "- `header`: Header table of the Collection\n", "- `crs`: Current coordinate reference system (CRS) of the Collection\n", "- `vertical_datum`: Current vertical datum of the Collection\n", "\n", "**Reference systems**\n", "- `set_crs`: Set the CRS of the collection\n", "- `to_crs`: Convert current collection CRS to the specified CRS\n", "- `set_vertical_datum`: Similar to `set_crs`, but for vertical datum (e.g. NAP = EPSG:5709)\n", "- `to_vertical_datum`: Similar to `to_crs`, but for vertical datum. 
Currently not implemented.\n", "\n", "**Analysis**\n", "- `get_cumulative_thickness`: Returns the cumulative thickness of any specified criteria.\n", "- `get_layer_top`: Return the top depth at which a specified layer occurs.\n", "\n", "**Spatial selections**\n", "- `select_within_bbox` - Select data points in the Collection within a bounding box\n", "- `select_with_points` - Select data points in the Collection within distance to other point geometries\n", "- `select_with_lines` - Select data points in the Collection within distance from line geometries\n", "- `select_within_polygons` - Select data points in the Collection within polygon geometries\n", "\n", "**Conditional selections**\n", "- `select_by_values` - Select data points in the Collection based on the presence of certain values in one or more of the data columns\n", "- `select_by_length` - Select data points in the Collection based on length requirements \n", "- `select_by_depth` - Select data points in the Collection based on depth constraints\n", "\n", "**Slicing**\n", "- `slice_depth_interval` - Slice boreholes in the Collection down to the specified depth interval\n", "- `slice_by_values` - Slice boreholes in the Collection based on value (e.g. only sand layers, remove others)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Header table\n", "In a Collection, the header table is always a [`geopandas.GeoDataFrame`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html) instance. The header holds the spatial location of each survey in the \"geometry\" column, and metadata such as the surface level, end-depth and others. 
Let's check out the header:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Type of header: {type(collection.header)}\")\n", "collection.header.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of the `Collection` above, the geometry column contains [`shapely.Point`](https://shapely.readthedocs.io/en/stable/reference/shapely.Point.html) geometries. In case of 2D line\n", "data (e.g. seismic data) this would contain `shapely.LineString` geometries. Each entry\n", "in the header (row in the `GeoDataFrame`) corresponds to one specific survey: one borehole or one seismic line.\n", "\n", "The header is not limited to just the columns you see above. Any number of columns can be added\n", "to the header to provide additional information on surveys. Different analysis methods can be used for\n", "this, see for example the [`Collection.spatial_join`](../api_reference/generated/geost.base.Collection.spatial_join.rst) method.\n", "\n", "If you're only interested in survey locations and/or metadata, it is advised to work directly\n", "with the header object to avoid the (small) additional overhead caused by a Collection \n", "object. This overhead is caused by checks of the header against data after operations to \n", "ensure alignment between the two. If you are only working with a header or data table, GeoST functionality is still available through a `DataFrame` and `GeoDataFrame` accessor which we will cover in a later [section](#geost-accessor).\n", "\n", "### Data table\n", "The data table is generally a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) instance, although this may also be a `GeoDataFrame`. The data table holds all the logged data of any survey. 
Let's first check it out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Type of data: {type(collection.data)}\")\n", "collection.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that a single survey contains multiple rows, which in this case are \"layers\" as we are\n", "looking at borehole data. With survey data we can generally distinguish between **\"layered\"** and **\"discrete\"** data:\n", "\n", "* **Layered** data contains data that is logged in terms of layers (i.e. depth intervals over which properties are the same) with **\"top\"** and **\"bottom\"** information for each layer.\n", "* **Discrete** data contains data that is logged over discrete intervals (e.g. every 20 cm) with **\"depth\"** information for each depth interval. \n", "\n", "GeoST can treat both these types of data interchangeably and we provide ways to combine one\n", "with the other. As with the header table, the data table is not limited to the columns you see\n", "above and any number of columns can be added during analysis.\n", "\n", "Also, if you are only interested in the survey measurements and don't need to work with geometries or any other additional header data, it is advised to directly work with the data table to avoid the overhead from a Collection (i.e. maintaining header/data alignment). See the [accessor section](#geost-accessor). Several read functions provide the option to return a `DataFrame` instead of a\n", "`Collection` by setting `as_collection=False` when using the function.\n", "\n", "\n", "### Positional columns\n", "GeoST requires several data columns, referred to as \"positional columns\", to be present to ensure that the methods in a `Collection` will work. The required columns differ per type of method. 
For example, [`Collection.slice_depth_interval`](../api_reference/generated/geost.base.Collection.slice_depth_interval.rst) requires depth information about the surface level of surveys and the depth of layers while [`Collection.select_with_points`](../api_reference/generated/geost.base.Collection.select_with_points.rst) requires a valid geometry. Neither depth information nor a geometry is mandatory in general: `slice_depth_interval` does not need a geometry to work and `select_with_points` does not need depth information. These columns are therefore only required when you want to use a method that depends on them. This design was chosen to give the most flexibility to users with different needs. The only mandatory column is one which identifies each individual survey (e.g. \"nr\")\n", "in both the header and data table. \n", "\n", "```{note}\n", "See the [Positional columns](./survey_data.ipynb#positional-columns) section on the\n", "[Survey data](./survey_data.ipynb) page of this User guide for specific information on\n", "positional columns.\n", "```\n", "\n", "## GeoST accessor\n", "When you only need to work with one of the header or data tables, all the methods\n", "available in Collections are also available when you work directly with the header [GeoDataFrame](https://geopandas.org/en/stable/docs/reference/geodataframe.html) and data [DataFrame](https://pandas.pydata.org/docs/reference/frame.html). This is achieved through a so-called \"accessor\". Under the hood, every Collection method also uses this accessor.\n", "\n", "Below we briefly demonstrate how these work by comparing the usage of Collection methods with those from the accessors. The accessor can be used by calling [`.gst`](../api_reference/accessor.rst)\n", "on a `GeoDataFrame` or `DataFrame`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create separate `header` and `data` variables for the demonstration.\n", "header = collection.header\n", "data = collection.data\n", "\n", "print(header.gst)\n", "print(data.gst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, for both the header and data table it prints \"geost.accessor.GeostFrame object\".\n", "This is the object where most GeoST methods are actually implemented, which is why any method available\n", "in a Collection is also accessible through the `.gst` accessor.\n", "\n", "Let's first compare the usage of [`select_within_bbox`](../api_reference/generated/geost.accessor.GeostFrame.select_within_bbox.rst), which is a method that typically operates on the header\n", "table since it is a spatial method and the header contains the spatial information about the\n", "surveys. As the name suggests, this selects the surveys which are located within a specific\n", "bounding box extent. All we need to do to call this method on the \"header\" `GeoDataFrame` is\n", "to use `.gst` in between as shown below."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "collection_select = collection.select_within_bbox(139_500, 455_000, 140_000, 455_500)\n", "header_select = header.gst.select_within_bbox(139_500, 455_000, 140_000, 455_500)\n", "\n", "print(\"collection_select.header:\", collection_select.header, sep=\"\\n\")\n", "print(\"header_select:\", header_select, sep=\"\\n\")\n", "\n", "header_select.gst # Selection result also has the .gst accessor and methods available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the selection result is exactly the same and, after selecting from the header\n", "table, the `.gst` accessor remains available in the selection result, for example for making further\n", "selections or chaining several selection methods.\n", "\n", "Methods that typically operate on the data table, for example [`slice_by_values`](../api_reference/generated/geost.accessor.GeostFrame.slice_by_values.rst), work exactly the same\n", "way. All we need to do to call this method on the \"data\" `DataFrame` is to use [`.gst`](../api_reference/accessor.rst) in between as shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Keep only the layers that have clay (\"K\") as the main lithology.\n", "collection_select = collection.slice_by_values(\"lith\", \"K\")\n", "data_select = data.gst.slice_by_values(\"lith\", \"K\")\n", "\n", "print(\"collection_select.data:\", collection_select.data[[\"nr\", \"lith\"]], sep=\"\\n\")\n", "print(\"data_select:\", data_select[[\"nr\", \"lith\"]], sep=\"\\n\")\n", "\n", "data_select.gst # Selection result also has the .gst accessor and methods available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again the selection result is exactly the same and after the selection, the [`.gst`](../api_reference/accessor.rst) accessor remains available."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model data\n", "GeoST also supports working with model data and offers methods to combine these data with\n", "point and line data. Model data does not follow the same header/data approach as point\n", "and line data. Instead there are generic model classes, some of which have an\n", "implementation that adds specific functionality for that model. An example of this is\n", "the [`geost.models.VoxelModel`](../api_reference/voxelmodel.rst) as a generic model class and [`GeoTOP`](../api_reference/bro_geotop.rst) being a specific implementation of a `VoxelModel`.\n", "GeoST currently supports the following generic models and implementations:\n", "\n", "**Generic models and implementations**\n", "* [`VoxelModel`](../api_reference/voxelmodel.rst): Class for 3D voxel models, with data \n", "stored in the `ds` attribute, an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html).\n", " * Implementations: [`GeoTOP`](../api_reference/bro_geotop.rst)\n", "* *`LayerModel`*: Class for layer models, not yet implemented\n", " * Implementations: None\n", "\n", "
\n",
"
\n",
"