Data structures#

GeoST uses standardized internal data structures and data validation so that its functionality can be applied reliably. Before we go into the relevant data structures, we introduce some terminology. Subsurface data comes in many different forms (e.g. boreholes, CPTs or 3D models), but we distinguish between two main types:

  • Survey: the actual measurements (i.e. raw data) of the subsurface. These can comprise boreholes, CPTs, well logs, seismic or EM lines and others.

  • Model: for example a 3D voxel or layer model, or a geological map. These are the result of analyses using survey data (e.g. kriging interpolation) and as such are considered to be an interpretation of the subsurface.

GeoST provides different data structures to handle these different kinds of data. These will be shown in this user guide section.

Collection#

For survey data the core data structure is a geost.Collection which holds the data together in a header and data table. Basically, these tables contain:

  • header: metadata and spatial information per survey.

  • data: contains the logged data of surveys.

The header and data tables have a one-to-many relationship: one survey (e.g. borehole) is one row in the header and multiple rows in the data.
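To make the one-to-many relationship concrete, the sketch below builds two small plain pandas tables by hand. The borehole identifiers and values are made up for illustration and the tables are not a real Collection; only the "nr" identifier column name is taken from the example data in this guide.

```python
import pandas as pd

# Hypothetical header: one row per survey (values invented for illustration).
header = pd.DataFrame(
    {
        "nr": ["BH-1", "BH-2"],
        "x": [100.0, 200.0],
        "y": [50.0, 75.0],
        "surface": [1.2, 0.8],
    }
)

# Hypothetical data table: several rows (layers) per survey.
data = pd.DataFrame(
    {
        "nr": ["BH-1", "BH-1", "BH-1", "BH-2", "BH-2"],
        "top": [0.0, 0.5, 2.0, 0.0, 1.0],
        "bottom": [0.5, 2.0, 3.5, 1.0, 4.0],
        "lith": ["K", "V", "Z", "Z", "K"],
    }
)

# Number of data rows that belong to each header row.
rows_per_survey = data.groupby("nr").size()
print(rows_per_survey)

# `validate` makes pandas raise a MergeError if "nr" is not unique in the
# header, i.e. if the relationship is not one-to-many.
joined = header.merge(data, on="nr", validate="one_to_many")
print(joined[["nr", "surface", "top", "bottom", "lith"]])
```

Each header row is repeated once for every matching data row in the joined table, which is why the joined table has as many rows as the data table.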

Typically available types of subsurface data comprise point-like data such as boreholes, CPTs and well logs, or line-like data such as seismic, GPR and EM lines. These can all be held in a Collection. By default, the available read functions for different types of survey data return a Collection (see: Survey data).

While working with a Collection, selections change the contents of the header and data tables, but the Collection automatically maintains alignment between the two. Users can therefore safely select and analyse the data while being sure of consistency. It is recommended to work with a Collection by default, unless you specifically only need to work with either the header or data table. Let’s first check out a Collection containing borehole data:

import geost

# Load the Utrecht Science Park example borehole data
collection = geost.data.boreholes_usp()
collection
Collection
  header (rows, columns) : (67, 5)
  data (rows, columns)   : (1398, 32)
crs: Amersfoort / RD New
vertical datum: NAP height

Overview of attributes and methods#

The Collection class implements many different attributes as well as analysis and selection methods. A short summary of relevant attributes and methods is presented here. A full overview can be found in the Collection API reference.

Attributes

  • data: Data table of the Collection

  • header: Header table of the Collection

  • crs: Current coordinate reference system (CRS) of the Collection

  • vertical_datum: Current vertical datum of the Collection

Reference systems

  • set_crs: Set the CRS of the collection

  • to_crs: Convert current collection CRS to the specified CRS

  • set_vertical_datum: Similar to set_crs, but for vertical datum (e.g. NAP = EPSG:5709)

  • to_vertical_datum: Similar to to_crs, but for vertical datum. Currently not implemented.

Analysis

  • get_cumulative_thickness: Returns the cumulative thickness of any specified criteria.

  • get_layer_top: Return the top depth at which a specified layer occurs.

Spatial selections

  • select_within_bbox - Select data points in the Collection within a bounding box

  • select_with_points - Select data points in the Collection within distance to other point geometries

  • select_with_lines - Select data points in the Collection within distance from line geometries

  • select_within_polygons - Select data points in the Collection within polygon geometries

Conditional selections

  • select_by_values - Select data points in the Collection based on the presence of certain values in one or more of the data columns

  • select_by_length - Select data points in the Collection based on length requirements

  • select_by_depth - Select data points in the Collection based on depth constraints

Slicing

  • slice_depth_interval - Slice boreholes in the Collection down to the specified depth interval

  • slice_by_values - Slice boreholes in the Collection based on value (e.g. only sand layers, remove others).

Header table#

In a Collection, the header table is always a geopandas.GeoDataFrame instance. The header holds the spatial location of each survey in the “geometry” column, and metadata such as the surface level, end-depth and others. Let’s check out the header:

print(f"Type of header: {type(collection.header)}")
collection.header.head()
Type of header: <class 'geopandas.geodataframe.GeoDataFrame'>
nr x y surface geometry
0 B31H0541 139585 456000 1.20 POINT (139585 456000)
1 B31H0611 139600 455060 1.20 POINT (139600 455060)
2 B31H0718 139950 455200 1.30 POINT (139950 455200)
3 B31H0803 139675 455087 2.16 POINT (139675 455087)
4 B31H0806 139684 455384 1.00 POINT (139684 455384)

In the case of the Collection above, the geometry column contains shapely.Point geometries. In case of 2D line data (e.g. seismic data) this would contain shapely.LineString geometries. Each entry in the header (row in the GeoDataFrame) corresponds to one specific survey: one borehole or one seismic line.

The header is not limited to just the columns you see above. Any number of columns can be added to the header to provide additional information on surveys. Different analysis methods can be used for this, see for example the Collection.spatial_join method.

If you’re only interested in survey locations and/or metadata, it is advised to work directly with the header object to avoid the (small) additional overhead caused by a Collection object. This overhead is caused by checks of the header against the data after operations to ensure alignment between the two. If you are only working with a header or data table, GeoST functionality is still available through a DataFrame and GeoDataFrame accessor, which we will cover in a later section.
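The alignment check described above can be pictured with plain pandas tables. This is only a sketch of the idea under simplified assumptions (hypothetical survey identifiers and a single "nr" key column); it is not GeoST's actual implementation.

```python
import pandas as pd

# Hypothetical header and data tables that share the "nr" identifier.
header = pd.DataFrame(
    {"nr": ["BH-1", "BH-2", "BH-3"], "surface": [1.2, 0.8, 1.0]}
)
data = pd.DataFrame(
    {
        "nr": ["BH-1", "BH-1", "BH-2", "BH-3", "BH-3"],
        "top": [0.0, 1.0, 0.0, 0.0, 2.0],
        "bottom": [1.0, 3.0, 2.5, 2.0, 5.0],
    }
)

# A selection on the header alone (surface level of at least 1.0)...
header_selected = header[header["surface"] >= 1.0]

# ...leaves the data table out of sync until it is filtered on the same
# survey identifiers.
data_aligned = data[data["nr"].isin(header_selected["nr"])]

# After realignment both tables describe exactly the same set of surveys.
print(set(header_selected["nr"]) == set(data_aligned["nr"]))
```

Working with the header alone skips the second filtering step, which is the kind of overhead the paragraph above refers to.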

Data table#

The data table is generally a pandas.DataFrame instance, although it may also be a GeoDataFrame. The data table holds all the logged data of any survey. Let’s first check it out:

print(f"Type of data: {type(collection.data)}")
collection.data
Type of data: <class 'pandas.DataFrame'>
nr x y surface end top bottom lith zm zmk ... cons color lutum_pct plants shells kleibrokjes strat_1975 strat_2003 strat_inter desc
0 B31H0541 139585 456000 1.200 -9.900 0.00 0.20 K NaN NaN ... NaN ON NaN 0 0 0 NaN EC NaN [TEELAARDE#***#****#*] ..........................
1 B31H0541 139585 456000 1.200 -9.900 0.20 0.60 K NaN NaN ... NaN BR NaN 0 0 0 NaN EC NaN [KLEI#***#****#*] grysbruin.
2 B31H0541 139585 456000 1.200 -9.900 0.60 0.95 V NaN NaN ... NaN BR NaN 0 0 0 NaN NI NaN [VEEN#***#****#*] donkerbruin.
3 B31H0541 139585 456000 1.200 -9.900 0.95 2.80 Z NaN ZMFO ... NaN GR NaN 0 0 0 NaN EC NaN [ZAND#***#****#*] FYN TOT matig fyn# iets slib...
4 B31H0541 139585 456000 1.200 -9.900 2.80 4.20 Z NaN ZFC ... NaN BR NaN 0 0 0 NaN BXWI NaN [ZAND#***#****#*] fyn# grysbruin.
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1393 B32C1893 141266 455989 2.328 -2.172 0.00 0.50 Z NaN ZZF ... NaN BR NaN 0 0 0 NaN NaN NaN BRON:GEF-BESTAND;0.00;0.50;Zs3h2 PU2;ZZF;BR;
1394 B32C1893 141266 455989 2.328 -2.172 0.50 1.10 Z NaN ZZF ... NaN GR NaN 0 0 0 NaN NaN NaN BRON:GEF-BESTAND;0.50;1.10;Zs3 KLE2;ZZF;GR DO;
1395 B32C1893 141266 455989 2.328 -2.172 1.10 1.40 K NaN NaN ... CMST GR NaN 0 0 0 NaN NaN NaN BRON:GEF-BESTAND;1.10;1.40;Ks2h1;;GR TBR;KMST
1396 B32C1893 141266 455989 2.328 -2.172 1.40 2.10 Z NaN ZZF ... NaN GR NaN 0 0 0 NaN NaN NaN BRON:GEF-BESTAND;1.40;2.10;Zs2 SLI1;ZZF;GR DO;
1397 B32C1893 141266 455989 2.328 -2.172 2.10 4.50 Z NaN ZZF ... NaN GR NaN 0 0 0 NaN NaN NaN BRON:GEF-BESTAND;2.10;4.50;Zs2;ZZF;GR;

1398 rows × 32 columns

Here we see that a single survey contains multiple rows, which in this case are “layers” as we are looking at borehole data. With survey data we can generally distinguish between “layered” and “discrete” data:

  • Layered data contains data that is logged in terms of layers (i.e. depth intervals over which properties are the same) with “top” and “bottom” information for each layer.

  • Discrete data contains data that is logged over discrete intervals (e.g. every 20 cm) with “depth” information for each depth interval.

GeoST can treat both these types of data interchangeably and we provide ways to combine one with the other. As with the header table, the data table is not limited to columns you see above and any number of columns can be added during analysis.
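The difference between the two kinds of data can be illustrated in plain Python. The sketch below samples a hypothetical layered log at a fixed depth step to obtain a discrete log. The function name `layered_to_discrete`, the 0.5 m step and the layer values are assumptions for illustration and do not represent how GeoST itself combines the two.

```python
# Layered log: each row is a (top, bottom, lithology) interval, with depths in
# metres below the surface. Values are invented for illustration.
layers = [
    (0.0, 0.5, "K"),
    (0.5, 2.0, "V"),
    (2.0, 3.0, "Z"),
]


def layered_to_discrete(layers, step):
    """Sample a layered log at regular depths: 0, step, 2 * step, ..."""
    end_depth = max(bottom for _, bottom, _ in layers)
    samples = []
    i = 0
    while i * step < end_depth:
        depth = i * step
        for top, bottom, lith in layers:
            # A sample belongs to the layer whose half-open interval
            # [top, bottom) contains it.
            if top <= depth < bottom:
                samples.append((depth, lith))
                break
        i += 1
    return samples


discrete = layered_to_discrete(layers, step=0.5)
for depth, lith in discrete:
    print(f"depth={depth:.1f} lith={lith}")
```

The layered form stores three rows for this log, while the discrete form stores one row per sampled depth and repeats the lithology wherever a layer spans several samples.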

Also, if you are only interested in the survey measurements and don’t need to work with geometries or any other additional header data, it is advised to work directly with the data table to avoid the overhead from a Collection (i.e. maintaining header/data alignment). See the accessor section. Several read functions provide the option to return a DataFrame instead of a Collection by setting as_collection=False when using the function.

Positional columns#

GeoST requires certain columns, referred to as “positional columns”, to be present for the methods of a Collection to work. Which columns are required differs per method. For example, Collection.slice_depth_interval requires depth information (the surface level of surveys and the depth of layers), while Collection.select_with_points requires a valid geometry. Neither is mandatory in general: slice_depth_interval does not need a geometry and select_with_points does not need depth information, so each is only required when you want to use a method that depends on it. This design was chosen to give users with different needs the most flexibility. The only mandatory column is one that identifies each individual survey (e.g. “nr”) in both the header and data table.
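The idea that each method only needs its own positional columns can be sketched as a simple lookup. The `REQUIRED_COLUMNS` mapping and the `missing_columns` helper below are hypothetical and only illustrate the principle; the column names are borrowed from the example tables in this guide and are not a statement of GeoST's actual requirements.

```python
# Hypothetical requirements per method, for illustration only.
REQUIRED_COLUMNS = {
    "slice_depth_interval": {"nr", "surface", "top", "bottom"},
    "select_with_points": {"nr", "geometry"},
}


def missing_columns(method, available):
    """Return the required columns for `method` that are not in `available`."""
    return sorted(REQUIRED_COLUMNS[method] - set(available))


# A table with depth information but without a geometry column.
columns = ["nr", "surface", "top", "bottom", "lith"]

# Depth-based slicing has everything it needs.
print(missing_columns("slice_depth_interval", columns))

# A spatial selection would additionally need a geometry.
print(missing_columns("select_with_points", columns))
```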

Note

See the Positional columns section on the Survey data page of this User guide for specific information on positional columns.

GeoST accessor#

When you only need to work with one of the header or data tables, all the methods available in Collections are also available when you work directly with the header GeoDataFrame and data DataFrame. This is achieved through a so-called “accessor”. Under the hood, every Collection method also uses this accessor.

Below we briefly demonstrate how these work by comparing the usage of Collection methods with those from the accessors. The accessor can be used by calling .gst on a GeoDataFrame or DataFrame.

# Create separate `header` and `data` variables for the demonstration.
header = collection.header
data = collection.data

print(header.gst)
print(data.gst)
<geost.accessor.GeostFrame object at 0x7f750d5ad220>
<geost.accessor.GeostFrame object at 0x7f750d593150>

As you can see, for both the header and data table it prints “geost.accessor.GeostFrame object”. This is the object where most GeoST methods are actually implemented, which is why any method available in a Collection is also accessible through the .gst accessor.
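pandas offers a public mechanism for this kind of extension, pandas.api.extensions.register_dataframe_accessor. The sketch below registers a hypothetical accessor named `demo` with a single method to show the general pattern. The class name, accessor name and method are made up for illustration, and whether GeoST registers .gst in exactly this way is an assumption.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Hypothetical accessor, reachable as `df.demo` on any DataFrame."""

    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is called on.
        self._obj = pandas_obj

    def rows_equal(self, column, value):
        # Return only the rows where `column` equals `value`.
        return self._obj[self._obj[column] == value]


df = pd.DataFrame({"nr": ["BH-1", "BH-1", "BH-2"], "lith": ["K", "Z", "K"]})

selected = df.demo.rows_equal("lith", "K")
print(selected)

# The result is again a DataFrame, so the accessor can be used for chaining.
chained = selected.demo.rows_equal("nr", "BH-2")
print(chained)
```

Because a geopandas.GeoDataFrame is a subclass of pandas.DataFrame, an accessor registered this way is available on both kinds of table.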

Let’s first compare the usage of select_within_bbox, a method that typically operates on the header table: it is a spatial method and the header contains the spatial information about the surveys. As the name suggests, it selects the surveys located within a specific bounding box extent. All we need to do to call this method on the “header” GeoDataFrame is to put .gst in between, as shown below.

collection_select = collection.select_within_bbox(139_500, 455_000, 140_000, 455_500)
header_select = header.gst.select_within_bbox(139_500, 455_000, 140_000, 455_500)

print("collection_select.header:", collection_select.header, sep="\n")
print("header_select:", header_select, sep="\n")

header_select.gst  # Selection result also has the .gst accessor and methods available
collection_select.header:
          nr       x       y  surface               geometry
1   B31H0611  139600  455060    1.200  POINT (139600 455060)
2   B31H0718  139950  455200    1.300  POINT (139950 455200)
3   B31H0803  139675  455087    2.160  POINT (139675 455087)
4   B31H0806  139684  455384    1.000  POINT (139684 455384)
5   B31H0807  139684  455405    1.000  POINT (139684 455405)
8   B31H0810  139901  455401    1.000  POINT (139901 455401)
19  B31H1694  139520  455125    1.200  POINT (139520 455125)
20  B31H1695  139660  455290    1.100  POINT (139660 455290)
21  B31H1696  139820  455020    1.300  POINT (139820 455020)
25  B31H3284  139621  455080    2.010  POINT (139621 455080)
26  B31H3285  139521  455181    2.157  POINT (139521 455181)
27  B31H3286  139636  455194    1.492  POINT (139636 455194)
28  B31H3287  139647  455396    1.430  POINT (139647 455396)
29  B31H3288  139534  455397    1.560  POINT (139534 455397)
header_select:
          nr       x       y  surface               geometry
1   B31H0611  139600  455060    1.200  POINT (139600 455060)
2   B31H0718  139950  455200    1.300  POINT (139950 455200)
3   B31H0803  139675  455087    2.160  POINT (139675 455087)
4   B31H0806  139684  455384    1.000  POINT (139684 455384)
5   B31H0807  139684  455405    1.000  POINT (139684 455405)
8   B31H0810  139901  455401    1.000  POINT (139901 455401)
19  B31H1694  139520  455125    1.200  POINT (139520 455125)
20  B31H1695  139660  455290    1.100  POINT (139660 455290)
21  B31H1696  139820  455020    1.300  POINT (139820 455020)
25  B31H3284  139621  455080    2.010  POINT (139621 455080)
26  B31H3285  139521  455181    2.157  POINT (139521 455181)
27  B31H3286  139636  455194    1.492  POINT (139636 455194)
28  B31H3287  139647  455396    1.430  POINT (139647 455396)
29  B31H3288  139534  455397    1.560  POINT (139534 455397)
<geost.accessor.GeostFrame at 0x7f750d542e90>

We can see that the selection result is exactly the same. After selecting from the header table, the .gst accessor remains available on the result, for example to make further selections or to chain several selection methods.
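What a bounding-box selection does can be sketched in a few lines of plain Python, using the first five borehole locations from the header shown earlier. The helper `within_bbox` is hypothetical. The argument order (xmin, ymin, xmax, ymax) is inferred from the call and its result above, and treating points on the boundary as inside is an assumption of this sketch rather than documented GeoST behaviour.

```python
# (nr, x, y) of the first five boreholes in the example header.
locations = [
    ("B31H0541", 139585, 456000),
    ("B31H0611", 139600, 455060),
    ("B31H0718", 139950, 455200),
    ("B31H0803", 139675, 455087),
    ("B31H0806", 139684, 455384),
]


def within_bbox(x, y, xmin, ymin, xmax, ymax):
    """True if the point (x, y) lies inside the box, boundary included."""
    return xmin <= x <= xmax and ymin <= y <= ymax


bbox = (139_500, 455_000, 140_000, 455_500)
selected = [nr for nr, x, y in locations if within_bbox(x, y, *bbox)]
print(selected)
```

Borehole B31H0541 is dropped because its y coordinate (456000) lies above the upper bound of 455500, which is consistent with row 0 being absent from the selection result above.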

Methods that typically operate on the data table, for example slice_by_values, work in exactly the same way. All we need to do to call this method on the “data” DataFrame is to put .gst in between, as shown below.

# Keep only the layers in which clay ("K") is the main lithology.
collection_select = collection.slice_by_values("lith", "K")
data_select = data.gst.slice_by_values("lith", "K")

print("collection_select.data:", collection_select.data[["nr", "lith"]], sep="\n")
print("data_select:", data_select[["nr", "lith"]], sep="\n")

data_select.gst  # Selection result also has the .gst accessor and methods available
collection_select.data:
            nr lith
0     B31H0541    K
1     B31H0541    K
8     B31H0611    K
9     B31H0611    K
27    B31H0718    K
...        ...  ...
1370  B32C1881    K
1375  B32C1889    K
1384  B32C1891    K
1385  B32C1891    K
1395  B32C1893    K

[261 rows x 2 columns]
data_select:
            nr lith
0     B31H0541    K
1     B31H0541    K
8     B31H0611    K
9     B31H0611    K
27    B31H0718    K
...        ...  ...
1370  B32C1881    K
1375  B32C1889    K
1384  B32C1891    K
1385  B32C1891    K
1395  B32C1893    K

[261 rows x 2 columns]
<geost.accessor.GeostFrame at 0x7f750d51a2a0>

Again the selection result is exactly the same and after the selection, the .gst accessor remains available.

Model data#

GeoST also supports working with model data and offers methods to combine these data with point and line data. Model data does not follow the same header/data approach as point and line data. Instead there are generic model classes, some of which have an implementation that adds functionality specific to that model. For example, geost.models.VoxelModel is a generic model class and GeoTOP is a specific implementation of a VoxelModel. GeoST currently supports the following generic models and implementations:

Generic models and implementations

  • VoxelModel: Class for 3D voxel models, with data stored in the ds attribute, an xarray.Dataset.

  • LayerModel: Class for layer models, not yet implemented

    • Implementations: None

Figure: GeoST vmodel object hierarchy

Voxel models#

The VoxelModel class stores data in the ds attribute, which is an xarray.Dataset. A custom voxel model can be instantiated from a NetCDF file. For this, see the documentation of the VoxelModel.from_netcdf class constructor. An instance of VoxelModel offers basic methods for selecting, slicing and exporting models.

For more guidance on using a voxel model within GeoST, see the BRO GeoTOP section in the user guide.