{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Survey data\n",
    "GeoST offer various functions to read and parse subsurface data to GeoST objects. In general,\n",
    "survey data can either be loaded from (local) files or is requested from a service like\n",
    "the BRO REST-API. Either way, data coming from multiple sources and file formats are support. \n",
    "\n",
    "## Reading survey data\n",
    "By survey data we refer to measurements (i.e. raw data) of the subsurface. These can comprise\n",
    "boreholes, CPTs, Well logs, Seismic or EM lines and others (see [Data structures](./data_structures.ipynb#data-columns) for a more detailed description). In any case, the data is parsed to a [`geost.Collection`](../api_reference/collection.rst). The tables below list the currently supported data sources, associated reader functions and resulting GeoST objects.\n",
    "\n",
    "| File format/data service | Read function  | Returned GeoST object | Description  |\n",
    "| ------------------------ | -------------- | --------------------- | -----------  |    \n",
    "| BHR-G | [`read_bhrg`](../api_reference/generated/geost.read_bhrg.rst) | [`Collection`](../api_reference/collection.rst) | (BRO) Geological boreholes from xml |\n",
    "| BHR-GT | [`read_bhrgt`](../api_reference/generated/geost.read_bhrgt.rst) | [`Collection`](../api_reference/collection.rst) | (BRO) Geotechnical boreholes from xml |\n",
    "| BHR-GT-samples | [`read_bhrgt_samples`](../api_reference/generated/geost.read_bhrgt_samples.rst) | [`Collection`](../api_reference/collection.rst) | (BRO) Geotechnical boreholes - grainsize samples from xml |\n",
    "| BHR-P | [`read_bhrp`](../api_reference/generated/geost.read_bhrp.rst) | [`Collection`](../api_reference/collection.rst) | (BRO) Pedological boreholes from xml |\n",
    "| CPT | [`read_cpt`](../api_reference/generated/geost.read_cpt.rst) [`read_gef_cpts`](../api_reference/generated/geost.read_gef_cpts.rst) |  [`Collection`](../api_reference/collection.rst) | (BRO) Cone Penetration Tests from xml or gef |\n",
    "| SFR | [`read_sfr`](../api_reference/generated/geost.read_sfr.rst) | [`Collection`](../api_reference/collection.rst) | (BRO) Pedological soilprofile descriptions from xml |\n",
    "| BRO REST-API | [`bro_api_read`](../api_reference/generated/geost.bro_api_read.rst) | [`Collection`](../api_reference/collection.rst) | BRO BHR-G, BHR-GT, BHR-GT-samples, BHR-P, CPT or SFR objects |\n",
    "| Parquet or csv | [`read_table`](../api_reference/generated/geost.read_table.rst) [`read_borehole_table`](../api_reference/generated/geost.read_borehole_table.rst) [`read_cpt_table`](../api_reference/generated/geost.read_cpt_table.rst) | [`Collection`](../api_reference/collection.rst) or [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) | Survey data stored as a table. Result of `to_parquet` or `to_csv` export methods.\n",
    "| NLOG excel export | [`read_nlog_cores`](../api_reference/generated/geost.read_nlog_cores.rst) | [`Collection`](../api_reference/collection.rst) or [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) | Reader for NLOG deep cores, see [here](https://www.nlog.nl/boringen) |\n",
    "| UU LLG cores | [`read_uullg_tables`](../api_reference/generated/geost.read_uullg_tables.rst) | [`Collection`](../api_reference/collection.rst) or [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) | Reader for csv distribution of Utrecht University student boreholes |\n",
    "| BORIS XML | [`read_xml_boris`](../api_reference/generated/geost.read_xml_boris.rst) | [`Collection`](../api_reference/collection.rst) or [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html) | Reader for XML exports of the BORIS borehole description software |\n",
    "\n",
    "### Reading data from the BRO REST-API\n",
    "Subsurface data is widely available in the Netherlands via the [portal](https://www.broloket.nl/ondergrondgegevens) of the Basis Registratie Ondergrond (**BRO**). GeoST can directly load\n",
    "this data for an area of interest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import geost\n",
    "\n",
    "# Read a few BRO pedological soil cores in a small area 250 m x 500 m\n",
    "boreholes = geost.bro_api_read(\"BHR-P\", bbox=(142_000, 455_000, 142_250, 455_500))\n",
    "boreholes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that this loads the soil cores as a `geost.Collection`. This is also supported for geological (BHR-G) and geotechnical (BHR-GT) boreholes, cone penetration test (CPT) data and pedological soil profile descriptions (SFR). This facilitates the\n",
    "direct use of BRO data within any application.\n",
    "\n",
    "### Reading from local files\n",
    "A likely option is to use GeoST to load survey data stored in a tabular format such as\n",
    "Parquet or csv and use the available selection and analysis methods. For example, suppose you\n",
    "have survey data for multiple boreholes stored in a local Parquet file. Using\n",
    "[`geost.read_table`](../api_reference/generated/geost.read_table.rst) you can\n",
    "very easily load it into a `Collection` or if preferred, a `pandas.DataFrame` and use the data\n",
    "for further analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "borehole_file = geost.data.boreholes_usp(\n",
    "    return_filepath=True\n",
    ")  # Use the filepath instead of directly reading the borehole data\n",
    "boreholes = geost.read_table(borehole_file)\n",
    "print(boreholes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can easily select the boreholes that contain peat (\"V\") for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "peat_boreholes = boreholes.select_by_values(\"lith\", \"V\")\n",
    "peat_boreholes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same thing is possible too with Pandas DataFrames. Note that with DataFrames, any GeoST\n",
    "methods need to be used through the `.gst` accessor, like we showed in the [Data structures](./data_structures.ipynb#geost-accessor)\n",
    "section. This way we can select the boreholes that contain peat just like before. Let's load\n",
    "the same borehole data, but this time as a `pandas.DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "borehole_df = geost.read_table(borehole_file, as_collection=False)\n",
    "borehole_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can make the same selection using the `.gst` accessor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "peat_df = borehole_df.gst.select_by_values(\"lith\", \"V\")\n",
    "peat_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the shape of `peat_df` is the same as before in the data table of the `peat_boreholes` Collection: 670 rows x 32 columns.\n",
    "The selection result is also the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "peat_boreholes.data.equals(peat_df)  # Check equality of the two dataframes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, GeoST functionality can be used interchangeably on `geost.Collection` objects\n",
    "or `DataFrame` objects.\n",
    "\n",
    "## Positional columns\n",
    "GeoST requires that several data columns are present to ensure that the methods in a `Collection` will work.\n",
    "These are called \"positional columns\". The required positional columns differ per type of method. For example,\n",
    "[`Collection.slice_depth_interval`](../api_reference/generated/geost.base.Collection.slice_depth_interval.rst)\n",
    "requires depth information about the surface level of surveys and the depth of layers while\n",
    "[`Collection.select_with_points`](../api_reference/generated/geost.base.Collection.select_with_points.rst)\n",
    "requires a valid geometry. The presence of depth information or a valid geometry is needed\n",
    "for both methods to work however, both presences are optional. The method `slice_depth_interval`\n",
    "does not need a geometry to work and `select_with_points` does not need depth information.\n",
    "Therefore, their presence is only required when you want to use one of these methods. This\n",
    "was chosen as design to ensure the most flexibility for users with different needs.\n",
    "\n",
    "The only mandatory presence is the positional column which identifies each individual survey\n",
    "(e.g. \"nr\") in both the header and data table. The table below shows the required positional\n",
    "columns for all methods to work.\n",
    "\n",
    "| Name | dtype | Description | Mandatory |\n",
    "| ---- | ----- | ----------- | --------- |\n",
    "| nr | int, float, string | Identification name/number/code of the point survey | Yes |\n",
    "| x | int, float | X-, Easting- or lon-coordinate | No |\n",
    "| y | int, float | Y-, Northing- or lat-coordinate | No |\n",
    "| surface | int, float | Surface elevation of the point survey in m +NAP | In methods involving depth |\n",
    "| end | int, float | End depth of a point survey in m +NAP | No |\n",
    "| geometry | `shapely.geometry.Point` in case of point data | Geometry object of the survey location | In spatial methods |\n",
    "| depth/bottom | int, float | Depth of a measurement or bottom depth of a layer with respect to the surface level | In methods involving depth |\n",
    "| top | int, float | Top depth of a layer with respect to the surface level | No, is used in methods involving depth when the survey data is contains layered information |\n",
    "\n",
    "```{note}\n",
    "The names for the postional columns in the table above are chosen as they are as these are\n",
    "well-descriptive for the type of information they provide. However, it is not mandatory\n",
    "that the positional columns are named exactly like the names in the table above.\n",
    "```\n",
    "\n",
    "GeoST automatically determines which columns to use as positional columns. These can be\n",
    "checked with any `DataFrame` by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "borehole_df.gst.positional_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table below shows which columns are automatically recognized as positional columns by GeoST.\n",
    "\n",
    "| Name | Recognized names |\n",
    "| ---- | ---------------- |\n",
    "| nr | \"nr\", \"bro_id\", \"nitg_nr\", \"nitg\", \"boorp\" |\n",
    "| x | \"x\", \"x-coord\", \"longitude\", \"lon\", \"easting\", \"x_bottom_rd\", \"x_rd_crd\", \"x_calc_crd\" |\n",
    "| y | \"y\", \"y-coord\", \"latitude\", \"lat\", \"northing\", \"y_bottom_rd\", \"y_rd_crd\", \"y_calc_crd\" |\n",
    "| surface | \"surface\", \"maaiveld\", \"mv\", \"height_nap\", \"surface_nap\" |\n",
    "| end | \"end\", \"einddiepte\", \"einddiepte_nap\", \"end_depth\", \"end_depth_nap\" |\n",
    "| top | \"top\", \"tv_top_nap\", \"top_diepte\", \"top_depth\", \"upperboundary\" |\n",
    "| depth | \"depth\", \"bottom\", tv_bottom_nap\", \"basis_diepte\", \"bottom_depth\", lowerboundary\" |\n",
    "\n",
    "```{note}\n",
    "**geometry** is not included in the positional columns as this uses the **\"active geometry column\"**\n",
    "attribute of a [`geopandas.GeoDataFrame`](https://geopandas.org/en/stable/docs/user_guide/data_structures.html#geodataframe)\n",
    "```\n",
    "```{note}\n",
    "The recognized names are case-insensitive as each name is checked in lowercase form.\n",
    "```\n",
    "\n",
    "To see how this works, let's rename the \"x\" and \"y\" columns in `borehole_df` and check the\n",
    "positional columns again: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "borehole_df.rename(columns={\"x\": \"Longitude\", \"y\": \"Latitude\"}, inplace=True)\n",
    "borehole_df.gst.positional_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that the new names are automatically picked up as positional columns. If one of\n",
    "the positional columns is not picked up, `None` is returned. We rename the \"surface\" column\n",
    "to some unknown name to show this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "borehole_df.rename(columns={\"surface\": \"unknown-surface-name\"}, inplace=True)\n",
    "borehole_df.gst.positional_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since \"surface\" is not recognized anymore, trying an analysis method which needs depth\n",
    "would now raise a `KeyError`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    borehole_df.gst.slice_depth_interval(0, 10)\n",
    "    print(\"Slicing successful\")  # Only prints if no error is raised\n",
    "except KeyError as e:\n",
    "    print(e)  # Print the error message instead of actually raising the error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To solve this problem, you can choose to rename your columns to recognized names. Read functions such as [`geost.read_table`](../api_reference/generated/geost.read_table.rst) have a `column_mapper` parameter which takes a dictionary with\n",
    "columns to rename and returns the data with the renamed columns. Read functions raise a `UserWarning` if any of the optional positional columns cannot be found and a `KeyError` if the column identifying surveys cannot be found.\n",
    "\n",
    "Alternatively, GeoST provides a simple way to add any column name to be recognized as a positional column name so you can make all functionality work for any kind of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "geost.add_positional_columns({\"surface\": \"unknown-surface-name\"})\n",
    "borehole_df.gst.positional_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, the previously unrecognized name works again.\n",
    "\n",
    "```{note}\n",
    "Use the `persist=True` keyword argument to store the provided column aliases in a user-specific\n",
    "configuration file so they are automatically recognized in future sessions.\n",
    "```\n",
    "\n",
    "## Use with generic Geopandas/Pandas\n",
    "The `.gst` accessor also works on any GeoDataFrame or any DataFrame instance as long as it\n",
    "contains columns which can be recognized as positional columns, see the [previous](#positional-columns)\n",
    "section. Therefore, any data that has been loaded or created without GeoST can also use\n",
    "the provided functionality via the accessor.\n",
    "\n",
    "Let's demonstrate this with a simple `geopandas.GeoDataFrame` containing two points:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import geopandas as gpd\n",
    "\n",
    "gdf = gpd.GeoDataFrame(\n",
    "    {\"nr\": [1, 2]}, geometry=gpd.points_from_xy([1, 10], [1, 20]), crs=28992\n",
    ")\n",
    "print(gdf)\n",
    "print(\"\\nSelection result:\")\n",
    "print(gdf.gst.select_within_bbox(0, 0, 2, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This also works for any `pandas.DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame(\n",
    "    {\"nr\": [\"a\", \"a\"], \"top\": [0, 1], \"bottom\": [1, 2], \"lith\": [\"clay\", \"sand\"]}\n",
    ")\n",
    "print(df)\n",
    "print(\"\\nSelection result:\")\n",
    "print(df.gst.slice_by_values(\"lith\", \"clay\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "default",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}