Reading XML files#
This example demonstrates how GeoST extracts data from BRO XML files and how additional data that is not extracted by default can be retrieved by extending GeoST's default approach.
GeoST reads data from XML files by using a schema that describes the structure of the file (i.e. where to find specific elements) and determines where the information ends up in the resulting GeoST object, for instance a BoreholeCollection. Under the hood, GeoST uses the lxml library for XML parsing, which this tutorial also uses to show some basic principles.
Borehole and CPT data in the Netherlands is distributed via a public national database: the BRO (“Basisregistratie Ondergrond”). The BRO contains five main sources of subsurface data:
Geological boreholes (BHR-G objects)
Geotechnical boreholes (BHR-GT objects)
Pedological boreholes (BHR-P objects)
Cone Penetration Tests (CPT objects)
Pedological soil profile descriptions (SFR objects)
Downloading any of the above data sources from BROloket results in a set of XML files for the requested objects. GeoST can read the data in these XML files for each of the data sources (see the available read function for each object). XML data for these objects may also be provided by companies (e.g. Wiertsema); such files can differ slightly in structure from BRO XML files, but GeoST can work with them too. This tutorial demonstrates how data is extracted from an XML file using a BHR-G XML file downloaded from the BRO, but the principle is the same for the other data sources.
We will begin with the necessary imports.
from pathlib import Path
from lxml import etree
import geost
from geost.io import xml
bhrg_file = Path("../../data/bhrg_bro.xml") # BHR-G XML file for tutorial
Reading with default schema#
As mentioned, a schema is used to read the XML files, and GeoST contains predefined schemas for each data source. With a predefined schema, the most important header attributes (e.g. “nr”, “x”, “y”, “surface”) and the basic data attributes are retrieved. For BHR-G objects, the data attributes are “top”, “bottom” and the lithology (stored as “soilNameNEN5104”). Anyone who is only interested in basic attributes or a quick inspection can use the default reader functions without worrying about schemas. The code below shows how easily the BHR-G file can be read. The result is a BoreholeCollection that contains a header and a data attribute.
bhrg = geost.read_bhrg(bhrg_file)
print(bhrg.header)
print(bhrg.data)
print(f"Retrieved header attributes: {bhrg.header.gdf.columns}")
print(f"Retrieved data attributes: {bhrg.data.df.columns}")
PointHeader instance containing 1 objects
nr crs surface vertical_datum end \
0 BHR000000396406 urn:ogc:def:crs:EPSG::28992 0.69 NAP -2.31
x y geometry
0 126149.0 452162.0 POINT (126149 452162)
LayeredData instance:
nr x y surface end top bottom \
0 BHR000000396406 126149.0 452162.0 0.69 -2.31 0.00 0.25
1 BHR000000396406 126149.0 452162.0 0.69 -2.31 0.25 1.60
2 BHR000000396406 126149.0 452162.0 0.69 -2.31 1.60 2.00
3 BHR000000396406 126149.0 452162.0 0.69 -2.31 2.00 2.50
4 BHR000000396406 126149.0 452162.0 0.69 -2.31 2.50 3.00
soilNameNEN5104
0 zwakZandigeKlei
1 sterkSiltigeKlei
2 zwakZandigeKlei
3 sterkZandigeKlei
4 zwakSiltigZand
Retrieved header attributes: Index(['nr', 'crs', 'surface', 'vertical_datum', 'end', 'x', 'y', 'geometry'], dtype='object')
Retrieved data attributes: Index(['nr', 'x', 'y', 'surface', 'end', 'top', 'bottom', 'soilNameNEN5104'], dtype='object')
Schema#
We can see that the predefined schema already retrieves many of the important attributes from the XML file. However, some users may need additional attributes that are relevant to their questions. To retrieve these, we are going to explore a schema and the XML structure in more detail. Let’s first check out the predefined schema that geost.read_bhrg uses by default. We will use pprint to show the structure of the schema more clearly.
from pprint import pprint
# This would be `xml.schemas.bhrgt` for BHR-GT files
bhrg_schema = xml.schemas.bhrg["BRO"]
# Use `sort_dicts=False` to keep the order of the schema
pprint(bhrg_schema, sort_dicts=False)
{'payload_root': 'dispatchDocument',
'nr': {'xpath': 'brocom:broId'},
'location': {'xpath': 'deliveredLocation/bhrgcom:location/gml:Point/gml:pos',
'resolver': <function parse_coordinates at 0x7f9064ebeca0>,
'el-attr': 'text'},
'crs': {'xpath': 'deliveredLocation/bhrgcom:location/gml:Point',
'resolver': <function parse_crs at 0x7f9064ebec00>},
'surface': {'xpath': 'deliveredVerticalPosition/bhrgcom:offset',
'resolver': <function safe_float at 0x7f9064ebed40>,
'el-attr': 'text'},
'vertical_datum': {'xpath': 'deliveredVerticalPosition/bhrgcom:verticalDatum',
'el-attr': 'text'},
'end': {'xpath': 'boring/bhrgcom:Boring/bhrgcom:finalDepthBoring',
'resolver': <function safe_float at 0x7f9064ebed40>,
'el-attr': 'text'},
'data': {'xpath': 'boreholeSampleDescription/bhrgcom:BoreholeSampleDescription/bhrgcom:descriptiveBoreholeLog/bhrgcom:DescriptiveBoreholeLog',
'resolver': <function process_bhrgt_data at 0x7f9064ebefc0>,
'layer-attributes': ['upperBoundary',
'lowerBoundary',
'soilNameNEN5104']}}
You may already recognize some of the header attributes that were in the BoreholeCollection above, such as “nr”, “crs” and “surface”. The simplest way to explain the schema in relation to the resulting BoreholeCollection is that the “data” key determines what will be in the data table and all other keys end up in the header table (except “payload_root”, which only marks where the search starts). Each key in the schema maps to a sub-dictionary that “tells geost.read_bhrg what to do” for that key.
The dictionaries of keys that end up in the header can have the keys “xpath”, “resolver” and “el-attr” and as you can see, they can have one or all of them. Below is a short description of each key:
“xpath” : Path to the desired element in the XML structure. This always needs to be present, otherwise nothing will be retrieved.
“resolver” : Python function (if necessary) to extract data in a specific way. For example, change a text value to a float.
“el-attr” : XML elements can have attributes such as a “text” attribute. This ensures that the attribute is taken as the result or used as input for the resolver.
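The interplay of the three keys can be sketched in a few lines. This is not GeoST's implementation: `resolve_entry` is a hypothetical helper, the standard library's `xml.etree.ElementTree` stands in for lxml, and the fallback to the element's text when neither “resolver” nor “el-attr” is given is an assumption inferred from the output shown in this tutorial.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used to resolve "bhrgcom:" in the paths below.
NS = {"bhrgcom": "http://www.broservices.nl/xsd/bhrgcommon/3.1"}

XML = """
<deliveredVerticalPosition xmlns:bhrgcom="http://www.broservices.nl/xsd/bhrgcommon/3.1">
  <bhrgcom:offset uom="m">0.690</bhrgcom:offset>
  <bhrgcom:verticalDatum>NAP</bhrgcom:verticalDatum>
</deliveredVerticalPosition>
""".strip()


def resolve_entry(parent, entry, namespaces):
    """Hypothetical helper: apply one schema entry to a parent element."""
    element = parent.find(entry["xpath"], namespaces)
    if element is None:
        return None
    if "el-attr" in entry:
        # Take a Python attribute of the element object, e.g. element.text.
        value = getattr(element, entry["el-attr"])
    elif "resolver" in entry:
        # Assumption: without "el-attr" the resolver gets the element itself.
        value = element
    else:
        # Assumption: with only an "xpath", the text content is returned.
        return element.text
    resolver = entry.get("resolver")
    return resolver(value) if resolver else value


schema = {
    "surface": {"xpath": "bhrgcom:offset", "resolver": float, "el-attr": "text"},
    "vertical_datum": {"xpath": "bhrgcom:verticalDatum", "el-attr": "text"},
    "unit": {"xpath": "bhrgcom:offset", "resolver": lambda el: el.get("uom")},
}

root = ET.fromstring(XML)
header = {key: resolve_entry(root, entry, NS) for key, entry in schema.items()}
print(header)
```

The “unit” entry is made up for this sketch. It shows why a resolver may want the whole element rather than its text: the unit of measure lives in the XML attribute `uom`, not in the text content.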
The dictionary of the “data” key in the schema also contains “xpath” and “resolver” but “el-attr” cannot be used here. Instead, it contains a “layer-attributes” key which, as the name suggests, extracts the desired attributes for each borehole layer in the BHR-G object. For now, the predefined schema only extracted the “upperBoundary”, “lowerBoundary” and “soilNameNEN5104” of each layer but we will show how the schema above can be extended to retrieve additional data.
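To make the role of “layer-attributes” concrete, the sketch below collects the requested tags from every layer into one list per attribute, which is the shape of the “data” entry shown near the end of this tutorial. `collect_layer_attributes` is a hypothetical helper, not the resolver GeoST uses, and the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"bhrgcom": "http://www.broservices.nl/xsd/bhrgcommon/3.1"}

# Two layers, trimmed down from the tutorial file.
XML = """
<bhrgcom:DescriptiveBoreholeLog xmlns:bhrgcom="http://www.broservices.nl/xsd/bhrgcommon/3.1">
  <bhrgcom:layer>
    <bhrgcom:Layer>
      <bhrgcom:upperBoundary uom="m">0.000</bhrgcom:upperBoundary>
      <bhrgcom:lowerBoundary uom="m">0.250</bhrgcom:lowerBoundary>
      <bhrgcom:soil>
        <bhrgcom:soilNameNEN5104>zwakZandigeKlei</bhrgcom:soilNameNEN5104>
      </bhrgcom:soil>
    </bhrgcom:Layer>
  </bhrgcom:layer>
  <bhrgcom:layer>
    <bhrgcom:Layer>
      <bhrgcom:upperBoundary uom="m">0.250</bhrgcom:upperBoundary>
      <bhrgcom:lowerBoundary uom="m">1.600</bhrgcom:lowerBoundary>
      <bhrgcom:soil>
        <bhrgcom:soilNameNEN5104>sterkSiltigeKlei</bhrgcom:soilNameNEN5104>
      </bhrgcom:soil>
    </bhrgcom:Layer>
  </bhrgcom:layer>
</bhrgcom:DescriptiveBoreholeLog>
""".strip()


def collect_layer_attributes(log, attributes):
    """Hypothetical helper: build one list of values per requested attribute."""
    columns = defaultdict(list)
    for layer in log.iterfind("bhrgcom:layer/bhrgcom:Layer", NS):
        for name in attributes:
            # ".//" searches all descendants, so nested tags such as
            # soil/soilNameNEN5104 are found without spelling out the path.
            element = layer.find(f".//bhrgcom:{name}", NS)
            columns[name].append(None if element is None else element.text)
    return columns


log = ET.fromstring(XML)
columns = collect_layer_attributes(
    log, ["upperBoundary", "lowerBoundary", "soilNameNEN5104"]
)
print(dict(columns))
```

Appending `None` for a missing tag keeps all lists the same length, so each list position still corresponds to one layer.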
Retrieving additional attributes#
Since “upperBoundary”, “lowerBoundary” and “soilNameNEN5104” might not be enough information for every purpose, you may want to retrieve additional layer attributes. Let’s check out an element in the XML file that contains data for a single layer and see what additional attributes are present. Normally you could inspect the entire XML file in a text editor such as Notepad++, but in this tutorial we will retrieve and print the layer element using lxml.
xml_tree = etree.parse(
bhrg_file
).getroot() # Read the XML file and get the root element
# Unfortunately, the path to a layer element from the root is really long
path_layer_element = "dispatchDocument/BHR_G_O/boreholeSampleDescription/bhrgcom:BoreholeSampleDescription/bhrgcom:descriptiveBoreholeLog/bhrgcom:DescriptiveBoreholeLog/bhrgcom:layer/bhrgcom:Layer"
layer = xml_tree.find(path_layer_element, namespaces=xml_tree.nsmap)
print(etree.tostring(layer, pretty_print=True, encoding="unicode"))
<bhrgcom:Layer xmlns:bhrgcom="http://www.broservices.nl/xsd/bhrgcommon/3.1" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:brocom="http://www.broservices.nl/xsd/brocommon/3.0" xmlns="http://www.broservices.nl/xsd/dsbhrg/3.1" gml:id="BRO_0006">
<bhrgcom:upperBoundary uom="m">0.000</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">waargenomen</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">0.250</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">zwakZandigeKlei</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">matigHumeus</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
<bhrgcom:particularConstituent>
<bhrgcom:constituentType codeSpace="urn:bro:bhrg:ParticularConstituentType">puin</bhrgcom:constituentType>
<bhrgcom:archiveClass codeSpace="urn:bro:bhrg:ArchiveClass">onbekend</bhrgcom:archiveClass>
</bhrgcom:particularConstituent>
</bhrgcom:soil>
</bhrgcom:Layer>
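The candidate attribute names in a layer can also be listed programmatically instead of being read by eye. The sketch below walks a trimmed copy of the layer element and prints the tag name of every element without children. `leaf_tag_names` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` is used as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the layer element printed above.
XML = """
<bhrgcom:Layer xmlns:bhrgcom="http://www.broservices.nl/xsd/bhrgcommon/3.1">
  <bhrgcom:upperBoundary uom="m">0.000</bhrgcom:upperBoundary>
  <bhrgcom:lowerBoundary uom="m">0.250</bhrgcom:lowerBoundary>
  <bhrgcom:soil>
    <bhrgcom:soilNameNEN5104>zwakZandigeKlei</bhrgcom:soilNameNEN5104>
    <bhrgcom:gravelContentClass>onbekend</bhrgcom:gravelContentClass>
    <bhrgcom:colour>onbekend</bhrgcom:colour>
  </bhrgcom:soil>
</bhrgcom:Layer>
""".strip()


def leaf_tag_names(element):
    """Hypothetical helper: local tag names of all elements without children."""
    names = []
    for node in element.iter():
        if len(node) == 0:  # no child elements, so this node holds a value
            # Tags look like "{namespace-uri}name"; keep only the name part.
            names.append(node.tag.split("}", 1)[-1])
    return names


layer = ET.fromstring(XML)
print(leaf_tag_names(layer))
```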
As we can see, additional attributes are available, such as “gravelContentClass”, “carbonateContentClass” and “colour”. These can easily be retrieved by adding them to the list of “layer-attributes” in the schema and using the new schema as input to read the data. Let’s use the updated schema to read the XML file again and check out the data attribute of the resulting BoreholeCollection to verify that it works.
extra_attributes = ["gravelContentClass", "carbonateContentClass", "colour"]
bhrg_schema["data"]["layer-attributes"].extend(
extra_attributes
) # Add the list of extra attributes to the schema
bhrg = geost.read_bhrg(bhrg_file, schema=bhrg_schema)
print(bhrg.data)
LayeredData instance:
nr x y surface end top bottom \
0 BHR000000396406 126149.0 452162.0 0.69 -2.31 0.00 0.25
1 BHR000000396406 126149.0 452162.0 0.69 -2.31 0.25 1.60
2 BHR000000396406 126149.0 452162.0 0.69 -2.31 1.60 2.00
3 BHR000000396406 126149.0 452162.0 0.69 -2.31 2.00 2.50
4 BHR000000396406 126149.0 452162.0 0.69 -2.31 2.50 3.00
soilNameNEN5104 gravelContentClass carbonateContentClass colour
0 zwakZandigeKlei onbekend onbekend onbekend
1 sterkSiltigeKlei onbekend onbekend onbekend
2 zwakZandigeKlei onbekend onbekend onbekend
3 sterkZandigeKlei onbekend onbekend onbekend
4 zwakSiltigZand onbekend onbekend onbekend
Even though in this particular borehole the extra attributes have value “onbekend” (i.e. unknown) in every layer, we can see that they were retrieved from the XML file.
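One thing to keep in mind: `list.extend` modifies a list in place, so the example above changes the schema object that was retrieved from `xml.schemas.bhrg`. Whether GeoST hands out a fresh copy on each access is not shown in this tutorial. Assuming it may not, a defensive option is to copy the schema before extending it. The sketch below uses a small stand-in dictionary rather than the real schema.

```python
from copy import deepcopy

# Stand-in for a predefined schema; only the relevant part is included.
default_schema = {
    "data": {
        "layer-attributes": ["upperBoundary", "lowerBoundary", "soilNameNEN5104"],
    }
}

# Copy first, then extend the copy, so the default stays untouched.
custom_schema = deepcopy(default_schema)
custom_schema["data"]["layer-attributes"].extend(["gravelContentClass", "colour"])

print(default_schema["data"]["layer-attributes"])
print(custom_schema["data"]["layer-attributes"])
```

A shallow `dict.copy()` would not be enough here, because the nested “layer-attributes” list would still be shared between the two schemas.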
Extending the schema#
So far, we have shown how to extend a schema to retrieve additional layer attributes, but we can also extend the schema to retrieve other information that will be added to the header. Before we show how this can be done, it is worth inspecting the XML file itself.
# We can print the XML tree that we loaded before
print(etree.tostring(xml_tree, pretty_print=True, encoding="unicode"))
<dispatchDataResponse xmlns:bhrgcom="http://www.broservices.nl/xsd/bhrgcommon/3.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:brocom="http://www.broservices.nl/xsd/brocommon/3.0" xmlns="http://www.broservices.nl/xsd/dsbhrg/3.1" xmlns:gml="http://www.opengis.net/gml/3.2">
<brocom:responseType>dispatch</brocom:responseType>
<brocom:requestReference>-</brocom:requestReference>
<brocom:dispatchTime>2025-08-04T11:36:40+02:00</brocom:dispatchTime>
<dispatchDocument>
<BHR_G_O gml:id="BRO_0013">
<brocom:broId>BHR000000396406</brocom:broId>
<brocom:deliveryAccountableParty>50200097</brocom:deliveryAccountableParty>
<brocom:qualityRegime>IMBRO/A</brocom:qualityRegime>
<deliveryContext codeSpace="urn:bro:bhrg:DeliveryContext">archiefoverdracht</deliveryContext>
<surveyPurpose codeSpace="urn:bro:bhrg:SurveyPurpose">onbekend</surveyPurpose>
<discipline codeSpace="urn:bro:bhrg:Discipline">geologie</discipline>
<surveyProcedure codeSpace="urn:bro:bhrg:SurveyProcedure">geen</surveyProcedure>
<researchReportDate>
<brocom:date>2004-09-22</brocom:date>
</researchReportDate>
<registrationHistory>
<brocom:objectRegistrationTime>2025-02-04T17:41:43+01:00</brocom:objectRegistrationTime>
<brocom:registrationStatus codeSpace="urn:bro:RegistrationStatus">voltooid</brocom:registrationStatus>
<brocom:registrationCompletionTime>2025-02-04T17:41:43+01:00</brocom:registrationCompletionTime>
<brocom:corrected>nee</brocom:corrected>
<brocom:underReview>nee</brocom:underReview>
<brocom:deregistered>nee</brocom:deregistered>
<brocom:reregistered>nee</brocom:reregistered>
</registrationHistory>
<NITGCode>B31G2129</NITGCode>
<reportHistory>
<bhrgcom:event>
<bhrgcom:date>
<brocom:date>2004-09-22</brocom:date>
</bhrgcom:date>
<bhrgcom:name codeSpace="urn:bro:bhrg:EventName">volledigGerapporteerd</bhrgcom:name>
</bhrgcom:event>
</reportHistory>
<deliveredLocation>
<bhrgcom:location>
<gml:Point srsName="urn:ogc:def:crs:EPSG::28992" gml:id="BRO_0001">
<gml:pos>126149.000 452162.000</gml:pos>
</gml:Point>
</bhrgcom:location>
<bhrgcom:horizontalPositioningDate>
<brocom:date>2004-09-22</brocom:date>
</bhrgcom:horizontalPositioningDate>
<bhrgcom:horizontalPositioningMethod codeSpace="urn:bro:bhrg:HorizontalPositioningMethod">onbekend</bhrgcom:horizontalPositioningMethod>
</deliveredLocation>
<deliveredVerticalPosition>
<bhrgcom:localVerticalReferencePoint codeSpace="urn:bro:bhrg:LocalVerticalReferencePoint">maaiveld</bhrgcom:localVerticalReferencePoint>
<bhrgcom:offset uom="m">0.690</bhrgcom:offset>
<bhrgcom:verticalDatum codeSpace="urn:bro:bhrg:VerticalDatum">NAP</bhrgcom:verticalDatum>
<bhrgcom:verticalPositioningDate>
<brocom:date>2004-09-22</brocom:date>
</bhrgcom:verticalPositioningDate>
<bhrgcom:verticalPositioningMethod codeSpace="urn:bro:bhrg:VerticalPositioningMethod">onbekend</bhrgcom:verticalPositioningMethod>
</deliveredVerticalPosition>
<standardizedLocation>
<brocom:location srsName="urn:ogc:def:crs:EPSG::4258" gml:id="BRO_0002">
<gml:pos>52.057009967 4.966538731</gml:pos>
</brocom:location>
<brocom:coordinateTransformation codeSpace="urn:bro:CoordinateTransformation">RDNAPTRANS2018</brocom:coordinateTransformation>
</standardizedLocation>
<boring>
<bhrgcom:Boring gml:id="BRO_0005">
<bhrgcom:boringStartDate>
<brocom:date>2004-09-22</brocom:date>
</bhrgcom:boringStartDate>
<bhrgcom:boringEndDate>
<brocom:voidReason>onbekend</brocom:voidReason>
</bhrgcom:boringEndDate>
<bhrgcom:preparation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" codeSpace="urn:bro:bhrg:Preparation" xsi:nil="true"/>
<bhrgcom:trajectoryExcavated>nee</bhrgcom:trajectoryExcavated>
<bhrgcom:rockReached>nee</bhrgcom:rockReached>
<bhrgcom:boringProcedure codeSpace="urn:bro:bhrg:BoringProcedure">onbekend</bhrgcom:boringProcedure>
<bhrgcom:finalDepthBoring uom="m">3.00</bhrgcom:finalDepthBoring>
<bhrgcom:stopCriterion codeSpace="urn:bro:bhrg:StopCriterionField">onbekend</bhrgcom:stopCriterion>
<bhrgcom:samplingProcedure codeSpace="urn:bro:bhrg:SamplingProcedure">onbekend</bhrgcom:samplingProcedure>
<bhrgcom:finalDepthSampling xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" uom="m" xsi:nil="true"/>
<bhrgcom:subsurfaceContaminated xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
<bhrgcom:boreholeCompleted>onbekend</bhrgcom:boreholeCompleted>
<bhrgcom:boredInterval>
<bhrgcom:BoredInterval gml:id="BRO_0003">
<bhrgcom:beginDepth uom="m">0.00</bhrgcom:beginDepth>
<bhrgcom:endDepth uom="m">3.00</bhrgcom:endDepth>
<bhrgcom:boringTechnique codeSpace="urn:bro:bhrg:BoringTechnique">handDraaien</bhrgcom:boringTechnique>
<bhrgcom:boredDiameter xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" uom="mm" xsi:nil="true"/>
</bhrgcom:BoredInterval>
</bhrgcom:boredInterval>
<bhrgcom:sampledInterval>
<bhrgcom:SampledInterval gml:id="BRO_0004">
<bhrgcom:beginDepth uom="m">0.00</bhrgcom:beginDepth>
<bhrgcom:endDepth uom="m">3.00</bhrgcom:endDepth>
<bhrgcom:preTreatment codeSpace="urn:bro:bhrg:PreTreatment">onbekend</bhrgcom:preTreatment>
<bhrgcom:samplingMethod codeSpace="urn:bro:bhrg:SamplingMethod">opDiepteLosroeren</bhrgcom:samplingMethod>
<bhrgcom:samplingQuality codeSpace="urn:bro:bhrg:SamplingQuality">geroerd</bhrgcom:samplingQuality>
</bhrgcom:SampledInterval>
</bhrgcom:sampledInterval>
</bhrgcom:Boring>
</boring>
<boreholeSampleDescription>
<bhrgcom:BoreholeSampleDescription gml:id="BRO_0012">
<bhrgcom:descriptionReportDate>
<brocom:date>2004-09-22</brocom:date>
</bhrgcom:descriptionReportDate>
<bhrgcom:descriptionProcedure codeSpace="urn:bro:bhrg:DescriptionProcedure">NEN5104plusOnbekend</bhrgcom:descriptionProcedure>
<bhrgcom:descriptiveBoreholeLog>
<bhrgcom:DescriptiveBoreholeLog gml:id="BRO_0011">
<bhrgcom:descriptionQuality codeSpace="urn:bro:bhrg:DescriptionQuality">geologischNEN5104Archief</bhrgcom:descriptionQuality>
<bhrgcom:describedSamplesQuality codeSpace="urn:bro:bhrg:DescribedSamplesQuality">geroerd</bhrgcom:describedSamplesQuality>
<bhrgcom:continuouslySampled>ja</bhrgcom:continuouslySampled>
<bhrgcom:descriptionLocation codeSpace="urn:bro:bhrg:DescriptionLocation">veld</bhrgcom:descriptionLocation>
<bhrgcom:describedMaterial codeSpace="urn:bro:bhrg:DescribedMaterial">grond</bhrgcom:describedMaterial>
<bhrgcom:sampleMoistness codeSpace="urn:bro:bhrg:SampleMoistness">onbekend</bhrgcom:sampleMoistness>
<bhrgcom:layer>
<bhrgcom:Layer gml:id="BRO_0006">
<bhrgcom:upperBoundary uom="m">0.000</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">waargenomen</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">0.250</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">zwakZandigeKlei</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">matigHumeus</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
<bhrgcom:particularConstituent>
<bhrgcom:constituentType codeSpace="urn:bro:bhrg:ParticularConstituentType">puin</bhrgcom:constituentType>
<bhrgcom:archiveClass codeSpace="urn:bro:bhrg:ArchiveClass">onbekend</bhrgcom:archiveClass>
</bhrgcom:particularConstituent>
</bhrgcom:soil>
</bhrgcom:Layer>
</bhrgcom:layer>
<bhrgcom:layer>
<bhrgcom:Layer gml:id="BRO_0007">
<bhrgcom:upperBoundary uom="m">0.250</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">1.600</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">sterkSiltigeKlei</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">zwakHumeus</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
</bhrgcom:soil>
</bhrgcom:Layer>
</bhrgcom:layer>
<bhrgcom:layer>
<bhrgcom:Layer gml:id="BRO_0008">
<bhrgcom:upperBoundary uom="m">1.600</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">2.000</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">zwakZandigeKlei</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">zwakHumeus</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
</bhrgcom:soil>
</bhrgcom:Layer>
</bhrgcom:layer>
<bhrgcom:layer>
<bhrgcom:Layer gml:id="BRO_0009">
<bhrgcom:upperBoundary uom="m">2.000</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">2.500</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">sterkZandigeKlei</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">zwakHumeus</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
</bhrgcom:soil>
</bhrgcom:Layer>
</bhrgcom:layer>
<bhrgcom:layer>
<bhrgcom:Layer gml:id="BRO_0010">
<bhrgcom:upperBoundary uom="m">2.500</bhrgcom:upperBoundary>
<bhrgcom:upperBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">onbekend</bhrgcom:upperBoundaryDetermination>
<bhrgcom:lowerBoundary uom="m">3.000</bhrgcom:lowerBoundary>
<bhrgcom:lowerBoundaryDetermination codeSpace="urn:bro:bhrg:BoundaryPositioningMethod">voorbepaald</bhrgcom:lowerBoundaryDetermination>
<bhrgcom:anthropogenic>onbekend</bhrgcom:anthropogenic>
<bhrgcom:rooted>onbekend</bhrgcom:rooted>
<bhrgcom:soil>
<bhrgcom:soilNameNEN5104 codeSpace="urn:bro:bhrg:SoilNameNEN5104">zwakSiltigZand</bhrgcom:soilNameNEN5104>
<bhrgcom:gravelContentClass codeSpace="urn:bro:bhrg:GravelContentClass">onbekend</bhrgcom:gravelContentClass>
<bhrgcom:organicMatterContentClassNEN5104 codeSpace="urn:bro:bhrg:OrganicMatterContentClassNEN5104">onbekend</bhrgcom:organicMatterContentClassNEN5104>
<bhrgcom:carbonateContentClass codeSpace="urn:bro:bhrg:CarbonateContentClass">onbekend</bhrgcom:carbonateContentClass>
<bhrgcom:colour codeSpace="urn:bro:bhrg:Colour">onbekend</bhrgcom:colour>
<bhrgcom:sandFraction>
<bhrgcom:sandMedianClass codeSpace="urn:bro:bhrg:SandMedianClass">matigFijnNEN5104</bhrgcom:sandMedianClass>
<bhrgcom:sandSorting codeSpace="urn:bro:bhrg:SandSorting">onbekend</bhrgcom:sandSorting>
</bhrgcom:sandFraction>
</bhrgcom:soil>
</bhrgcom:Layer>
</bhrgcom:layer>
</bhrgcom:DescriptiveBoreholeLog>
</bhrgcom:descriptiveBoreholeLog>
</bhrgcom:BoreholeSampleDescription>
</boreholeSampleDescription>
</BHR_G_O>
</dispatchDocument>
</dispatchDataResponse>
By inspecting the XML tree we can see where specific elements can be found and how these relate to the “xpath” keys in the schema. The indentation level of the XML tree represents the hierarchy of so-called “elements” and “child-elements”.
The geost.read_bhrg function begins searching for the data from the element <dispatchDocument> because the schema defines this as the “payload_root”. This element can contain one or more child-elements which hold the actual borehole data: <BHR_G_O gml:id="BRO_xxxx"> elements. When the “payload_root” is not defined in an input schema, the function will begin to search from the root element of the tree (“dispatchDataResponse”), which may lead to unexpected results. We can see that the ID of the borehole is in the <brocom:broId> element and that this was used for the “xpath” entry to retrieve “nr” as the borehole ID.
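The effect of the “payload_root” can be illustrated with a small sketch. It is not GeoST's code: `iter_payloads` is a hypothetical helper, the prefix `ds` for the file's default namespace is chosen only for this example, the second borehole ID is invented, and the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET

NS = {
    "ds": "http://www.broservices.nl/xsd/dsbhrg/3.1",  # default namespace of the file
    "brocom": "http://www.broservices.nl/xsd/brocommon/3.0",
}

# Heavily trimmed response; the second ID is invented for illustration.
XML = """
<dispatchDataResponse xmlns="http://www.broservices.nl/xsd/dsbhrg/3.1"
    xmlns:brocom="http://www.broservices.nl/xsd/brocommon/3.0">
  <brocom:responseType>dispatch</brocom:responseType>
  <dispatchDocument>
    <BHR_G_O><brocom:broId>BHR000000396406</brocom:broId></BHR_G_O>
    <BHR_G_O><brocom:broId>BHR000000000001</brocom:broId></BHR_G_O>
  </dispatchDocument>
</dispatchDataResponse>
""".strip()


def iter_payloads(root, payload_root):
    """Hypothetical helper: yield each child of every payload root element."""
    for container in root.iterfind(f"ds:{payload_root}", NS):
        yield from container  # iterating an element yields its children


root = ET.fromstring(XML)

# Relative to each payload object, the short path from the schema works.
ids = [obj.find("brocom:broId", NS).text for obj in iter_payloads(root, "dispatchDocument")]
print(ids)

# Relative to the document root, the same path finds nothing.
print(root.find("brocom:broId", NS))
```

Because every “xpath” in the schema is relative, the starting element decides whether a path matches at all, which is why an omitted “payload_root” can produce empty or unexpected results.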
We can extend the schema to find additional header information by adding an extra dictionary with at least an “xpath” entry and, if needed, entries for “resolver” and/or “el-attr”. If we look at the complete XML tree above, we see that it has an element <registrationHistory> with child-elements containing registration dates of this BHR-G object. Let’s assume for now that this is relevant for us and that we therefore want to retrieve it as extra header information. We can achieve this by extending the schema as shown below. We will first retrieve the data only by specifying where it is in the XML tree, and then read the file again.
date = {"xpath": "registrationHistory/brocom:registrationCompletionTime"}
bhrg_schema["date"] = date
bhrg = geost.read_bhrg(bhrg_file, schema=bhrg_schema)
print(bhrg.header) # Print the header to see the new date attribute
print(
f"\nDtype of date: {bhrg.header['date'].dtypes}"
) # Print the data types of the header attributes
PointHeader instance containing 1 objects
nr crs surface vertical_datum end \
0 BHR000000396406 urn:ogc:def:crs:EPSG::28992 0.69 NAP -2.31
date x y geometry
0 2025-02-04T17:41:43+01:00 126149.0 452162.0 POINT (126149 452162)
Dtype of date: object
Adding your own resolver#
The date has now been added to the header of the BoreholeCollection. However, the resulting datatype of the date is “object”, which is a “string-like” datatype in Pandas DataFrames. Since it is a date, it may be more convenient to create a “datetime” type directly. We can achieve this by adding our own “resolver” function that translates the text attribute of the date element we retrieved into a datetime type. Let’s write our helper function and update the schema accordingly.
import pandas as pd


def to_datetime(date_str):
    """Convert a date string to a pandas datetime object."""
    return pd.to_datetime(date_str)


date = {
    "xpath": "registrationHistory/brocom:registrationCompletionTime",
    "resolver": to_datetime,
    "el-attr": "text",  # This ensures we use the text content of the element
}
bhrg_schema["date"] = date
bhrg = geost.read_bhrg(bhrg_file, schema=bhrg_schema)
print(bhrg.header) # Print the header to see the new date attribute
print(
f"\nDtype of date: {bhrg.header['date'].dtypes}"
) # Print the data types of the header attributes
PointHeader instance containing 1 objects
nr crs surface vertical_datum end \
0 BHR000000396406 urn:ogc:def:crs:EPSG::28992 0.69 NAP -2.31
date x y geometry
0 2025-02-04 17:41:43+01:00 126149.0 452162.0 POINT (126149 452162)
Dtype of date: datetime64[ns, UTC+01:00]
Now the datatype is “datetime64[ns, UTC+01:00]”, which is more convenient in operations that involve date selections, for example. Of course, the datatype of the date could have been changed in other ways, without adding a resolver, but in some cases the data in elements needs to be processed in such specific ways that adding our own resolvers is necessary.
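A resolver can also guard against values it cannot parse. The sketch below is a more defensive variant that returns `None` instead of raising an error. `safe_datetime` is a hypothetical name, and whether GeoST ever passes `None` or a placeholder such as “onbekend” to a resolver is an assumption. It uses only the standard library and therefore returns a `datetime.datetime` rather than a pandas Timestamp.

```python
from datetime import datetime


def safe_datetime(text):
    """Hypothetical resolver: ISO 8601 text to datetime, or None if unusable."""
    if text is None:
        return None
    try:
        return datetime.fromisoformat(text.strip())
    except ValueError:
        # Placeholder values such as "onbekend" are not valid timestamps.
        return None


print(safe_datetime("2025-02-04T17:41:43+01:00"))
print(safe_datetime("onbekend"))
print(safe_datetime(None))
```

Such a function could be used in a schema entry in the same way as `to_datetime` above, by passing it as the “resolver”.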
Very specific needs#
Sometimes the needs become so specific that GeoST can no longer load the data into Collection objects. This may occur, for example, when resolvers return complex data structures that cannot easily be added as rows in DataFrames. This is not immediately a problem, because GeoST still provides functionality to parse the XML files. However, this does not directly result in a Collection object but returns a dictionary with the data retrieved according to the schema.
Let’s assume for now that the bhrg_schema has become too specific and the normal reader functions are no longer sufficient. The code below shows how you can still use GeoST to retrieve the data for further use in your own application, or continue to process it into something that is compatible with GeoST objects again.
data = xml.read_bhrg(
bhrg_file, schema=bhrg_schema
) # bhrg_schema is now our "too specific" schema
pprint(data, sort_dicts=False)
{'nr': 'BHR000000396406',
'location': (126149.0, 452162.0),
'crs': 'urn:ogc:def:crs:EPSG::28992',
'surface': 0.69,
'vertical_datum': 'NAP',
'end': 3.0,
'data': defaultdict(<class 'list'>,
{'upperBoundary': ['0.000',
'0.250',
'1.600',
'2.000',
'2.500'],
'lowerBoundary': ['0.250',
'1.600',
'2.000',
'2.500',
'3.000'],
'soilNameNEN5104': ['zwakZandigeKlei',
'sterkSiltigeKlei',
'zwakZandigeKlei',
'sterkZandigeKlei',
'zwakSiltigZand'],
'gravelContentClass': ['onbekend',
'onbekend',
'onbekend',
'onbekend',
'onbekend'],
'carbonateContentClass': ['onbekend',
'onbekend',
'onbekend',
'onbekend',
'onbekend'],
'colour': ['onbekend',
'onbekend',
'onbekend',
'onbekend',
'onbekend']}),
'date': Timestamp('2025-02-04 17:41:43+0100', tz='UTC+01:00')}
As you can see, a clean dictionary is still returned with data for each key you have added to your schema, so you can continue working with the data. As mentioned at the beginning of this tutorial, reading XML files for the other data sources (BHR-GT, BHR-P, CPT and SFR) works in exactly the same way and offers the same possibilities.
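As an illustration of continuing to work with such a dictionary, the sketch below turns the column-wise “data” entry into one record per layer. `to_rows` is a hypothetical helper, and the input is a shortened copy of the dictionary printed above. Note that the boundaries arrive as strings in that dictionary, so they are converted to floats here, and the renaming to “top” and “bottom” simply mirrors the column names seen in the BoreholeCollection.

```python
# Shortened copy of the dictionary returned by the parser (three layers only).
data = {
    "nr": "BHR000000396406",
    "surface": 0.69,
    "data": {
        "upperBoundary": ["0.000", "0.250", "1.600"],
        "lowerBoundary": ["0.250", "1.600", "2.000"],
        "soilNameNEN5104": ["zwakZandigeKlei", "sterkSiltigeKlei", "zwakZandigeKlei"],
    },
}


def to_rows(parsed):
    """Hypothetical helper: turn column-wise layer data into one dict per layer."""
    layers = parsed["data"]
    names = list(layers)
    rows = []
    # zip(*columns) walks the columns in parallel, yielding one layer at a time.
    for values in zip(*(layers[name] for name in names)):
        row = {"nr": parsed["nr"], **dict(zip(names, values))}
        row["top"] = float(row.pop("upperBoundary"))
        row["bottom"] = float(row.pop("lowerBoundary"))
        rows.append(row)
    return rows


rows = to_rows(data)
for row in rows:
    print(row)
```

A list of records like this can be passed to `pandas.DataFrame(rows)` for further processing.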