GeoParquet Fun
GeoParquet files and a SedonaDB/DuckDB rabbit hole
One of the nice things about working for GeoPlace is the Continued Professional Development (CPD) time I get to use during my paid working hours.
Over last few weeks I’ve been diving into the world of spatially matching USRNs to other datasets and have been trying to come up with faster ways to do this.
I’ve learnt a few things along the way and this is a small post that attempts to summarise it.
In a nutshell:
-
GeoParquet files are great but can be a bit spatially chaotic if not optimised.
-
You can reduce/remove this chaoticness by using a few techniques:
- Calculating a bounding box struct per row.
- Hilbert curve sorting the data.
- Adding covering metadata (GeoParquet version 1.1).
-
This allows for row group pruning/spatial predicate pushdown to reduce the amount of data that needs to be read in and processed for a given query.
-
SedonaDB and DuckDB can make use of these optimisations.
As a result:
-
With an optimised GeoParquet file it takes SedonaDB around 0.068s to retrieve the USRNs in Liverpool when we apply an ST_Intersects spatial filter with a bbox over that area.
-
With a non-optimised GeoParquet file it takes around 0.322s.
-
One way reads in only 160,000 rows (and 8 out of 89 row groups), the other the full 1,760,546 (and all row groups).
-
It’s easy to picture how this scales the larger the datasets become.
The main idea is that we want to isolate the spatially relevant data as quickly as possible.
A lot of this thinking has fed into this python package -> usrn-matcher
NOTE: The work I’ve done doesn’t account for some of the advances made in the GeoParquet 2.0 spec: https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/.
Getting your spatial ducks in a row
So, take for example the 1.7 million USRNs from the Ordnance Survey Open USRN dataset - it’s currently provided as a GeoPackage.
The aim is to get this data into a GeoParquet format that has been optimised for spatial queries.
How do we do it?
I currently use a simple combination of DuckDB and add a small patching function in the following way:
- Calculate a bbox struct per row (so that Parquet file row group statistics can make use of them)
- Order by a ST_Hilbert sort (against a bbox representing the full British National Grid area) - this is important see here: https://github.com/duckdb/duckdb-spatial/discussions/419
- Patch the required covering metadata
- Bump the GeoParquet version to 1.1
con.execute(f"""
COPY (
SELECT
* EXCLUDE "{src_geom}",
"{src_geom}" AS geometry,
{{
'xmin': ST_XMin("{src_geom}"),
'ymin': ST_YMin("{src_geom}"),
'xmax': ST_XMax("{src_geom}"),
'ymax': ST_YMax("{src_geom}")
}} AS bbox
FROM st_read('{config.source_path}')
ORDER BY ST_Hilbert("{src_geom}", {_BNG_BOX})
) TO '{config.parquet_path}'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE {config.row_group_size})
""")
_patch_covering_metadata(path=config.parquet_path, row_group=config.row_group_size, crs=config.crs)
The covering metadata is essentially something like this:
_EXPECTED_COVERING_METADATA: _Covering = {
"bbox": {
"xmin": ["bbox", "xmin"],
"ymin": ["bbox", "ymin"],
"xmax": ["bbox", "xmax"],
"ymax": ["bbox", "ymax"],
}
}
Inside a simple function, I bump the version and apply the covering metadata:
geo_meta["version"] = "1.1.0"
geo_meta["columns"][geom_col]["covering"] = _EXPECTED_COVERING_METADATA
The covering metadata is a simple piece of information that points to the bounding box struct column that we created alongside each row in the initial DuckDB query above.
Its main purpose is to enable spatial predicate pushdown.
This allows query engines, such as SedonaDB and DuckDB, to skip spatially irrelevant row groups using Parquet’s native min/max column statistics, without ever having to load the data in.
- Check the spec GeoParquet 1.1 here: https://geoparquet.org/releases/v1.1.0/
- The SedonaDB docs: https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
- A useful python package for handling Geoparquet files: https://geoparquet.io/concepts/geoparquet-overview/?h=covering#covering-metadata
I had a think about the best way to visualise this and gleaned a few ideas from the discussion in this GitHub thread: https://github.com/duckdb/duckdb-spatial/discussions/419
Here we have two images showing spatially sorted USRNs and randomly ordered USRNs.
The brighter one (unoptimised) is more expensive to spatially query, and the darker one (optimised) is less expensive.
In a Hilbert-sorted file, USRNs that are close in space are also close together in the file - most likely in the same row group or one very close by. See here for what a row group is: https://parquet.apache.org/docs/concepts/ & https://duckdb.org/docs/current/data/parquet/tips.
In a non Hilbert-sorted file, USRNs that are close in space are not close together in the file - most likely in row groups that are far apart.
The images demonstrate this in the following way:
-
Bright pixels = the USRN centroids in a given pixel come from many numerically different row groups in the file (stuff is spread out).
-
Dark pixels = the USRN centroids in a given pixel come from the same or numerically close by row groups in the file (stuff is compact).
Left: unoptimised (random order). Right: optimised (Hilbert-sorted). Brighter = row groups are scattered. Darker = row groups are spatially compact.
NOTE: Image contains OS Open USRN data. This is free to use under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2025.
NOTE: You do get dash of coloured lines here and there on the optimised output, but this is due the recursive nature of the Hilbert Sort algorithm that folds back on itself.
Getting your spatial ducks to waddle
I did a simple test locally to prove that it was quicker to query the optimised GeoParquet file.
As we now know, the optimised file:
- Stores a bbox struct alongside each geometry.
- Has covering metadata.
- Is spatially sorted.
This allows engines to skip row groups when reading the file.
The unoptimised file has no spatial ordering, so every query scans the full 1.76M rows.
The test below measured a single operation — reading USRNs from a Parquet file filtered to a city bounding box, no joins, no matching. The goal is to show how structuring a GeoParquet file affects how much data each engine reads.
NOTE: I’ll do a follow up post on how this also helps speed up how we do spatial joins.
Output Table
| Engine | File | Strategy | London | Liverpool |
|---|---|---|---|---|
| DuckDB | optimised | bbox struct (column stats) | 0.13s / 396K scanned | 0.02s / 158K scanned |
| SedonaDB | optimised | ST_Intersects + covering | 0.15s / 400K scanned | 0.07s / 160K scanned |
| DuckDB | optimised | ST_Intersects (no pruning) | 0.91s / 1.76M scanned | 0.80s / 1.76M scanned |
| DuckDB | unoptimised | ST_Intersects (no pruning) | 0.85s / 1.76M scanned | 0.82s / 1.76M scanned |
| SedonaDB | unoptimised | ST_Intersects (no pruning) | 0.32s / 1.76M scanned | 0.32s / 1.76M scanned |
London
[optimised] row_groups=89
DuckDB (bbox struct ) 0.131s out=93,328 / scanned=395,628 / file=1,760,546 rows
DuckDB (ST_Intersects) 0.908s out=93,321 / scanned=1,760,546 / file=1,760,546 rows ← no pruning
Sedona (ST_Intersects) 0.153s out=93,321 / scanned=400,000 / file=1,760,546 rows
pruning: row_groups_pruned_statistics=89/89 rgs row_groups_spatial_pruned=20/89 rgs
[unoptimised] row_groups=15
DuckDB (ST_Intersects) 0.853s out=93,321 / scanned=1,760,546 / file=1,760,546 rows
Sedona (ST_Intersects) 0.315s out=93,321 / scanned=1,760,546 / file=1,760,546 rows
pruning: row_groups_pruned_statistics=15/15 rgs row_groups_spatial_pruned=0/0 rgs
Liverpool bbox
[optimised] row_groups=89
DuckDB (bbox struct ) 0.024s out=12,676 / scanned=158,251 / file=1,760,546 rows
DuckDB (ST_Intersects) 0.795s out=12,668 / scanned=1,760,546 / file=1,760,546 rows ← no pruning takes place
Sedona (ST_Intersects) 0.068s out=12,668 / scanned=160,000 / file=1,760,546 rows
pruning: row_groups_pruned_statistics=89/89 rgs row_groups_spatial_pruned=8/89 rgs
[unoptimised] row_groups=15
DuckDB (ST_Intersects) 0.818s out=12,668 / scanned=1,760,546 / file=1,760,546 rows
Sedona (ST_Intersects) 0.322s out=12,668 / scanned=1,760,546 / file=1,760,546 rows
pruning: row_groups_pruned_statistics=15/15 rgs row_groups_spatial_pruned=0/0 rgs
Both engines prune row groups as expected with the optimised file and this results in the speed increase we’re looking for.
NOTE: I’ll do a follow up post that explores the different mechanisms that each engine uses to read GeoParquet files and make use of these optimisations.
At the moment, I’m still a bit puzzled as to why DuckDB doesn’t row group prune when using ST_Intersect but does so when you manually specify bbox ranges - Sedona row group prunes perfectly fine when using ST_Intersects.
Code snippet from my local tests:
def _run_duck(path: str, bbox: list[int], force_spatial: bool = False) -> DuckResult:
"""
Run DuckDB test.
This test 2 methods, one with manually bbox ranges, and another with ST_Intersect.
"""
xmin, ymin, xmax, ymax = bbox
if _has_bbox_struct(path) and not force_spatial:
predicate = (
f"bbox.xmin <= {xmax} AND bbox.xmax >= {xmin} "
f"AND bbox.ymin <= {ymax} AND bbox.ymax >= {ymin}"
)
strategy = "bbox struct"
else:
predicate = (
f"ST_Intersects(geometry, ST_MakeEnvelope({xmin}, {ymin}, {xmax}, {ymax}))"
)
strategy = "ST_Intersects"
sql = f"SELECT * FROM read_parquet('{path}') WHERE {predicate}"
explain_text = "\n".join(
row[1] for row in duck.sql(f"EXPLAIN ANALYZE {sql}").fetchall()
)
scan_rows = _duck_scan_rows(path, bbox, force_spatial)
t0 = time.perf_counter()
output_rows = len(duck.sql(sql).fetchall())
return DuckResult(
strategy=strategy,
filter_text=_extract_plan_filters(explain_text),
scanned_rows=scan_rows,
elapsed=time.perf_counter() - t0,
output_rows=output_rows,
)
My assumption is that DuckDB doesn’t use the covering metadata which points to the bbox struct column. - we’ll see.
It’s a great time to be working with Geospatial data!