
Benchmarking R-tree Variants for Distributed Spatial Workloads

Connor Jennings

Spatial indexing is foundational to any system that stores or queries geographic data. As organizations move spatial workloads to distributed cloud architectures, the question of which indexing strategy to use becomes non-trivial. Index structures optimized for single-node databases don't necessarily translate to distributed environments where data is partitioned across nodes.

Background

The R-tree family of spatial indices has been the dominant approach for decades. Variants include the original R-tree, R*-tree (which optimizes node splits), and STR-packed R-trees (which bulk-load data using Sort-Tile-Recursive ordering). Each makes different tradeoffs between insert performance, query performance, and storage overhead.
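To make the STR (Sort-Tile-Recursive) idea concrete, here is a minimal, stdlib-only Python sketch of the bulk-loading step: sort rectangles by x-center, cut into vertical slices, sort each slice by y-center, and cut into leaves. This is an illustrative simplification (leaf level only, no internal tree levels), not any particular library's implementation.

```python
import math

def str_pack(rects, leaf_size=4):
    """Bulk-load rectangles into R-tree leaves using Sort-Tile-Recursive.

    rects: list of (xmin, ymin, xmax, ymax) tuples.
    Returns a list of leaves; each leaf is (mbr, members).
    """
    if not rects:
        return []
    n = len(rects)
    num_leaves = math.ceil(n / leaf_size)
    num_slices = math.ceil(math.sqrt(num_leaves))

    # 1. Sort all rectangles by the x-coordinate of their center.
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)

    # 2. Partition into vertical slices holding num_slices * leaf_size rects.
    slice_size = num_slices * leaf_size
    leaves = []
    for i in range(0, n, slice_size):
        tile = by_x[i:i + slice_size]
        # 3. Within each slice, sort by y-center and cut into leaves.
        tile.sort(key=lambda r: (r[1] + r[3]) / 2)
        for j in range(0, len(tile), leaf_size):
            members = tile[j:j + leaf_size]
            mbr = (min(r[0] for r in members), min(r[1] for r in members),
                   max(r[2] for r in members), max(r[3] for r in members))
            leaves.append((mbr, members))
    return leaves
```

Because the data is sorted before grouping, spatially adjacent rectangles end up in the same leaf, which is what yields the tight bounding boxes the later findings attribute to STR packing.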

In a distributed setting, additional factors come into play:

  • Data partitioning strategy (spatial vs. hash-based)
  • Network latency for cross-partition queries
  • Rebalancing cost when data distribution is non-uniform
  • Consistency requirements during concurrent writes
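The partitioning tradeoff in particular can be sketched in a few lines. Below, a hypothetical grid-based (spatial) scheme places features by location, so a bounding-box query only contacts overlapping cells, while hash-based placement forces every spatial query to fan out to all partitions. Cell size and partition count are illustrative assumptions.

```python
def grid_partition(x, y, cell=10.0):
    # Spatial partitioning: a feature lands in the grid cell containing it,
    # so a bounding-box query only touches the cells it overlaps.
    return (int(x // cell), int(y // cell))

def hash_partition(feature_id, num_parts=8):
    # Hash partitioning: placement ignores location, so every spatial
    # query must be broadcast to all num_parts partitions.
    return hash(feature_id) % num_parts

def cells_for_bbox(xmin, ymin, xmax, ymax, cell=10.0):
    # The set of grid partitions a bounding-box query needs to contact.
    return {(cx, cy)
            for cx in range(int(xmin // cell), int(xmax // cell) + 1)
            for cy in range(int(ymin // cell), int(ymax // cell) + 1)}
```

The flip side, noted above, is rebalancing: grid cells over dense areas (cities) fill up while rural cells stay near-empty, whereas hash partitioning keeps load uniform for free.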

Methodology

We benchmarked three R-tree variants across four distributed architectures:

  • CockroachDB with PostGIS-compatible spatial extensions
  • YugabyteDB with spatial indexing
  • Apache Sedona on Spark clusters
  • DuckDB in a partitioned-file architecture over S3

Each system was loaded with OpenStreetMap building footprints for the continental United States (~130M polygons). We measured:

  • Point-in-polygon query latency (p50, p95, p99)
  • Spatial join throughput (buildings × parcels)
  • Bulk insert rate
  • Index build time
  • Storage overhead relative to raw data
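For the latency percentiles, a harness along these lines is sufficient; this is a generic sketch (the `query_fn` callable is a placeholder, not the actual benchmark code), using only the Python standard library.

```python
import statistics
import time

def measure_latency_percentiles(query_fn, queries):
    """Run each query, record wall-clock latency, report p50/p95/p99 in ms."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        query_fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    # statistics.quantiles with n=100 returns the 1st..99th percentile
    # cut points, so indices 49/94/98 are p50/p95/p99.
    pct = statistics.quantiles(samples, n=100)
    return {"p50": pct[49], "p95": pct[94], "p99": pct[98]}
```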

Key Findings

STR-packed indices consistently outperformed dynamic R-tree variants by 3-5x for read-heavy workloads. The bulk-loading approach produces tighter bounding boxes and more balanced trees, which translates directly to fewer node accesses per query.
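The node-access effect can be seen in miniature: if each intersecting leaf MBR costs one node access, tighter and less-overlapping leaves touch fewer nodes per query. The leaf layouts below are hypothetical, chosen only to illustrate the mechanism.

```python
def intersects(a, b):
    # Axis-aligned rectangle overlap test for (xmin, ymin, xmax, ymax).
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def leaves_touched(leaf_mbrs, query):
    # Each leaf whose MBR intersects the query is one node access.
    return sum(1 for mbr in leaf_mbrs if intersects(mbr, query))

# Packed leaves: disjoint tiles (what STR bulk-loading tends to produce).
packed = [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 2, 4), (2, 2, 4, 4)]
# Loose leaves: overlapping MBRs (what incremental inserts can degrade into).
loose = [(0, 0, 4, 2.5), (0, 1.5, 4, 4), (0, 0, 2.5, 4), (1.5, 0, 4, 4)]
```

For a small query rectangle, the packed layout is hit in one leaf while the overlapping layout forces multiple leaf reads — the same total data, but more node accesses.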

However, dynamic R*-trees showed advantages for write-heavy workloads where data arrives incrementally. The cost of periodic full reindexing with STR packing can be prohibitive for systems ingesting real-time sensor data.

The partitioned-file architecture (DuckDB over S3 with GeoParquet) delivered surprisingly competitive read performance at a fraction of the operational cost, suggesting that for analytical workloads, the traditional database approach may be over-engineered.
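Much of that competitiveness comes from partition pruning: a reader can skip any file whose stored bounding box misses the query before fetching a byte from S3. The sketch below uses hypothetical filenames and a plain dict standing in for the per-file bbox statistics a reader would pull from GeoParquet footers.

```python
def prune_partitions(partition_stats, query):
    """Return only the partition files whose bbox overlaps the query.

    partition_stats: {filename: (xmin, ymin, xmax, ymax)} — standing in
    for per-file bounding-box metadata read from GeoParquet footers.
    """
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
    return [f for f, bbox in partition_stats.items() if overlaps(bbox, query)]
```

With spatially partitioned files, a regional query typically reduces to scanning a handful of partitions, so the expensive part of a "database" — a long-lived serving tier — is simply not needed for this access pattern.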

Implications

For spatial data infrastructure serving primarily read workloads — which describes most web mapping, asset management, and analytical systems — a bulk-loaded STR index on partitioned GeoParquet files offers an excellent balance of performance, cost, and operational simplicity. Systems with mixed read-write workloads should consider hybrid approaches: R*-trees for the active write partition with periodic STR compaction.
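The hybrid shape suggested above can be sketched as follows. Here an unsorted in-memory buffer stands in for the dynamic write partition (a real system would use an R*-tree there), and compact() periodically repacks everything into STR-ordered leaves; thresholds and leaf sizes are illustrative.

```python
import math

class HybridSpatialIndex:
    """Sketch: buffered writes plus periodic STR compaction."""

    def __init__(self, leaf_size=64, compact_threshold=1000):
        self.leaf_size = leaf_size
        self.compact_threshold = compact_threshold
        self.buffer = []   # active write partition (unsorted stand-in)
        self.leaves = []   # (mbr, members) pairs from the last compaction

    def insert(self, rect):
        self.buffer.append(rect)
        if len(self.buffer) >= self.compact_threshold:
            self.compact()

    def compact(self):
        # Merge buffered rects with existing leaves, then STR-repack.
        rects = self.buffer + [r for _, ms in self.leaves for r in ms]
        self.buffer, self.leaves = [], []
        if not rects:
            return
        rects.sort(key=lambda r: (r[0] + r[2]) / 2)       # sort by x-center
        num_leaves = math.ceil(len(rects) / self.leaf_size)
        slice_size = math.ceil(math.sqrt(num_leaves)) * self.leaf_size
        for i in range(0, len(rects), slice_size):
            tile = sorted(rects[i:i + slice_size],
                          key=lambda r: (r[1] + r[3]) / 2)  # sort by y-center
            for j in range(0, len(tile), self.leaf_size):
                m = tile[j:j + self.leaf_size]
                mbr = (min(r[0] for r in m), min(r[1] for r in m),
                       max(r[2] for r in m), max(r[3] for r in m))
                self.leaves.append((mbr, m))

    def query(self, q):
        def hit(a, b):
            return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
        out = [r for r in self.buffer if hit(r, q)]       # scan active partition
        for mbr, members in self.leaves:
            if hit(mbr, q):                               # prune by leaf MBR
                out.extend(r for r in members if hit(r, q))
        return out
```

Queries stay correct throughout because they consult both the buffer and the packed leaves; compaction only trades the buffer scan for cheaper MBR-pruned reads.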