Ad-Hoc Virtual Vector Database

The power of processing vector data in a spatial capable database cannot be understated. Spatial processing is fundamental to all things GIS. That enterprise database can also be a major expense and knowledge sink requiring dedicated resources you may not have or want.

For most situations there is an alternative using 2 core open source packages that are likely already in your environment, GDAL and Spatialite. This concept is not new, it's been around for years. The beautiful simplicity of this approach has been lost in the noise of the "me too" or "the next big thing" software development.

Simple Example
Let's visualize this with an example. First, we'll create our database. The important thing to note is our data files can be any spatial format, Shapefile, GeoPackage, CSV, etc. Vector virtual files have many options available. Here's database.vrt:

   <OGRVRTLayer name="fips">
   <OGRVRTLayer name="county">
Our database.vrt has 2 data files and we've chosen one layer from each data file. Layer fips is used verbatim as fips. Layer california_county is used as county. Let's do a simple query of our new database. Using a join we'll get the county name and a point geometry from "county" using the fips code found in "fips":
ogrinfo -ro -dialect sqlite -sql "
      ,st_pointonsurface(c.geometry) as geometry 
   from fips f 
   join county c on c.geoid = f.fips
" fips.vrt
...and the results:
INFO: Open of `fips.vrt'
      using driver `OGR_VRT' successful.

Layer name: SELECT
Geometry: Unknown (any)
Feature Count: 4
Extent: (-121.275690, 34.720256) - (-120.063668, 39.583854)
Layer SRS WKT:
    DATUM["North American Datum 1983",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
NAME: String (0.0)
  NAME (String) = Sierra
  POINT (-120.5106542 39.583854)

  NAME (String) = Sacramento
  POINT (-121.275689562383 38.377481)

  NAME (String) = Santa Barbara
  POINT (-120.063668092974 34.7202565)

  NAME (String) = Calaveras
  POINT (-120.657187511161 38.171181)
That's it! If you're working with data files that have hundreds or even thousands of rows, nothing more complex is required. You're often not dealing with data sets that have millions of rows. Every project, particularly small, static data, doesn't require that 'at scale' mentality and the massive resources that go with it. Why use a tractor when a shovel will do?

Organization & Large Data Files
Working with many or large data files DO require planning. Keeping your data homogeneous goes a long way. Here are some tips:

GeoPackage is an SQLite database in spatial format clothing. Being that it IS an actual database with each layer representing a table we can do database things, like indexing columns. Indexing is the difference between working with thousands of rows and millions of rows. You can create your SQL queries to take advantage of this across separate data files defined in your Virtual Database (database.vrt).

SoZIP or Search Optimized ZIP is a relatively new tool available to GDAL. It allows GeoPackage (et al) compression using ZIP compression with a minimal performance hit. Adding indexes to your .gpkg files will quickly bloat them. SoZIP will generally compress your indexed .gpkg file much smaller than the non-indexed original.

Optimize your geometry. Use the maximum possible vertices spacing, less is more!

Query Large Data Sets Example
The next example works with 3 larger layers. Using the above tips we've taken Domestic Names from Geonames data, indexed it and SoZip'd it. We did the same for National Hydro Data HUC8 basins and Tigerlines county data. With that, our virtual database (caBasins.vrt) looks like:

   <OGRVRTLayer name="geonames">
   <OGRVRTLayer name="basin">
   <OGRVRTLayer name="county">
Now our query. We want to know what HUC basins exist in California, how many Geonames lakes in each basin, what counties intersect each basin and a general point geometry for each basin. That gives us this query (caBasins.sql):
select as basin
   ,count(g.feature_name) as lakes
   ,group_concat(distinct as counties
   ,st_snaptogrid(st_pointonsurface(b.wkb_geometry), .0001) as wkb_geometry
from basin b
join geonames g on g.feature_class = 'Lake' and st_intersects(b.wkb_geometry, g.wkb_geometry)
join county c on c.statefp = '06' and st_intersects(c.wkb_geometry, b.wkb_geometry)
where 'CA' in (cast(b.states as text))
group by b.tnmid
...and lets create a cloud ready FlatGeobuf (.fgb) from our data:
ogr2ogr -dialect sqlite -sql @caBasins.sql -nln basinmeta -t_srs EPSG:4326 caBasins.fgb caBasins.vrt
Here's a snippet of our results:
ogrinfo -ro caBasins.fgb basinmeta

INFO: Open of `caBasins.fgb'
      using driver `FlatGeobuf' successful.

Layer name: basinmeta
Geometry: Unknown (any)
Feature Count: 93
Extent: (-124.244200, 33.216400) - (-115.871900, 41.575100)
Layer SRS WKT:
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
        AXIS["geodetic latitude (Lat)",north,
        AXIS["geodetic longitude (Lon)",east,
Data axis to CRS axis mapping: 2,1
basin: String (0.0)
lakes: Integer (0.0)
counties: String (0.0)
  basin (String) = San Felipe Creek
  lakes (Integer) = 15
  counties (String) = San Diego,Riverside,Imperial
  POINT (-116.3174 33.2524)

  basin (String) = Whitewater River
  lakes (Integer) = 6
  counties (String) = San Bernardino,Riverside
  POINT (-116.3515 33.8096)

  basin (String) = Southern Mojave
  lakes (Integer) = 66
  counties (String) = San Bernardino,Riverside,Imperial
  POINT (-115.8719 34.3227)

  basin (String) = Mojave
  lakes (Integer) = 21
  counties (String) = Los Angeles,San Bernardino,Kern
  POINT (-116.4249 34.8298)

  basin (String) = San Gabriel
  lakes (Integer) = 18
  counties (String) = Los Angeles,San Bernardino,Orange
  POINT (-117.9206 34.0177)

Final Thoughts
Over time you have probably accumulated a large number of vector data sets. Stored strategically as .gpkg and indexed properly, they are ready to be used. Create a .vrt and SQL query and off you go.

Having access to an enterprise spatial database such as PostgreSQL/PostGIS will always be highly desirable. In some cases you may not have the resources or need. I hope the above will help you process your vector data in the leanest possible manner. © 2005-2023 - W3C - @postholer