This proposal focuses on the memory layout of a new logical Geography type for DuckDB.
- The Geography type assumes angular coordinates (latitude and longitude) in the WGS84 coordinate system. However, the memory layout is equally applicable to planar x, y coordinates.
- Latitude and longitude are stored as two separate columns to facilitate compression by exploiting the spatial locality of adjacent data points: consecutive GPS coordinates are usually close to each other, as are the points of a polygon. Delta encoding is usually an effective choice for compressing coordinates.
- All the coordinates for a given vector are stored as contiguous arrays of doubles. Each Geography record appends its coordinates to the latitude and longitude arrays and saves its offset and lengths.
- Supported OGC Simple Feature Access types:
  - Point
  - LineString
  - Polygon
  - MultiPoint
  - MultiLineString
  - MultiPolygon
  - GeometryCollection
- Optional S2CellId-based index. This would allow some spatial predicates to be pushed down to the extended Scan operator, which could then skip records that don't satisfy the criteria.
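
The delta-encoding idea mentioned above can be illustrated with a minimal sketch. It assumes coordinates are first quantized to 1e-7 degrees so that differences become small integers; the proposal does not specify a precision or an encoder, and the names `kScale`, `DeltaEncode` and `DeltaDecode` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: quantize to 1e-7 degrees before delta encoding.
// The proposal itself does not fix a precision.
constexpr double kScale = 1e7;

// Encode a coordinate column as the first quantized value followed by the
// differences between consecutive quantized values.
std::vector<int64_t> DeltaEncode(const std::vector<double> &column) {
    std::vector<int64_t> deltas;
    deltas.reserve(column.size());
    int64_t previous = 0;
    for (double value : column) {
        int64_t quantized = std::llround(value * kScale);
        deltas.push_back(quantized - previous);
        previous = quantized;
    }
    return deltas;
}

// Invert DeltaEncode by accumulating the running sum of the deltas.
std::vector<double> DeltaDecode(const std::vector<int64_t> &deltas) {
    std::vector<double> column;
    column.reserve(deltas.size());
    int64_t running = 0;
    for (int64_t delta : deltas) {
        running += delta;
        column.push_back(static_cast<double>(running) / kScale);
    }
    return column;
}
```

Because latitude and longitude live in separate columns, each column can be encoded independently; for nearby points the deltas after the first entry are small integers, which a later integer-compression stage can pack tightly.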
Logical View:
| Logical Record ID | latitude vector<double_t> | longitude vector<double_t> | type vector<uint8_t> enums | lines_len vector<size_t> "How many points?" | multi_len vector<size_t> "How many lines?" | coll_mpolly_len vector<size_t> "How many polygons?" | s2cellid vector<uint64_t> optional index |
|---|---|---|---|---|---|---|---|
| 0 | 52.5255 | 13.3463 | Point | 1 | | | 824687234 |
| 1 | 52.5255 | 13.3463 | LineString | 3 | | | 824687234 |
| | 52.5180 | 13.3496 | | | | | |
| | 52.5070 | 13.3496 | | | | | |
| 2 | 52.5255 | 13.3469 | Polygon | 4 | 2 | | 824687235 |
| | 52.5255 | 13.3467 | | | | | |
| | 52.5180 | 13.3468 | | | | | |
| | 52.5245 | 13.3496 | | | | | |
| | 52.5246 | 13.3496 | | 3 | | | |
| | 52.5983 | 13.2893 | | | | | |
| | 52.0245 | 13.9856 | | | | | |
Physical Memory representation in DuckDB:
| Vector<Struct> | ... | ... | ... | ... | ... | ... |
|---|---|---|---|---|---|---|
| {'latitude': List<double_t>} | {'longitude': List<double_t>} | {'type': Vector<uint8_t>} enums | {'lines_len': Vector<idx_t>} "How many points?" | {'multi_len': Vector<idx_t>} "How many lines?" | {'coll_mpolly_len': Vector<idx_t>} "How many polygons?" | {'s2cellid': Vector<uint64_t>} optional index |
| 52.5255 | 13.3463 | Point | 1 | | | 824687234 |
| 52.5255 | 13.3463 | LineString | 3 | | | 824687234 |
| 52.5180 | 13.3496 | | | | | |
| 52.5070 | 13.3496 | | | | | |
| 52.5255 | 13.3469 | Polygon | 4 | 2 | | 824687235 |
| 52.5255 | 13.3467 | | | | | |
| 52.5180 | 13.3468 | | | | | |
| 52.5245 | 13.3496 | | | | | |
| 52.5246 | 13.3496 | | 3 | | | |
| 52.5983 | 13.2893 | | | | | |
| 52.0245 | 13.9856 | | | | | |
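
One way to read the length columns back is to walk them with two cursors. The sketch below is only an interpretation inferred from the example rows: it assumes a Point or LineString consumes one `lines_len` entry, while a Polygon consumes one `multi_len` entry (its ring count) followed by that many `lines_len` entries. `PointsPerRecord` is a hypothetical helper, and plain `std::vector` stands in for DuckDB's vector types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class GeographyType : uint8_t { POINT, LINESTRING, POLYGON };

// Returns how many coordinates each record owns.
// Assumption (inferred from the example, not stated by the proposal):
//   POINT, LINESTRING -> one lines_len entry
//   POLYGON           -> one multi_len entry (ring count), then that many
//                        lines_len entries
std::vector<size_t> PointsPerRecord(const std::vector<GeographyType> &type,
                                    const std::vector<size_t> &lines_len,
                                    const std::vector<size_t> &multi_len) {
    std::vector<size_t> result;
    size_t line_cursor = 0;
    size_t multi_cursor = 0;
    for (GeographyType t : type) {
        size_t line_count = 1;
        if (t == GeographyType::POLYGON) {
            line_count = multi_len.at(multi_cursor++);
        }
        size_t points = 0;
        for (size_t i = 0; i < line_count; i++) {
            points += lines_len.at(line_cursor++);  // throws if columns disagree
        }
        result.push_back(points);
    }
    return result;
}
```

For the three example records, the resulting counts should match the lengths of the per-record latitude and longitude lists, which gives a cheap consistency check on the columns.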
Data Structures view:
Vector<Struct>: {
'latitude': List<double_t>[[52.5255], [52.5255, 52.5180, 52.5070], [52.5255, 52.5255, 52.5180, 52.5245, 52.5246, 52.5983, 52.0245]],
'longitude': List<double_t>[[13.3463], [13.3463, 13.3496, 13.3496], [13.3469, 13.3467, 13.3468, 13.3496, 13.3496, 13.2893, 13.9856]],
'type': Vector<uint8_t>[GeographyType::POINT, GeographyType::LINESTRING, GeographyType::POLYGON],
'lines_len': Vector<idx_t>[1,3,4,3],
'multi_len': Vector<idx_t>[2],
'coll_mpolly_len': Vector<idx_t>[],
's2cellid': Vector<uint64_t>[824687234, 824687234, 824687235]
}
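
As an illustration of how records might be appended into this layout, here is a minimal sketch that rebuilds the example above. It is not the reference implementation: `GeographyColumns` and its methods are hypothetical, `std::vector` replaces DuckDB's `Vector` and `List` types, only three geometry types are covered, and the unused `coll_mpolly_len` column is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum class GeographyType : uint8_t { POINT, LINESTRING, POLYGON };

using Coord = std::pair<double, double>;  // (latitude, longitude)
using Line = std::vector<Coord>;

// Hypothetical stand-in for the struct-of-columns layout.
struct GeographyColumns {
    std::vector<std::vector<double>> latitude;   // one list per record
    std::vector<std::vector<double>> longitude;  // one list per record
    std::vector<GeographyType> type;             // one entry per record
    std::vector<size_t> lines_len;               // points per line or ring
    std::vector<size_t> multi_len;               // rings per polygon
    std::vector<uint64_t> s2cellid;              // optional index, one per record

    void AppendPoint(const Coord &p, uint64_t cell) {
        StartRecord(GeographyType::POINT, cell);
        AppendLine(Line{p});
    }
    void AppendLineString(const Line &line, uint64_t cell) {
        StartRecord(GeographyType::LINESTRING, cell);
        AppendLine(line);
    }
    void AppendPolygon(const std::vector<Line> &rings, uint64_t cell) {
        StartRecord(GeographyType::POLYGON, cell);
        multi_len.push_back(rings.size());
        for (const Line &ring : rings) {
            AppendLine(ring);
        }
    }

private:
    // Open a new record: empty coordinate lists plus type and index entry.
    void StartRecord(GeographyType t, uint64_t cell) {
        latitude.emplace_back();
        longitude.emplace_back();
        type.push_back(t);
        s2cellid.push_back(cell);
    }
    // Append one line's coordinates to the current record and note its length.
    void AppendLine(const Line &line) {
        for (const Coord &c : line) {
            latitude.back().push_back(c.first);
            longitude.back().push_back(c.second);
        }
        lines_len.push_back(line.size());
    }
};

// Rebuilds the three example records (values copied from the tables above).
GeographyColumns BuildExample() {
    GeographyColumns cols;
    cols.AppendPoint({52.5255, 13.3463}, 824687234);
    cols.AppendLineString(
        {{52.5255, 13.3463}, {52.5180, 13.3496}, {52.5070, 13.3496}}, 824687234);
    Line ring_a = {{52.5255, 13.3469}, {52.5255, 13.3467},
                   {52.5180, 13.3468}, {52.5245, 13.3496}};
    Line ring_b = {{52.5246, 13.3496}, {52.5983, 13.2893}, {52.0245, 13.9856}};
    cols.AppendPolygon({ring_a, ring_b}, 824687235);
    return cols;
}
```

Calling `BuildExample()` should yield `lines_len` of `[1, 3, 4, 3]` and `multi_len` of `[2]`, matching the view above. Keeping the length columns flat rather than nested per record is what lets every geometry type share the same set of columns.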
Reference implementation in pure C++: geography_type
Implementation using DuckDB's physical Vector type: TBD