
ESQL Support loading points from source into WKB blocks #103698

Merged
merged 50 commits into from
Jan 8, 2024

Conversation

craigtaverner (Contributor) commented Dec 22, 2023

The work in #102177 added support for geo_point and cartesian_point to ES|QL. However, that work relied on ES|QL's existing support for doc-values, and the LongBlock in particular, to handle all points inside the compute engine. This made the implementation relatively easy, but came at the cost of displaying points quantized to the encoded grid used in the Lucene point index. For most use cases this is fine, as the error is very small. In particular, it allowed Kibana to start work on displaying geo_point data in the Kibana map.

However, for the GA version of spatial support in ES|QL, we want users to see exactly the same precision with which they ingested the data. To achieve this we needed to enable reading points from source. The design is to use WKB encoding within a BytesRefBlock to store all kinds of Geometries, so this will also be of use for geo_shape and shape going forward.
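As a rough sketch of what WKB storage means here (illustrative only, not the Elasticsearch implementation): a 2D point serializes to 21 bytes — a one-byte byte-order marker, a uint32 geometry-type code, and two IEEE-754 doubles — which is what would land in each BytesRef of the block:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of WKB point encoding. Class and method names are
// illustrative; the real code uses the WellKnownBinary utilities.
public class WkbPointSketch {
    static final int WKB_POINT_TYPE = 1; // WKB geometry type code for Point

    static byte[] encodePoint(double x, double y) {
        ByteBuffer buf = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);          // byte-order marker: 1 = little-endian
        buf.putInt(WKB_POINT_TYPE); // geometry type
        buf.putDouble(x);
        buf.putDouble(y);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] wkb = encodePoint(12.345678, -45.678901);
        // A WKB point is always 21 bytes: 1 + 4 + 8 + 8.
        System.out.println(wkb.length); // 21
    }
}
```

Because the doubles are stored verbatim, no precision is lost — unlike the grid-quantized long encoding used by doc-values.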

The initial work started in #102880, but here we are replacing the use of PointBlock with BytesRefBlock and evaluating the difference.

  • Support reading from source in the GeoPointFieldMapper and PointFieldMapper blockLoader method
  • Support new WKB encoder for BytesRefBlock
  • Check TransportVersion for plan serialization of literal points (encoded longs in 8.12 and WKB in 8.13)
  • Determine if we need to support TransportVersion for compatibility with 8.12 for the Blocks
  • Re-think the use of a common parent class for GeoPoint and CartesianPoint (If we create a Coords2D we can revert SpatialPoint to an interface)
  • Improve support for points with ComparisonMapper for equality but not inequality
  • Remove interim support of PointBlock (inherited from ESQL: Reading points from source #102880)
  • Get GeoPointFieldMapperTests working for rowStrideReader
  • The WKBTopNEncoder needs a binary format that uses neither the zero byte nor the UTF-8 continuation byte. This rules out raw WKB, so we use WKT, which is inefficient. Since spatial types don't require sorting, we could switch to length-prefixed WKB.
  • Review all TODO entries to see which can/should be fixed in this PR
  • Support planner control over whether we use doc-values (column at a time) or source (row-stride)
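The TransportVersion gating in the checklist above could be sketched roughly like this (hypothetical version constants and method names — not the actual ESQL StreamOutput API): 8.12 peers receive the point literal as a grid-encoded long, newer peers receive length-prefixed WKB bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of version-gated plan serialization for point literals.
public class VersionGatedWrite {
    static final int V_8_12 = 8_12_00;
    static final int V_8_13 = 8_13_00;

    static byte[] writePointLiteral(int transportVersion, long encodedLong, byte[] wkb)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (transportVersion >= V_8_13) {
            out.writeInt(wkb.length); // new wire format: length-prefixed WKB
            out.write(wkb);
        } else {
            out.writeLong(encodedLong); // old wire format: grid-encoded long
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wkb = new byte[21]; // a WKB point is 21 bytes
        System.out.println(writePointLiteral(V_8_12, 42L, wkb).length); // 8
        System.out.println(writePointLiteral(V_8_13, 42L, wkb).length); // 25
    }
}
```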

Update docs/changelog/102880.yaml

Make SpatialPoint a concrete class

Simplifies code in a few places, and is a step towards easier implementation of other features, especially CRS support and ESQL support.

Continued work on point block support

Spotless checks for generated code

Disable test for POINT in a few complex cases

* MultivalueDedupeTests seems to order POINTs incorrectly
* TopNOperatorTests might need to exclude POINT always unless we want to support POINT in TopN results
* BlockHashRandomizedTests has failures on POINT, perhaps an issue with different hashcode calculations?

Added missing generated code for building point types

Get CsvTests working with more support for PointBlock

Fixed ToString and ToLong tests after change of point from encoded

Added missing new generated files

Updated changelog to be ESQL specific.

A hint as to where to tie into the planner for doc-values vs source

Small refinement to PointArray/Vector memory estimates

Small refinement to PointArray/Vector memory estimates

Small refinement to PointArray/Vector memory estimates

Reorganize BlockFactory to improve symmetry between Double and Point

Support read/write Point in StreamInput/Output

Fix some more tests

Fixed asymmetry between Geo and Cartesian field types

This was expressed as test failures for CartesianPoint since it was still reading from doc-values instead of from source, while the rest of the stack expected source.

Disable failing field type tests for now

Currently reading from source for GeoPointFieldMapper is not properly supported in the tests

Fixed failing tests with unsupported types and precision change

Fix tests that still used longs for spatial points

Added source values to spatial types in textFormatting tests

Fixed EsqlQueryResponse to map point back to geo_point after deserialization

Fixed binary comparison tests for spatial types

In particular, we do not support inequality operators on spatial types.
For example, point < point is not a meaningful expression.

Fixed mixed-cluster tests for csv-tests and spatial precision

Fixed mixed-cluster tests for esql types and point precision

More tests and use warning for parse error

Remove SpatialPoint from StreamInput/Output

Added fromStats() for spatial block values from doc-values

Serialize GeoPoint and SpatialPoint in ESQL plan

Retain point types in plan serialization

Fixup after rebase

* Removed support for unsigned_long from ToGeoPoint and ToCartesianPoint, since the mapping from unsigned_long to long allows for invalid values
* Support new X-BigArrayBlock introduced by Nhat

Fix test where there can be duplicated hashcodes

When cartesian decoding fails, we should test and mimic that failure
This commit does not yet remove support for PointBlock, since we are investigating how to have multiple block types backing the same geometries.
@craigtaverner craigtaverner added >enhancement :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:QL (Deprecated) Meta label for query languages team :Analytics/ES|QL AKA ESQL :Analytics/Compute Engine Analytics in ES|QL labels Dec 22, 2023
@elasticsearchmachine (Collaborator)

Hi @craigtaverner, I've created a changelog YAML for you.

@@ -39,31 +38,31 @@ public GeoPoint() {}
* @param value String to create the point from
*/
public GeoPoint(String value) {
this();
Member:

Might want to add a comment here. We don't see a whole lot of this();.

@Override
public BlockLoader blockLoader(BlockLoaderContext blContext) {
// TODO: If we have doc-values we have to use them, due to BlockSourceReader.columnAtATimeReader() returning null
if (blContext.forStats() && hasDocValues()) {
Member:

I wonder if we should call forStats something like transformed.

Contributor Author:

There are many options for this. I was looking at it from two perspectives:

  • From the field type, this relates to whether or not to use doc-values, but considering how most types favor doc-values and the actual doc-values and source values are identical, this is very specific to spatial types, so seems like the wrong perspective.
  • From the planner side, what is the intended use of this field, with there being two somewhat different uses: just pass through and include in the results, or consume as part of an aggregation. You originally suggested forDisplay, which is the complement of my preference for forStats.

I feel like transformed is an obfuscation of forStats, as it does not directly describe the intended use (transformed by stats?), nor the consequences exactly (transformed into doc-values?).

Are you suggesting that transformed might be a widening? Are there other cases where we might transform the value in such a way that we would like to use doc-values in the field type? Perhaps eval with functions that transform the value? I can imagine such cases, like EVAL proximity = ST_BUFFER(road, 10, 'join=mitre mitre_limit=1.0') which produces a new geometry and as such can get a performance benefit by not reading from source? That is definitely worth thinking about!

Contributor Author:

Lots of interesting ideas here, but I think we should leave this for the next PR where we bring back doc-values in the planner. For now a simple boolean on/off flag is sufficient for me to get the tests unmuted.


@Override
public BlockLoader blockLoader(BlockLoaderContext blContext) {
// TODO: If we have doc-values we have to use them, due to BlockSourceReader.columnAtATimeReader() returning null
Member:

I don't understand this one. It's generally ok for columnAtATimeReader to return null - that's just a signal that you need to read row at a time. Which is how you read from stored fields.

Contributor Author:

I removed everything but the read-from-source mode now. I'll bring back the doc-values in the next PR.

Geometry geometry = WellKnownText.fromWKT(GeometryValidator.NOOP, false, wkt);
if (geometry instanceof Point point) {
// TODO: perhaps we should not create points for later GC here, and pass in primitives only?
((BlockLoader.PointBuilder) builder).appendPoint(new SpatialPoint(point.getX(), point.getY()));
Member:

Indeed, we should try not to create any more garbage than we have to here. The _source loading is going to allocate - it has to do so to get at _source, but we should avoid it when we can.

Contributor Author:

We've simplified this now, by generating WKB in the value fetcher and not converting types.

} else {
throw new IllegalArgumentException("Cannot convert geometry into point: " + geometry.type());
}
} catch (Exception e) {
Member:

Probably best to catch precise exceptions here.

Contributor Author:

Done. It was two different exceptions, so I'll use catch (IOException | ParseException e)

var longProperties = prop("Long", "long", "LONG", "Long.BYTES", "LongArray")
var doubleProperties = prop("Double", "double", "DOUBLE", "Double.BYTES", "DoubleArray")
var bytesRefProperties = prop("BytesRef", "BytesRef", "BYTES_REF", "org.apache.lucene.util.RamUsageEstimator.NUM_BYTES_OBJECT_REF", "")
var pointProperties = prop("Point", "SpatialPoint", "POINT", "16", "ObjectArray<SpatialPoint>")
Member:

Ah! Yeah, that probably needs to be double[len * 2] or something.

Contributor Author:

Deleted the entire PointBlock, so this is removed.

* Since WKB can contain bytes with zero value, which are used as terminator bytes, we need to encode differently.
* Our initial implementation is to re-write to WKT and encode with UTF8TopNEncoder.
* This is likely very inefficient.
* We cannot use the UTF8TopNEncoder as is, because it removes the continuation byte, which could be a valid value in WKB.
Member:

Ooof. Ouch.

Contributor Author:

I was thinking a nicer option would be to base64 encode the binary. Might be more space efficient than converting WKB to WKT and back.
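The base64 idea can be sketched with the standard library (illustrative only — the PR did not end up adopting this approach): the encoded output is plain ASCII, so it contains neither zero bytes nor UTF-8 continuation bytes, and it round-trips the raw WKB.

```java
import java.util.Arrays;
import java.util.Base64;

// Sketch: base64-encode raw WKB so it survives encoders that reserve the
// zero byte and continuation bytes. Costs ~33% size overhead.
public class Base64WkbSketch {
    public static void main(String[] args) {
        byte[] wkb = {1, 1, 0, 0, 0}; // raw WKB freely contains zero bytes
        String ascii = Base64.getEncoder().encodeToString(wkb);
        byte[] back = Base64.getDecoder().decode(ascii);
        System.out.println(Arrays.equals(back, wkb)); // true: lossless round-trip
    }
}
```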

iverase (Contributor) commented Jan 3, 2024:

That's terrible that we cannot support generic binary data because we are giving special meaning to some bytes.

Contributor:

OK, I'm not sure why you'd want to use the UTF-8 encoder when the data is not UTF-8.

I think geometries are not sortable, so this can be implemented as non-sortable, and we can encode/decode it using a length prefix (storing how many bytes are written / need to be read).

Member:

++. We can just length-prefix it.

If we need to sort the geometries one day we'd come up with an encoding.
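The length-prefix suggestion could look roughly like this (hypothetical class, not the actual TopNEncoder API): write the payload length first, then the raw bytes. Zero bytes in the payload become harmless because the decoder reads exactly that many bytes instead of scanning for a terminator.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of a non-sortable, length-prefixed bytes encoding:
// a 4-byte length followed by the raw WKB payload.
public class LengthPrefixedBytes {
    static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    static byte[] decode(byte[] encoded, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(encoded, offset, encoded.length - offset);
        int len = buf.getInt();
        byte[] payload = new byte[len];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] wkb = {1, 1, 0, 0, 0, 64, 94, 0, 0}; // contains zero bytes
        byte[] enc = encode(wkb);
        System.out.println(Arrays.equals(decode(enc, 0), wkb)); // true
    }
}
```

The resulting byte stream is not order-preserving under lexicographic comparison, which is exactly why it only works for non-sortable types.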

Member:

We probably should just call this UnsortableVariableLengthBytesEncoder or something.

public long wkbAsLong(BytesRef wkb) {
Geometry geometry = WellKnownBinary.fromWKB(GeometryValidator.NOOP, false, wkb.bytes, wkb.offset, wkb.length);
if (geometry instanceof Point point) {
return pointAsLong(point.getX(), point.getY());
Member:

This is all pretty allocate-y, but it's the tools we have.
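One less allocate-y option (a hypothetical sketch, not what the PR does) is to read the point's coordinates straight out of the WKB bytes, skipping the intermediate Geometry object entirely. This assumes the 21-byte point layout: byte-order marker, uint32 type code, x, y.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical allocation-light reader for a single WKB point.
public class WkbPointReader {
    static double[] readXY(byte[] wkb, int offset) {
        boolean littleEndian = wkb[offset] == 1; // 1 = little-endian marker
        ByteBuffer buf = ByteBuffer.wrap(wkb, offset + 1, 20)
            .order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int type = buf.getInt();
        if (type != 1) { // 1 = Point
            throw new IllegalArgumentException("not a WKB point: type " + type);
        }
        return new double[] { buf.getDouble(), buf.getDouble() };
    }

    public static void main(String[] args) {
        byte[] wkb = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN)
            .put((byte) 1).putInt(1).putDouble(3.5).putDouble(-7.25).array();
        double[] xy = readXY(wkb, 0);
        System.out.println(xy[0] + " " + xy[1]);
    }
}
```

The double[] return still allocates; a real hot-path version would write the two doubles directly into the target block builder.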

import org.elasticsearch.common.geo.GeoPoint;
import org.elasticsearch.common.geo.SpatialPoint;
import org.elasticsearch.geometry.Geometry;
import org.elasticsearch.geometry.Point;
import org.elasticsearch.geometry.utils.GeometryValidator;
import org.elasticsearch.geometry.utils.WellKnownBinary;
Member:

I think this enum here is kind of at the core of the implementation - we "read" the wkb using this. I'd prefer to have something that can manage to allocate less, but that can come with time.

craigtaverner (Contributor Author) commented Dec 28, 2023:

Yeah, this enum is collecting some cruft as we refactor the code, but I think we'll have the opportunity to clean up a bit here once we have geo_shape completed and know the full scope of what is really needed. The main pain is that doc-values use such different encodings across the four main types... an unnecessary hassle, but we cannot escape it because Lucene enforces it.

@craigtaverner craigtaverner force-pushed the esql_points_from_source_wkb branch from 8aefa6b to 3b0c89d Compare December 28, 2023 17:05
@craigtaverner craigtaverner marked this pull request as ready for review January 2, 2024 16:33
@craigtaverner craigtaverner requested a review from a team as a code owner January 2, 2024 16:33
@craigtaverner craigtaverner requested a review from iverase January 2, 2024 16:34
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-ql (Team:QL)

Based on Nik's support for not using synthetic source, but also using WKT instead of WKB for test assertions. This is partly because WKT is easier to debug, but also because the test code base64-encodes expected values while the production code does not; switching to WKT avoids that test-code pitfall.
if (value instanceof List<?> list) {
return list.stream().map(v -> mapFromLiteralValue(out, dataType, v)).toList();
}
if (value instanceof BytesRef wkb) {
Contributor:

We should throw an exception if the value is not a BytesRef? I would either cast it directly so we throw a ClassCastException, or add an else statement. I think I prefer the first, as the error will be clear.

Contributor Author:

Done.

* This only makes sense during the pre-GA version of ESQL. When we get near GA we want TransportVersion support.
* TODO: Implement TransportVersion checks before GA (eg. by adding to StreamInput/StreamOutput directly)
*/
private static Object mapToLiteralValue(DataType dataType, Object value) {
Contributor:

and this?

Contributor:

We need to check the version here as well, sorry.

iverase (Contributor) left a review:

LGTM

I think there is work to do on how the response is built, but that can be done in a follow-up PR.

spatial types with different precision in 8.12.x:
- skip:
version: " - 8.11.99, 8.13.0 - "
reason: "Elasticsearch 8.12 supported geo_point and cartesian_point with doc-values only and precision differences"
Contributor:

should be version: " - 8.11.99, 8.12.99 - "?

Contributor Author:

I believe the range is inclusive, not exclusive, so it is correct as it is. We skip everything earlier than 8.12.0 and everything from 8.13.0 and later. Basically this is the same as saying "test only 8.12.*".

However, looking at #103947, it seems there is interest in removing all tests that try to test versions other than the current one. I suspect we might be asked to remove this test entirely.

We had this muted before, then fixed it, and now muted it again after elastic#103632 (ESQL: Check field exists before load from _source)
Since we might start testing ESQL in serverless any day now, this is a better option, as the previous approach was more stateful (relied on 8.12.x vs. 8.13.x).
The code in main was using an optimization that is not supported by GeoPointFieldMapper, but could be done as a future improvement.
The nullValue is injected into source fetching in such a way that it is expected to be in source format, so we need to convert it back to a source format. We picked WKT because it is clear and simple in the debugger, but GeoJSON and an object map of lat/lon (for geo_point only) worked too. Curiously, a double[]{x,y} did not work, even though it is a valid source format. We did not investigate why.
iverase (Contributor) commented Jan 8, 2024:

@elasticmachine update branch

@craigtaverner craigtaverner merged commit 978082b into elastic:main Jan 8, 2024
15 checks passed
nik9000 (Member) commented Jan 8, 2024:

party

Labels: :Analytics/Compute Engine, :Analytics/ES|QL, :Analytics/Geo, >enhancement, Team:Analytics, v8.13.0

6 participants