Range Aggregation is not working properly for float fields #77033

Closed
ofirEdi opened this issue Aug 30, 2021 · 8 comments · Fixed by #78344
Assignees
Labels
:Analytics/Aggregations Aggregations >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@ofirEdi

ofirEdi commented Aug 30, 2021

Elasticsearch version (bin/elasticsearch --version): 7.12.1

Plugins installed: []

JVM version (java -version): 1.8.0_292

OS version (uname -a if on a Unix-like system): Ubuntu 18.04.1

Description of the problem including expected versus actual behavior:
According to the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-aggregations-bucket-range-aggregation.html), the range aggregation creates buckets in which from is inclusive and to is exclusive. With float fields, Elasticsearch seems to do the opposite and places the document in the wrong bucket.

Steps to reproduce:

  1. Create an index with a float field:
PUT test1
{
  "settings": {
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "amount": {
        "type": "float"
      }
    }
  }
}
  2. Insert a doc:
PUT test1/_doc/1
{
  "amount": 0.04
}
  3. Run a range aggregation with a bucket boundary at 0.04:
POST test1/_search
{
  "size": 0,
  "aggs": {
    "testRange": {
      "range": {
        "field": "amount",
        "ranges": [
          {
            "from": 0.01,
            "to": 0.04
          },
          {
            "from": 0.04,
            "to": 0.06
          }
        ]
      }
    }
  }
}

The result I'm getting is that the doc falls into the 0.01-0.04 bucket and not the 0.04-0.06 bucket:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "testRange" : {
      "buckets" : [
        {
          "key" : "0.01-0.04",
          "from" : 0.01,
          "to" : 0.04,
          "doc_count" : 1
        },
        {
          "key" : "0.04-0.06",
          "from" : 0.04,
          "to" : 0.06,
          "doc_count" : 0
        }
      ]
    }
  }
}

I tested with integers and it seems to be fine, but for floats the aggregation does not behave as documented.

@ofirEdi ofirEdi added >bug needs:triage Requires assignment of a team area label labels Aug 30, 2021
@not-napoleon not-napoleon added :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Aug 30, 2021
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 30, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@not-napoleon
Member

Hi @ofirEdi ,

Thanks for submitting this. I ran your repro steps and it also fails for me. Feels like it might be a rounding bug. I'll look into it and see if I can track down a fix.

@ofirEdi
Author

ofirEdi commented Aug 31, 2021

Thank you @not-napoleon . I'll keep watching.

@ofirEdi
Author

ofirEdi commented Sep 14, 2021

Hi @not-napoleon, any update?

@not-napoleon
Member

Hi @ofirEdi, I actually haven't had a chance to start on this. I've had a few other big things come up, unfortunately. It is next on my list though. Thank you for your patience.

@ofirEdi
Author

ofirEdi commented Sep 14, 2021

Hi @not-napoleon, thank you for the update.

@not-napoleon not-napoleon self-assigned this Sep 20, 2021
@not-napoleon
Member

Okay, so I've done some digging, and the issue here is (basically) that we always parse the range end points as doubles, but the field value gets parsed (and stored) as a float. Then we do our comparison, and 0.04f < 0.04d == true. This happens because 0.04 isn't exactly representable as either a double or a float, and the two types cut off the binary approximation at different points (a float keeps 24 significant bits, a double 53), so the float value ends up slightly below the double one.
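
For readers who want to see the comparison described above outside of Elasticsearch, here is a minimal standalone Java sketch. The class and method names are made up for illustration and are not taken from the Elasticsearch code base. It only relies on standard Java behaviour: a float is widened exactly to double before being compared with a double.

```java
import java.math.BigDecimal;

public class FloatVsDouble {
    // Hypothetical helper: compares a float field value against a double bound.
    // Java promotes the float to double (exactly) before comparing.
    static boolean floatBelowDouble(float fieldValue, double bound) {
        return fieldValue < bound;
    }

    public static void main(String[] args) {
        float stored = 0.04f;  // what a float field would hold
        double bound = 0.04d;  // what a bound parsed as double would hold

        // BigDecimal(double) shows the exact binary value behind each literal.
        System.out.println("float  0.04 = " + new BigDecimal(stored));
        System.out.println("double 0.04 = " + new BigDecimal(bound));

        // Prints true: the float approximation is slightly below the double one.
        System.out.println(floatBelowDouble(stored, bound));
    }
}
```

Values that are exactly representable in both types, such as 0.5, do not show the effect, which is consistent with the integer test in the original report behaving as expected.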

I'm looking into a solution, but it's a little tricky. I think the right thing to do is parse the range end points with the type of the field being aggregated on, if we can. Problem is we don't know the field type until after parse time. Anyway, I'm working on it. Hopefully will have something soon.

@ofirEdi
Author

ofirEdi commented Sep 29, 2021

@not-napoleon I will keep following the pull request. Thanks for taking the time to address this issue.

not-napoleon added a commit that referenced this issue Oct 11, 2021
This fixes a bug where the range aggregation always treats the range end points as doubles, even if the field value doesn't have enough resolution to fill a double. This was creating issues where the range would have a "more precise" approximation of an unrepresentable number than the field, causing the value to fall in the wrong bucket.

Note 1: This does not resolve the case where we have a long value that is not precisely representable as a double. Since the wire format sends the range bounds as doubles, by the time we get to where this fix is operating, we've already lost the precision to act on a big long. Fixing that problem will require a wire format change, and I'm not convinced it's worth it right now.
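
As an illustration of the precision loss mentioned in Note 1, the following standalone Java sketch shows a long value that does not survive a round trip through double. The class and method names are hypothetical and unrelated to the actual wire format code. The example value 2^53 + 1 is chosen because a double has a 53-bit significand.

```java
public class LongThroughDouble {
    // Hypothetical helper: true if the long can be converted to double and back unchanged.
    static boolean survivesDouble(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        long small = 1L << 53;      // 2^53, exactly representable as a double
        long big = (1L << 53) + 1;  // 2^53 + 1, not representable as a double

        System.out.println(survivesDouble(small)); // true
        System.out.println(survivesDouble(big));   // false: rounded to 2^53
    }
}
```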

Note 2: This is probably still broken for ScaledFloats, since they don't implement NumberFieldType.

Resolves #77033
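
The idea in the commit message above can be sketched in a few lines of standalone Java. This is only an illustration under stated assumptions, not the code from the actual change: narrowToFloat and inBucket are hypothetical helpers, and the bucket check simply follows the documented rule that from is inclusive and to is exclusive.

```java
public class RangeBoundNarrowing {
    // Hypothetical helper: round a double bound to the nearest float, then widen it
    // back, so the bound carries the same approximation as a float field value.
    static double narrowToFloat(double bound) {
        return (double) (float) bound;
    }

    // Documented bucket rule: from is inclusive, to is exclusive.
    static boolean inBucket(double value, double from, double to) {
        return value >= from && value < to;
    }

    public static void main(String[] args) {
        double stored = 0.04f; // float field value, widened for comparison

        // Bounds kept as doubles: the doc lands in 0.01-0.04 (the reported bug).
        System.out.println(inBucket(stored, 0.01, 0.04)); // true

        // Bounds narrowed to float precision: the doc moves to 0.04-0.06.
        System.out.println(inBucket(stored, narrowToFloat(0.01), narrowToFloat(0.04))); // false
        System.out.println(inBucket(stored, narrowToFloat(0.04), narrowToFloat(0.06))); // true
    }
}
```

Narrowing the bounds rather than widening the field value keeps both sides of the comparison at the field's own precision, which matches the approach the commit title describes ("Scale doubles to floats when necessary to match the field").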
not-napoleon added a commit to not-napoleon/elasticsearch that referenced this issue Oct 11, 2021
)

Resolves elastic#77033
not-napoleon added a commit that referenced this issue Oct 11, 2021
…) (#78932)

* Scale doubles to floats when necessary to match the field (#78344)

Resolves #77033

3 participants