Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Add option to drop null fields #102428

Merged
merged 17 commits into from
Jan 17, 2024

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Nov 21, 2023

This adds an option to drop columns that are entirely null from the results. Is this something we want?

Copy link
Member Author

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option slows things down, which is a bit wild to me. I suppose it's because we have so many pages. I wonder if we can grow the pages....

@costin
Copy link
Member

costin commented Nov 28, 2023

I'd rather look into adding a dedicated command for this type of scenario: e.g. drop column_x if null

@nik9000
Copy link
Member Author

nik9000 commented Nov 28, 2023

Like DROP * IF NULL?

@nik9000
Copy link
Member Author

nik9000 commented Nov 29, 2023

I ran some performance tests and the service time went up slightly (bad), but request rate went way up (good). I think having some way for us to remove null columns is probably good.

@nik9000 nik9000 added >enhancement ES|QL-ui Impacts ES|QL UI labels Jan 9, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@nik9000
Copy link
Member Author

nik9000 commented Jan 9, 2024

@costin and I've talked a bit more about this and it seems like we do indeed just want this as a url parameter.

@stratoula, this is the option I talked to you about earlier. I'm adding a few more tests and will remove draft soon.

@nik9000 nik9000 marked this pull request as ready for review January 9, 2024 16:17
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 9, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@nik9000
Copy link
Member Author

nik9000 commented Jan 9, 2024

This doesn't work with async. I'll have a look at that too.

@@ -43,10 +43,6 @@ public SingletonOrdinalsBuilder appendOrd(int value) {
return this;
}

int[] ords() {
return ords;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't used and I bumped into it while adding the builder tests.

@nik9000
Copy link
Member Author

nik9000 commented Jan 9, 2024

I'll have a look at supporting this parameter in the async api after this PR.

@nik9000
Copy link
Member Author

nik9000 commented Jan 10, 2024

@stratoula, could you give this one a shot on your side and see if it's what you are looking for?

Comment on lines 400 to 403
"null_columns":[{"name":"all_null","type":"integer"}],""" + """
"columns":[{"name":"foo","type":"integer"}],""" + """
"values":[[40],[80]]}""")
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit - we might want to preserve the order of the columns even when removing null columns.
One way to do that is to list the columns as in the regular response but indicate it is null. we can then either skip or add a null in place:

"columns":[{"name":"foo","type":"integer"}, {"name":"all_null", "type":"integer", "null" : true}]
"values":[[40],[80]]

or

"values":[[40, null],[80, null]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we wanna go there, since the task at hand is clear - "drop null columns", but a more generalise view of this is "constant_columns" - a column with a constant value, which in this case would be null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we wanna go there, since the task at hand is clear - "drop null columns", but a more generalise view of this is "constant_columns" - a column with a constant value, which in this case would be null.

Oh that's kind of neat! It's quite easy for us to detect constant columns too.

Small nit - we might want to preserve the order of the columns even when removing null columns.

I like the idea. Then I don't like the idea. Then I do. I dunno. I kind of like that, no matter what, the way to parse the columns is by looking at the columns header. You just one for one them. No need for checking. As it stands now if you send this parameter the null columns just "disappear" from the parsing logic. They are only sitting off to the side as metadata. Like, no need to have different parsing code to handle this case.

OTOH, Chris' quite neat proposal around constant columns would be much more compatible with keeping one list of columns. Just sometimes having a constant. I suppose @stratoula could do either one. I wonder what our friends who work on the clients think. I'll ask!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One interesting thing about extracting constant values is that it's something that would likely also come as part of supporting some different output format like parquet or arrow flight or whatever. I presume those formats have specifications for, well, the entire response shape. And we'd do whatever makes sense for them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it clear, we're debating between:

{
  "null_columns": [
    {"name": "n1", "type": "long"},
    {"name": "n2", "type": "keyword"}
  ],
  "columns": [
    {"name": "r1", "type": "long"},
    {"name": "r2", "type": "long"},
  },
  "values": [
    [1, 2],
    [3, 4]
  ]
}

And

{
  "columns": [
    {"name": "n1", "type": "long", "constant_value": null},
    {"name": "r1", "type": "long"},
    {"name": "n2", "type": "keyword", "constant_value": null},
    {"name": "r2", "type": "long"},
  },
  "values": [
    [1, 2],
    [3, 4]
  ]
}

The first one is the "when you enable this option null columns just vanish if you don't update your parser". The second one is "you have to update your parser, but it's nice and compatible with the (later) proposal for constant columns".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, I've discussed this with Nik and come with the following conclusion:
columns and values should be kept in sync to avoid breaking existing parsing for consumers that set the option and don't care about the null values
for those that do, to maintain order, the list of all columns (null and not null) is returned as a separate section.
So something like:

{
  "all_columns": [
    {"name": "n1", "type": "long"},
    {"name": "r1", "type": "long"},
    {"name": "n2", "type": "keyword"}
    {"name": "r2", "type": "long"},
  ],
  "columns": [
    {"name": "r1", "type": "long"},
    {"name": "r2", "type": "long"}
  ],
  "values": [
    [1, 2],
    [3, 4]
  ]
}

This way clients that do care about the null columns or the available info have the data available.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in order to find the columns with nulls you need to find the difference between all_columns and columns? Do I get this correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The advantage of this way is you get everything back in the right order. I figured someone would want that. We talked about putting a always_null on the column in all_columns so you could do it without the set difference operator. I decided to do it if someone asked for it but not at first.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with that 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@stratoula
Copy link

I will check locally today, I will let you know Nik!

@stratoula
Copy link

@nik9000 yes this will work just fine! A PoC from me based on your PR elastic/kibana#174585

@nik9000
Copy link
Member Author

nik9000 commented Jan 10, 2024

Once we've settled on an output format I'll write some docs for this.

@nik9000
Copy link
Member Author

nik9000 commented Jan 12, 2024

Ok friends! I've pushed code changes to get the new format. No docs yet. I have some test failures to stomp this morning. Then I'll try and get back to the docs.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@stratoula
Copy link

I also updated my PoC with the latest format. All good 👍

@nik9000 nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 16, 2024
@nik9000
Copy link
Member Author

nik9000 commented Jan 17, 2024

run elasticsearch-ci/docs

@elasticsearchmachine elasticsearchmachine merged commit 919d282 into elastic:main Jan 17, 2024
15 checks passed
@nik9000 nik9000 deleted the esql_skip_null branch January 17, 2024 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement ES|QL-ui Impacts ES|QL UI Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants