Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add extraction function support for Druid queries #4740

Merged
merged 7 commits into from
May 9, 2018

Conversation

jasnovak
Copy link
Contributor

@jasnovak jasnovak commented Apr 2, 2018

Adds support for using extraction functions in Druid. Apply the extraction function in the queries, as well as filters.

NOTE: this PR requires the latest pydruid branch on master. Specifically, this commit: druid-io/pydruid@875ae4c

@mistercrunch
Copy link
Member

Can you bump the pydruid dep to 0.4.2 as part of this PR?
https://github.com/apache/incubator-superset/blob/master/setup.py#L78

@codecov-io
Copy link

codecov-io commented Apr 4, 2018

Codecov Report

Merging #4740 into master will increase coverage by 0.12%.
The diff coverage is 75.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4740      +/-   ##
==========================================
+ Coverage   76.97%   77.09%   +0.12%     
==========================================
  Files          44       44              
  Lines        8537     8576      +39     
==========================================
+ Hits         6571     6612      +41     
+ Misses       1966     1964       -2
Impacted Files Coverage Δ
superset/connectors/druid/models.py 80.54% <75.43%> (-0.63%) ⬇️
superset/data/__init__.py 100% <0%> (ø) ⬆️
superset/cli.py 44.06% <0%> (ø) ⬆️
superset/dataframe.py 97.7% <0%> (ø) ⬆️
superset/views/utils.py 93.93% <0%> (ø) ⬆️
superset/extract_table_names.py 0% <0%> (ø) ⬆️
superset/models/core.py 86.54% <0%> (ø) ⬆️
superset/db_engines/hive.py 0% <0%> (ø) ⬆️
superset/models/sql_lab.py 98.59% <0%> (ø) ⬆️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa3da8c...5197a1e. Read the comment docs.

@mistercrunch
Copy link
Member

mistercrunch commented Apr 4, 2018

@fabianmenges do you all use extraction functions much? Do you think you can review/test this?

@fabianmenges
Copy link
Contributor

I don't think so, will verify.

@mistercrunch
Copy link
Member

@fabianmenges @john-bodley I'm identifying this PR as something that could break things in your environment but I can't really test it in my environment since we don't have a lot of charts on Druid at this point. Can you 2 review/greenlight this?

@fabianmenges
Copy link
Contributor

Ack, will do.

@john-bodley
Copy link
Member

@mistercrunch I don't believe we use extraction functions in any of the Druid clusters connected to Superset (per a quick code search).

@jeffreythewang
Copy link
Contributor

We currently don't use extraction functions either.

This may have already been answered, but I'm curious why you didn't use the pydruid DimensionSpec builder?

@jasnovak
Copy link
Contributor Author

jasnovak commented May 2, 2018

Hmm, I'm not sure where you are suggesting using DimensionSpec in this PR. Or are you asking that about the code in general? As far as the filters that this PR is creating, as I note in the code, I couldn't use the Dimension or Bound classes as they don't support extraction functions.

@jeffreythewang
Copy link
Contributor

jeffreythewang commented May 2, 2018

My bad, I thought you were attaching extractionFns to DimensionSpecs as well like in this way, but I realized you are building the Filters from DimensionSpecs that already have extractionFns defined.

I assume you have users (or someone else) define their own extractionFns in the datasource configuration?

@SpyderRivera
Copy link
Contributor

The filtering works when an extractionFn is added to the dimension spec of the lookup, but then the look up doesn't work as a column.

  "dimension": "country",
  "outputName": "country_group",
  "outputType": "STRING",
  "retainMissingValue": true,
  "type": "selector",
  "extractionFn": {
    "type": "lookup",
    "lookup": {
      "map": {
        "Algeria": "EMEA",
        "Faroe Islands": "EMEA",
        "Luxembourg": "EMEA",
        "Andorra": "EMEA",
        "Cameroon": "EMEA",
        "Burkina Faso": "EMEA",
        "Benin": "EMEA",
        "Bahrain": "EMEA",
        "Aland Islands": "EMEA",
        "Australia": "AU/NZ",
        "Iceland": "EMEA",
        "Cape Verde": "EMEA",
        "Germany": "EMEA",
        "Bosnia and Herzegovina": "EMEA",
        "Belgium": "EMEA",
        "Netherlands": "EMEA",
        "Central African Republic": "EMEA",
        "Belarus": "EMEA",
        "Albania": "EMEA",
        "Switzerland": "EMEA",
        "New Zealand": "AU/NZ",
        "Bulgaria": "EMEA",
        "Greenland": "EMEA",
        "Angola": "EMEA",
        "Chad": "EMEA",
        "Svalbard and Jan Mayen": "EMEA",
        "Comoros": "EMEA",
        "Austria": "EMEA",
        "Burundi": "EMEA",
        "Botswana": "EMEA"
      },
      "type": "map",
      "isOneToOne": false
    }
  }
}

VS

{
    "type": "lookup",
    "dimension": "country",
    "outputName": "country_group",
    "outputType": "STRING",
    "retainMissingValue": true,
    "lookup": {
      "type": "map",
      "map": {
        "Australia": "AU/NZ",
        "New Zealand": "AU/NZ",
        "Aland Islands": "EMEA",
        "Belgium": "EMEA",
        "Faroe Islands": "EMEA",
        "Greenland": "EMEA",
        "Iceland": "EMEA",
        "Luxembourg": "EMEA",
        "Netherlands": "EMEA",
        "Svalbard and Jan Mayen": "EMEA",
        "Austria": "EMEA",
        "Germany": "EMEA",
        "Switzerland": "EMEA",
        "Albania": "EMEA",
        "ALgeria": "EMEA",
        "Andorra": "EMEA",
        "Angola": "EMEA",
        "Bahrain": "EMEA",
        "Belarus": "EMEA",
        "Benin": "EMEA",
        "Bosnia and Herzegovina": "EMEA",
        "Botswana": "EMEA",
        "Bulgaria": "EMEA",
        "Burkina Faso": "EMEA",
        "Burundi": "EMEA",
        "Cameroon": "EMEA",
        "Cape Verde": "EMEA",
        "Central African Republic": "EMEA",
        "Chad": "EMEA",
        "Comoros": "EMEA"
      },
      "isOneToOne": false
    }
  }

Here you can see the filter works, but the column values for country group are not populating.
screen shot 2018-05-02 at 2 27 16 pm

@SpyderRivera
Copy link
Contributor

@jasnovak Is the intent to have each lookup have an exFn? Or could the exFn be applied at filter time only?

@jasnovak
Copy link
Contributor Author

jasnovak commented May 3, 2018

@jeffreythewang yes, the extraction functions are defined in the druid datasource configuration, e.g. "Edit Druid Datasource"

@jasnovak
Copy link
Contributor Author

jasnovak commented May 3, 2018

@SpyderRivera I'm not sure I understand your question. The PR is just for folks that want to use the extraction functions -- eg. mapping or regex, in their column definitions. Without this PR, if you filtered on a column that used a mapping or regex extraction fn in its spec, the filter would fail.

As far the 'lookup' type syntax, I understand they are an experimental feature in Druid per http://druid.io/docs/latest/querying/dimensionspecs.html. I don't know if Superset supports them, looks like maybe not. However, this PR doesn't address lookups. I'm just adding support for extraction functions.

@SpyderRivera
Copy link
Contributor

I see. I was hoping it would fix the filters not working on lookups. Nvm

@jasnovak
Copy link
Contributor Author

jasnovak commented May 3, 2018

Just curious, is there a reason you can't define your map using extraction functions?

@SpyderRivera
Copy link
Contributor

You mean like I did above? Is there a better way?

@jasnovak
Copy link
Contributor Author

jasnovak commented May 4, 2018

Assuming what you want is a column with mapped values, you can use your first definition above, with the lookup in the extraction function, to do that. This PR will allow those mapped values in the column to be used in a filter -- without it, the filter will fail.

@mistercrunch mistercrunch merged commit e29beba into apache:master May 9, 2018
@SpyderRivera
Copy link
Contributor

@jasnovak I can get the filter to work with this change OR the map to work, but not both at the same time. I can create two columns one to filter on and the other to group by. However, that is an ugly hack. Have you been successful with both working?

@jasnovak
Copy link
Contributor Author

jasnovak commented May 9, 2018

Yes, we do both: we create a mapped column using an extraction function, and use it to filter. Do you click both the "groupable" and "filterable" box when you create your column?

Also, perhaps there's something amiss in your definition? Try defining like this:

{
"type": "extraction",
"dimension": "country",
"outputName": "country_group",
"outputType": "STRING",
"extractionFn": {
"type": "lookup",
"dimension": "dimensionName",
"outputName": "dimensionOutputName",
"replaceMissingValueWith": "missing_value",
"retainMissingValue": false,
"lookup": {
"type": "map",
"map": {
"Algeria": "EMEA",
"Faroe Islands": "EMEA",
"Luxembourg": "EMEA",
...

@SpyderRivera
Copy link
Contributor

Type "extraction" appears to be where I was failing. Thanks so much!bitmoji

michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features
mistercrunch pushed a commit to lyft/incubator-superset that referenced this pull request May 30, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
mistercrunch pushed a commit to lyft/incubator-superset that referenced this pull request May 30, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this pull request May 31, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features
mistercrunch pushed a commit to lyft/incubator-superset that referenced this pull request Jun 4, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Jun 7, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Jun 7, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Jun 7, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
mistercrunch pushed a commit to lyft/incubator-superset that referenced this pull request Jun 7, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
hughhhh pushed a commit to lyft/incubator-superset that referenced this pull request Jun 27, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features

(cherry picked from commit e29beba)
@srggrs
Copy link
Contributor

srggrs commented Sep 19, 2018

Guys thank you for this post!
I tried to use the extraction function as per below:

{
"type" : "extraction",
"dimension" : "myid",
"outputName" : "myname",
"outputType" : "STRING",
"extractionFn" : {
"type" : "lookup",
"dimension" : "myid",
"outputName" : "myname",
"retainMissingValue" : false,
"replaceMissingValueWith" : "missing",
"name" : "myid"
}
}

but I get an error:
HTTP Error 500: Internal Server Error Druid Error

I'm not sure what I'm doing wrong.
I implemented the druid lookup as a globally cached lookup. JSON below it works for the gropby but not filtering though.

{
"type" : "lookup",
"dimension" : "myid",
"outputName" : "myname",
"retainMissingValue" : false,
"replaceMissingValueWith" : "missing",
"name" : "myid"
}

wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* add extraction fn support for Druid queries

* bump pydruid version to get extraction fn commits

* update and add tests for druid for filters with extraction fns

* conform to flake8 rules

* fix flake8 issues

* bump pyruid version for extraction function features
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.26.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.26.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants