Google BigQuery Support #945

Closed
rlanda opened this issue Aug 15, 2016 · 76 comments · Fixed by #4917
Labels
enhancement:request Enhancement request submitted by anyone from the community

Comments

@rlanda

rlanda commented Aug 15, 2016

Currently SQLAlchemy does not support a BigQuery dialect. Is there any way that BigQuery sources can be added to Caravel, apart from extending SQLAlchemy to support BigQuery?

@xrmx
Contributor

xrmx commented Aug 15, 2016

@rlanda take a look at the Druid support if you want an example of how this could be done.

@xrmx xrmx added the enhancement:request Enhancement request submitted by anyone from the community label Aug 15, 2016
@burdiyan

Does anyone have any news on this?

@hunt3r

hunt3r commented Feb 21, 2017

Yeah we're looking at this as well.

@maxcorbeau

Hi guys, any news on this? Thanks.

@hunt3r

hunt3r commented Apr 9, 2017

Take a look at this thread. Some progress has been made on getting SQLAlchemy to work with BigQuery, but it seems like it's not ready yet.
https://groups.google.com/forum/#!topic/sqlalchemy/-DzCGKFA9h4

There are also these projects:
https://pypi.python.org/pypi/sqlalchemy_bigquery/
https://github.com/cpdean/sqlalchemy-bigquery

@mistercrunch
Member

mistercrunch commented Apr 10, 2017

There are two approaches here. One is going through SQLAlchemy (which may or may not be possible, depending on how messed up the GSQL dialect is); the thing is, we only need a subset of the dialect to work (no need for DML, joins, ...). The other option is to create a brand new Superset connector for it.

I've done some work recently to make the connectors interface more extensible. It's a matter of deriving the BaseTable, BaseColumn and BaseMetric models, creating the related ModelViews, and implementing a handful of methods (ok, well, a bit more than that...). The most complex one is the query method, where you get a time filter, an array of metrics and columns to group by, a set of filters, and whether the query is a time series or not, and somehow have to return a pandas DataFrame.
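
Roughly, a sketch of the shape this takes (class and method names here are illustrative, not the actual interface):

import pandas as pd

class BigQueryColumn(object):      # would derive from BaseColumn
    pass

class BigQueryMetric(object):      # would derive from BaseMetric
    pass

class BigQueryDatasource(object):  # would derive from BaseTable
    def query(self, query_obj):
        # query_obj carries the time filter, metrics, groupby columns,
        # filters and an is_timeseries flag; the contract is to hand
        # back a pandas DataFrame
        sql = self.build_sql(query_obj)  # hypothetical helper
        return pd.read_gbq(sql)          # e.g. via the pandas-gbq package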

A solid Python programmer familiar with Flask/Django should be able to implement this in a matter of days, perhaps a week or two. I'd be open to mentoring someone through the process. I'm sure we could host the engineer at Airbnb for a week while working on this task. Who's up for the challenge!?

@joshuacano

joshuacano commented May 1, 2017

I would be open to working on this, if it has not been done already. I'm familiar with Python and am using Superset at my company internally to display various information from Redshift (but would like to move us to BQ).

@mistercrunch
Member

@joshuacano I'd be happy to help you out while you work on this. I see you are in SF, and we love to host at Airbnb, so we could host you for coding sessions or for a few days while you crank that out.

I can also help improve and clarify the connector interface while working on this so that it's more straightforward to add the next connectors.

@joshuacano

Wow, it would be a real honor to meet you @mistercrunch. I'd love to come over for a few coding sessions. Maybe I can get familiar with the code base over the next few days and ping you on Gitter once I have an idea of where to get started. Does that sound good?

@Bamieh

Bamieh commented May 4, 2017

@joshwalters @mistercrunch that's great. I tried to tackle this issue on my own, but I believe having a Spark connector instead of a BigQuery one is a much neater solution, since you can connect Spark to BigQuery and much more: Hadoop, Elasticsearch, etc.

@lilila

lilila commented May 4, 2017

@Bamieh, what do you use to connect Spark to BigQuery?

@joshuacano

joshuacano commented May 12, 2017

@mistercrunch I've started a bit of work on this. Not sure if you use Gitter, but I wrote you some questions on there to just point me in the right direction. Thanks!

@mistercrunch
Member

[copied from gitter to keep the conversation here]
joshuacano @joshuacano May 11 22:10
I was just reviewing adding the connector, and it seems like this involves essentially mimicking the behavior in superset/connectors/druid OR superset/connectors/sqla, but adapting it to use BQ. However, you had indicated that you would make it a bit easier to add a new connector? Do you have any guidance for me here? Thanks so much again!

@mistercrunch
Member

mistercrunch commented May 13, 2017

So two main approaches:

  1. Try to make BQ comply with what it needs to go through the sqla route in Superset. This requires essentially two things:
  • a SQLAlchemy dialect; that's easy-ish assuming that BQ SQL is close enough to ANSI SQL. Note that we only need a subset of the dialect defined, only the query part and the methods to fetch metadata (tables, columns, views, ...)
  • a DBAPI interface: DBAPI is a Python standard for how database interfaces should look: connection and cursor objects, and a set of standard methods (see the sketch below)

The pyhive project does both of these things on top of both Presto and Hive and can be a source of inspiration here.

  2. The second approach would be to create a new big_query package under connectors, and create a brand new connector. This approach is a lot more flexible, but it's much more code to write and maintain, and it only helps Superset, whereas SQLAlchemy/DBAPI support helps BigQuery integration in other tools...

This has been discussed here:
googleapis/google-cloud-python#2434
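
For reference, the DBAPI surface mentioned in (1) is roughly this shape (a hand-wavy sketch assuming a recent google-cloud-bigquery client; a real driver also needs description, rowcount, parameter binding, error classes, etc.):

from google.cloud import bigquery  # assumption: Google's BigQuery client library

class Cursor(object):
    def __init__(self, client):
        self._client = client
        self._iter = iter([])

    def execute(self, operation, parameters=None):
        # run the query synchronously and keep an iterator over the result rows
        self._iter = iter(self._client.query(operation).result())

    def fetchone(self):
        return next(self._iter, None)

    def fetchall(self):
        return list(self._iter)

class Connection(object):
    def __init__(self, project=None):
        self._client = bigquery.Client(project=project)

    def cursor(self):
        return Cursor(self._client)

def connect(project=None):  # a module-level connect() is what PEP 249 requires
    return Connection(project=project)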

@joshuacano

So perhaps the interim solution for me is to go down the connector route, and then, depending on the discussion in the issue referenced above, it should be fairly trivial to just turn off the bigquery connector section once that is done. However, it does seem like you might prefer option #1 (and it might be the more long-term solution), so I'll defer to you on what direction I should head.

@mistercrunch
Member

I'd go with solution 1, starting by forking that repo and moving it forward to where it needs to be. This would become a largish PR against that repo.

@mistercrunch
Member

BTW Druid added support for SQL and we'll be going down the same route eventually.

@vincentwongso

Any updates on this feature?

@tswast

tswast commented Jun 5, 2017

@vincentwongso I think this is currently blocked on my PR googleapis/google-cloud-python#2921

It's got a bit more work required before it can merge: specifically, query parameters, executemany(), fetchmany(), and fetchall(). Once that DB-API module is available, it should be easy to create a module to add SQLAlchemy support and thus Superset support.

@mxmzdlv
Contributor

mxmzdlv commented Jun 7, 2017

If anyone needs this feature ASAP, you can try the interface that I've implemented to use BigQuery in Superset: https://github.com/mxmzdlv/pybigquery

It is basic, but querying and query generation work fine. There is no support for the Record type, so you'll have to flatten your schema using views.

You will also need to include this snippet of code in superset_config.py for Superset's Time Grain to work. I haven't created a PR as I am not sure if it is a complete engine spec.

Some issues that I've encountered while developing the interface:

  1. Not sure how to reference datasets. For now I've made them part of a table id, so the table name must contain the dataset, e.g. dataset.table
  2. As I understand it, the GROUP BY clause in BigQuery must reference expressions by their labels, e.g.:
SELECT TIMESTAMP_TRUNC(TIMESTAMP(ts), DAY) AS __timestamp, COUNT(*) AS count
FROM dataset.table
GROUP BY __timestamp

So I had to use this hack to make it work. Not sure if there is a better way.
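
In SQLAlchemy Core terms, the hack boils down to grouping by the label text instead of the expression (sketch only; table and column names are placeholders):

from sqlalchemy import Column, DateTime, MetaData, Table, func, literal_column, select

table = Table('table', MetaData(schema='dataset'), Column('ts', DateTime))
ts_day = func.timestamp_trunc(func.timestamp(table.c.ts), literal_column('DAY'))
query = (
    select([ts_day.label('__timestamp'), func.count().label('count')])
    .group_by(literal_column('__timestamp'))  # group by the alias, not the expression
)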

@alamhanz

Thank you @mxmzdlv, it's working, but I want to know about Time Grain. How do I activate it?

@mxmzdlv
Contributor

mxmzdlv commented Jun 12, 2017

@alamhanz You'll need to create superset_config.py and put this code in it. Make sure that you have the wrapt module installed (pip install wrapt) and that superset_config.py is in your PYTHONPATH.

Otherwise you can just add the BQEngineSpec class from the snippet to the superset/superset/db_engine_specs.py module.
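
For reference, the linked snippet is roughly of this shape (abridged sketch, not the complete spec; Grain is the namedtuple used in db_engine_specs):

from superset.db_engine_specs import BaseEngineSpec, Grain

class BQEngineSpec(BaseEngineSpec):
    engine = 'bigquery'
    time_grains = (
        Grain('Time Column', 'Time Column', '{col}'),
        Grain('day', 'day', 'TIMESTAMP_TRUNC({col}, DAY)'),
        Grain('week', 'week', 'TIMESTAMP_TRUNC({col}, WEEK)'),
        Grain('month', 'month', 'TIMESTAMP_TRUNC({col}, MONTH)'),
    )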

@alamhanz

Thank you. It is working when I put the BQEngineSpec script in superset/superset/db_engine_specs.py.

@alamhanz

But, I can't use SQL LAB.
"Failed to start remote query on worker. Tell your administrator to verify the availability of the message queue."
This message pops up after I run my query. What should I do?

@mxmzdlv
Contributor

mxmzdlv commented Jun 12, 2017

@alamhanz I think this is because your Celery is not configured. Try disabling Allow Run Async in the database settings in Superset.

@alamhanz

@mxmzdlv What do you mean by "Celery is not configured"? And how can I disable Allow Run Async?

@mistercrunch
Member

@alamhanz check out the installation docs around SQL Lab:
http://superset.apache.org/installation.html#sql-lab

Allow Run Async is a database configuration parameter. Edit the database entry for BigQuery and uncheck the box.

mistercrunch added a commit to mistercrunch/superset that referenced this issue Jul 27, 2017
@sachitperi

sachitperi commented May 1, 2018

@tswast @darylerwin @mistercrunch @mxmzdlv @Bamieh
Thanks to you guys (and your guidance) I was finally able to connect my Superset to BigQuery on Windows.
For some reason I was not able to set the env variable through the command prompt or Windows PowerShell.
I did it manually through My Computer / Properties / Settings / Environment Variables.
I had to restart Superset to make it work, though.

@Bamieh I did not try using Docker for running this. I have only heard about Docker and have never practically used it or worked with it. It would be great if you could point me to some resources on it so that I can learn.

Now I can see tables and get their count and other things. Please refer to the image below.
[screenshot: table from BigQuery]

But when I try to run the same query from SQL Lab I am not able to get any result: it just keeps on running and times out. The same happens if I try to query the default 'main' DB that gets installed for visualizing World Bank Data and other dashboards. I get the following error:

Module 'signal has no attribute 'SILGRAM'
and the query keeps on running forever.
[screenshot: querying table from SQL Lab]

Currently I am creating dashboards using a manual process (uploading CSVs).

Please help me resolve this error.

Also, is there a way to share a dashboard made in Superset with others who don't have Superset (or even Python) installed? Can I share it privately within the organization I work in?

@mxmzdlv
Contributor

mxmzdlv commented May 1, 2018

@swamy16 seems to be an issue with calling signal(signal.SIGALRM) — SIGALRM is not available on Windows. https://github.com/apache/incubator-superset/blob/d533ce09676921ec2f72e1032bbca0d2a37f65b6/superset/utils.py#L508

I think the easiest way to handle this would be to run Superset in Docker / Vagrant, or just install VirtualBox with Linux on it and run Superset from there.
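
A minimal illustration of the failure mode and the kind of guard that would be needed (sketch only, not Superset's actual timeout code):

import signal

def timeout_handler(signum, frame):
    raise Exception('query timed out')

if hasattr(signal, 'SIGALRM'):               # POSIX only
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(30)                         # fire after 30 seconds
else:
    # on Windows signal.SIGALRM does not exist, hence the AttributeError;
    # a thread-based timeout would be needed here instead
    pass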

@sachitperi

@mxmzdlv thanks for the help. I shall try installing Superset in VirtualBox.

Also, is it possible to share the dashboards I create in Superset with others in my organization who don't have Superset or Python installed?

mistercrunch added a commit to mistercrunch/superset that referenced this issue May 1, 2018
@darylerwin

@swamy16 is that a typo in the alarm name, and perhaps related to this Windows signal Python issue?
Unity-Technologies/ml-agents#7

Basically it is timing out? Is the dataset too big?

@sachitperi

sachitperi commented May 2, 2018

@darylerwin @mistercrunch

Sorry, it is a typo from my end. The actual error is SIGALRM, as mentioned by @mxmzdlv:
Module 'signal' has no attribute 'SIGALRM'
The tables are small (max size in the GBs, and <30GB at that) as of now, but they can become big (as large as TBs) in the future. How big a dataset can Superset handle?

Basically, I am able to add tables to Superset and visualize them in dashboards (be they tables from BigQuery or the default main DB "world bank dataset"), but I am not able to query them (even the main DB tables) from Superset's SQL Lab.

[screenshot]

As you can see in the highlighted part of the above image, "tsukuyomi" is a dataset in my BigQuery and "2017_agency_mapping" is a table. I am able to create a slice view using tables from my BigQuery.

[screenshot]

As you can see in the above image, I am querying the main db of the default World Bank data, but it is not able to run the same query in SQL Lab's SQL editor.
@mxmzdlv mentioned that this is happening due to SIGALRM not working on Windows.

But technically shouldn't this error also happen when I am creating a slice, since it is effectively running the same query?

Please let me know if there is a way to solve this error on Windows.
I tried to replicate what ASPePeX mentioned in the link you gave, Unity-Technologies/ml-agents#7, but I am not able to locate the file on my computer:
/python/unityagents/environment.py
I am also not sure if this is the right approach.

@mistercrunch
Member

I wouldn't recommend running on Windows since Superset isn't tested against it. You will hit problems like this one. Even if you fixed this issue, there could be regressions in the future since no one uses Windows to run Superset in production.

michellethomas pushed a commit to michellethomas/panoramix that referenced this issue May 24, 2018
Add BigQuery engine specifications (apache#3193)

As contributed by @mxmzdlv on issue apache#945
michellethomas pushed a commit to michellethomas/panoramix that referenced this issue May 24, 2018
@dirkbosman

Just to make sure...one can now use BigQuery via SQLAlchemy in Superset?

Not seeing any updates in the documentation: https://github.com/apache/incubator-superset yet

@darylerwin

darylerwin commented May 24, 2018 via email

@dirkbosman

Great! Thanks for the quick reply :) If I can add my 2 cents: it would be great to see an example of how to do the complete set-up in steps, and perhaps an example covering querying a public BQ dataset: https://cloud.google.com/bigquery/public-data/ It would help to explain the difference between running queries in standard SQL vs. legacy SQL.

@mistercrunch
Member

@darylerwin @dirkjohannesbosman I just updated the docs @ http://superset.apache.org/installation.html#database-dependencies

@sachitperi

@dirkjohannesbosman
as mentioned by @darylerwin and @mistercrunch, it is easy to use BigQuery via SQLAlchemy. You will have to use standard SQL for querying, though, which is quite different from legacy SQL.
Let me know if you face any issues in connecting to BigQuery.
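
As a quick sanity check, something like this should work end to end (assumes pip install pybigquery, working application-default credentials, and a placeholder project id; pybigquery issues standard SQL):

from sqlalchemy import create_engine

engine = create_engine('bigquery://my-project')  # 'my-project' is a placeholder
rows = engine.execute(
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 5')
for row in rows:
    print(row)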

@dirkbosman

Great, thanks guys for the reply and the docs update 👍

@amodig

amodig commented May 30, 2018

How do you add BigQuery tables in the Sources -> Tables view? With standard SQL the string is usually project.dataset.table, and I tried to input them in multiple ways in the Schema and Table Name fields. The error I get is either
Table [myproject.mydataset.mytable] could not be found, please double check your database connection, schema, and table name
or
Table [mydataset.mytable] could not be found, please double check your database connection, schema, and table name

My BQ connection works and I can make queries in SQL Lab. I'm using pybigquery.

@sachitperi

Hi @amodig, not sure I understood your question correctly.
You might be entering it the wrong way.
In Superset, when you click on Add New Table:
Database: select the name you gave to your BigQuery connector
Schema: select the name of the dataset in your BigQuery project (each of your projects can contain multiple datasets)
Table Name: select the name of the table present in your dataset

Let me know if this helps.

@Bamieh

Bamieh commented May 30, 2018

@mistercrunch I don't want to be salty, but can you lock this thread since it has been closed and resolved? Any issues related to this can be opened as new issues.

@amodig

amodig commented May 30, 2018

@swamy16 Ah, never mind. I noticed a GCP service account permission issue with SQL Lab: the connection credentials didn't have enough permissions to query BQ. I put the dataset name into the Schema field and it worked, thanks!

timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this issue May 31, 2018
@bzhr

bzhr commented Jul 13, 2018

What is the status of this issue? Is somebody using Superset with BigQuery in production?

  • Is the connector based on pybigquery?
  • Is the missing Record type a big issue?

I'm evaluating Superset for my next project...

@mistercrunch
Member

@Bamieh I don't think it's possible to lock a GitHub issue. @bzhr the state is closed since it's been resolved; this thread shows that people have succeeded in using Superset against BQ.

@Bamieh

Bamieh commented Jul 15, 2018

@mistercrunch There is a button under Participants to lock the conversation, but I believe it needs certain privileges.

Here is an example from a repo I own:
[screenshot]

@mistercrunch
Member

Oh good, I had not discovered that feature.

@apache apache locked as resolved and limited conversation to collaborators Jul 16, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this issue Nov 16, 2018