Rockset data source improvements #4076

Closed
42 changes: 32 additions & 10 deletions redash/query_runner/rockset.py
@@ -1,3 +1,5 @@
+from multiprocessing.pool import ThreadPool
+
 import requests
 from redash.query_runner import *
 from redash.utils import json_dumps
@@ -23,7 +25,8 @@ def __init__(self, api_key, api_server):
         self.api_server = api_server

     def _request(self, endpoint, method='GET', body=None):
-        headers = {'Authorization': 'ApiKey {}'.format(self.api_key)}
+        headers = {'Authorization': 'ApiKey {}'.format(self.api_key),
+                   'User-Agent': 'rest:redash/1.0'}
         url = '{}/v1/orgs/self/{}'.format(self.api_server, endpoint)

         if method == 'GET':
@@ -35,9 +38,17 @@ def _request(self, endpoint, method='GET', body=None):
         else:
             raise 'Unknown method: {}'.format(method)

-    def list(self):
-        response = self._request('ws/commons/collections')
-        return response['data']
+    def list_workspaces(self):
+        response = self._request('ws')
+        return [x['name'] for x in response['data'] if x['collection_count'] > 0]
+
+    def list_collections(self, workspace='commons'):
+        response = self._request('ws/{}/collections'.format(workspace))
+        return [x['name'] for x in response['data']]
+
+    def collection_columns(self, workspace, collection):
+        response = self.query('DESCRIBE "{}"."{}"'.format(workspace, collection))
+        return list(set([x['field'][0] for x in response['results']]))

     def query(self, sql):
         return self._request('queries', 'POST', {'sql': {'query': sql}})
@@ -76,12 +87,23 @@ def __init__(self, configuration):
             'api_server', "https://api.rs2.usw2.rockset.com"))

     def _get_tables(self, schema):
-        for col in self.api.list():
-            table_name = col['name']
-            describe = self.api.query('DESCRIBE "{}"'.format(table_name))
-            columns = list(set(map(lambda x: x['field'][0], describe['results'])))
-            schema[table_name] = {'name': table_name, 'columns': columns}
-        return schema.values()
+        pool = ThreadPool(processes=10)
Member

In some configurations (and this is going to become the default) we run Redash's API server with gevent's monkeypatching. I'm not sure how "healthy" it is to mix threads with gevent.

How about we land this PR without the threads optimization and then revisit this in a follow-up?

Contributor Author

Hey @arikfr, thanks for the feedback. Given that future direction, would it make sense to replace this multithreaded approach with grequests? I was originally leaning in that direction actually but figured this would be a less invasive change.

Member

At the moment the default deployment doesn't use gevent and we don't have a timeline for the switch. So for now I suggest we don't use anything that depends on it.
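
For illustration only, here is a minimal sketch of what the suggested variant without threads or gevent might look like: the same workspace, collection and column traversal as the diff, done with plain sequential loops. `FakeRocksetAPI` and `get_tables_sequential` are hypothetical names invented for this sketch; the stub only mimics the three client methods added in this PR so the example runs without network access, and none of it is code from the PR itself.

```python
class FakeRocksetAPI(object):
    """Hypothetical in-memory stand-in for the PR's RocksetAPI client."""

    def __init__(self, data):
        # data maps workspace name -> {collection name: [column names]}
        self.data = data

    def list_workspaces(self):
        # Mirror the diff: skip workspaces that hold no collections.
        return [w for w, cols in self.data.items() if cols]

    def list_collections(self, workspace='commons'):
        return list(self.data[workspace])

    def collection_columns(self, workspace, collection):
        return list(self.data[workspace][collection])


def get_tables_sequential(api, schema):
    """Build the schema with plain loops: no ThreadPool, no gevent."""
    for workspace in api.list_workspaces():
        for collection in api.list_collections(workspace):
            # Same naming rule as the diff: 'commons' collections stay unqualified.
            if workspace == 'commons':
                table_name = collection
            else:
                table_name = '{}.{}'.format(workspace, collection)
            columns = api.collection_columns(workspace, collection)
            schema[table_name] = {'name': table_name, 'columns': columns}
    return sorted(schema.values(), key=lambda x: x['name'])
```

The trade-off is one blocking request per workspace and per collection, so a schema refresh would be slower than the pooled version, but nothing here interacts with gevent's monkeypatching.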


+        try:
+            workspaces = self.api.list_workspaces()
+            collections = pool.map(self.api.list_collections, workspaces)
+
+            args = [(w, c) for (w, cols) in zip(workspaces, collections) for c in cols]
+            describe_results = pool.map(lambda (w, c): self.api.collection_columns(w, c), args)
+
+            for (w, c), columns in zip(args, describe_results):
+                table_name = c if w == 'commons' else '{}.{}'.format(w, c)
+                schema[table_name] = {'name': table_name, 'columns': columns}
+
+            return sorted(schema.values(), key=lambda x: x['name'])
+
+        finally:
+            pool.close()

     def run_query(self, query, user):
         results = self.api.query(query)