
Feature: async rpc (#1706) #1735

Merged: 10 commits, Sep 13, 2019
Conversation

beckjake (Contributor):
Fixes #1706

Implement async tasks, mostly as described in #1706, with one big exception: on errors, an error object is returned directly instead of "status": "error" - that was simply much easier to implement.

All the identified async methods now return a request_token instead of their results.
Adds a poll method that takes one mandatory argument, the request_token, and returns a PollResult on success, or the same error as before if the polled task encountered an error.

  • A PollResult always has a status field, which is one of ['not started', 'initializing', 'running', 'success'].
  • If the status is success, the result also includes request-type-specific result data.
  • Polling for a missing request token is an error.
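As a concrete illustration of the flow above, here is a minimal, hypothetical polling loop. The method and field names follow this description, but the transport callable (`send`) and the in-process fake server are invented for the sketch.

```python
import time

def poll_until_done(send, request_token, interval=0.0):
    """Poll the `poll` method until the task reaches a terminal state.

    `send` is any callable that performs a JSON-RPC exchange and returns
    the decoded response dict (transport details are out of scope here).
    """
    while True:
        response = send({"method": "poll", "params": {"request_token": request_token}})
        if "error" in response:
            # missing request tokens (and failed tasks) come back as errors
            raise RuntimeError(response["error"])
        result = response["result"]
        if result["status"] == "success":
            # terminal state: carries the request-type-specific result data
            return result
        # otherwise: 'not started', 'initializing', or 'running' -> keep polling
        time.sleep(interval)

# Fake in-process "server" walking through the documented states:
_states = iter(["not started", "running", "success"])

def fake_send(payload):
    status = next(_states)
    body = {"status": status}
    if status == "success":
        body["results"] = ["request-type-specific data"]
    return {"result": body}

final = poll_until_done(fake_send, "token-abc123")
```

The loop treats anything other than success as "keep waiting", mirroring the status list above.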

By default, garbage collection is checked as part of every JSON-RPC method call:

  • maxsize defaults to 1000. If there are more than maxsize tasks, the oldest (number_of_tasks - maxsize) tasks that have finished are cleaned up.
  • auto_reap_age defaults to 30 days.
  • reapsize defaults to 500. If there are more than reapsize tasks, all tasks that are older than the current auto_reap_age are garbage collected.

All those numbers are totally invented by me and are very much arbitrary.
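The two reaping rules can be sketched as follows. This is an illustrative model, not the PR's code; the task bookkeeping (a token-to-finish-time mapping) and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

def reap(tasks, maxsize=1000, reapsize=500, auto_reap_age=timedelta(days=30), now=None):
    """tasks: mapping of request_token -> finish time (None if still running).

    Returns the surviving tasks after applying both default GC rules.
    """
    now = now or datetime.now(timezone.utc)
    # finished tasks, oldest first
    finished = sorted(
        (tok for tok, ended in tasks.items() if ended is not None),
        key=lambda tok: tasks[tok],
    )
    survivors = dict(tasks)
    # Rule 1: over maxsize -> drop the oldest finished tasks beyond the cap
    excess = len(survivors) - maxsize
    for tok in finished[:max(excess, 0)]:
        survivors.pop(tok)
    # Rule 2: over reapsize -> drop finished tasks older than auto_reap_age
    if len(survivors) > reapsize:
        for tok in finished:
            if tok in survivors and now - tasks[tok] > auto_reap_age:
                survivors.pop(tok)
    return survivors
```

With the default thresholds, small task tables pass through untouched; only a busy long-lived server triggers either rule.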

I also added a gc method that forces garbage collection and allows you to tweak GC settings:

  • the task_ids parameter defaults to None; it is a list of task IDs to garbage collect directly.
  • the before parameter defaults to None; it is a datestamp, and all tasks that finished before it will be garbage collected.
  • the settings parameter defaults to None; it accepts an object with the schema below, which sets the internal GC tuning values.

GCSettings schema:

{
  'type': 'object',
  'additionalProperties': False,
  'properties': {
    'auto_reap_age': {'type': 'number'},
    'maxsize': {'type': 'integer'},
    'reapsize': {'type': 'integer'}
  },
  'required': ['maxsize', 'reapsize', 'auto_reap_age'],
}
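For illustration, here is a stdlib-only check that mirrors the schema above (in practice a schema library such as jsonschema would enforce this); the function name is invented.

```python
def validate_gc_settings(obj):
    """Check `obj` against the GCSettings schema above; return it if valid."""
    required = {"maxsize", "reapsize", "auto_reap_age"}
    if not isinstance(obj, dict):
        raise TypeError("settings must be an object")
    extra = set(obj) - required  # additionalProperties: False
    if extra:
        raise ValueError("additionalProperties not allowed: %s" % sorted(extra))
    missing = required - set(obj)  # all three keys are required
    if missing:
        raise ValueError("missing required keys: %s" % sorted(missing))
    if not isinstance(obj["maxsize"], int) or not isinstance(obj["reapsize"], int):
        raise TypeError("maxsize and reapsize must be integers")
    if not isinstance(obj["auto_reap_age"], (int, float)):
        raise TypeError("auto_reap_age must be a number")
    return obj
```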

@@ -20,7 +20,7 @@
 DEFAULT_THREADS = 1
 DEFAULT_PROFILES_DIR = os.path.join(os.path.expanduser('~'), '.dbt')
 PROFILES_DIR = os.path.expanduser(
-    os.environ.get('DBT_PROFILES_DIR', DEFAULT_PROFILES_DIR)
+    os.getenv('DBT_PROFILES_DIR', DEFAULT_PROFILES_DIR)
beckjake (Contributor, Author) commented Sep 10, 2019:

I changed all os.environ.get -> os.getenv for grep-ability reasons when I was adding some secret magic threading-related flags to the RPC server.

        return iter(self.results)

    def __getitem__(self, idx):
        return self.results[idx]
beckjake (Contributor, Author) commented:
These changes make results a lot like the old thing where we returned result.results, which meant I didn't have to update a million tests. Also, it just seemed like a nice thing.
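The pattern being described is plain sequence delegation. A minimal sketch (the attribute and method names follow the diff hunk; the rest is illustrative, not dbt's actual class):

```python
class ExecutionResult:
    """Wraps a list of node results while staying iterable/indexable itself."""

    def __init__(self, results):
        self.results = results

    def __len__(self):
        return len(self.results)

    def __iter__(self):
        # delegate iteration to the underlying list
        return iter(self.results)

    def __getitem__(self, idx):
        # result[i] behaves like the old result.results[i]
        return self.results[idx]

res = ExecutionResult(["model_a", "model_b"])
```

Callers that used to write `result.results[i]` can now write `result[i]` without any other changes.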

drewbanin (Contributor) left a comment:
this is just.... chef's kiss 👨‍🍳

I need to do another pass or two through the code here, but I found some minor things that are worth tweaking while smoke testing some requests.

  • it looks like some types of request validation errors only show up in the poll method, but I would expect them to fail at request-time. One such example: submitting a run task without specifying a name. I'd expect the request to fail, but it's accepted and the failure is only apparent when we poll for the request ("handle_request() missing 1 required positional argument: 'name'"). Is this a minor change, or is there a good reason why it should work the way it's implemented here?
  • There are some superfluous log lines that we should suppress:
    • lots of Snowflake OCSP Responder noise (This connection is in OCSP Fail Open Mode. TLS Certificates would be checked...) [channel=snowflake.connector.connection, snowflake.connector.cursor, snowflake.connector.json_result]
    • some boto3-looking lines: (Starting new HTTPS connection (1): ...) [channel=botocore.vendored.requests.packages.urllib3.connectionpool]
  • the elapsed time in ps shows up as negative -- I see values like -14349.432432. Would that also impact garbage collection / task reaping?
  • I'm seeing a 500 error when I hit the poll method for one particular project. I'm not sure exactly what's going wrong, but I can investigate further. This happens for run_project tasks, but not run tasks. I think this has something to do with control characters in the log lines. Some of those are json-serializable, but others are not? This one in particular causes problems:
            {
                "timestamp": "2019-09-11T02:09:43.924055Z",
                "message": "\u001b[33mWARNING: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.\nThere are 1 unused configuration paths:\n- models.client.attribution.transform\n\u001b[0m",
                ...

^ this warning occurs during compilation, but it appears to be served up correctly by calls to status. I think this has something to do with logging because a poll request for this task will succeed unless I specify logs: true while the task is running. When the task completes, polling the task appears to always result in a 500 error. I saw this on BigQuery FWIW, but it may be project-specific. Thoughts: might be worth turning off colorama for structured logs; would also be good to use a big-ol try-catch so we knew where this 500 was coming from.
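One plausible mitigation for the control-character issue (a generic sketch, not necessarily dbt's eventual fix): strip ANSI color escape sequences from messages before they are serialized into structured logs.

```python
import re

# Matches ANSI SGR color sequences such as "\x1b[33m" and "\x1b[0m"
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(message):
    """Remove color escape sequences so the message is plain, safe text."""
    return ANSI_ESCAPE.sub("", message)
```

Disabling colored output at the source (the "turn off colorama for structured logs" idea above) achieves the same end without post-processing.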

Let me have a think about garbage collection / reaping / the gc endpoint / etc. Will follow up here with some additional thoughts. Overall though, you did a really incredible job here @beckjake :)

-def _task_bootstrap(task, queue, kwargs):
+def _task_bootstrap(
+    task: RPCTask,
+    queue,  # typing: Queue[Tuple[QueueMessageType, Any]]
drewbanin (Contributor) commented:
what's the difference between this and the other type hints?

beckjake (Contributor, Author) replied Sep 11, 2019:
This is a very annoying one. Basically, the built-in stdlib queue.Queue doesn't have any type-related information attached to it. So typeshed, which is included in mypy, provides its own queue.Queue for mypy to use that does support type information. It's all very silly, as it means that Queue[...] isn't valid syntax so I can't use the name: Type syntax.

drewbanin (Contributor) replied:

roger! Thanks for the explanation

beckjake (Contributor, Author):

> it looks like some types of request validation errors only show up in the poll method, but I would expect them to fail at request-time. One such example: submitting a run task without specifying a name. I'd expect the request to fail, but it's accepted and the failure is only apparent when we poll for the request ("handle_request() missing 1 required positional argument: 'name'"). Is this a minor change, or is there a good reason why it should work the way it's implemented here?

It's not a minor change really, as tasks define their arguments in handle_request, which is what the RPC server calls at runtime. To fix this, dbt rpc tasks would have to be implemented in two stages: one that takes the arguments and does some level of validation, and one that does the actual work.
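A minimal sketch of that two-stage shape (class and method names here are illustrative, not dbt's actual API):

```python
class RunTask:
    """Stage 1 validates at request time; stage 2 runs asynchronously later."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Stage 1: reject bad requests immediately, before queueing any work
        name = kwargs.get("name")
        if name is None:
            raise ValueError("missing required argument: 'name'")
        return cls(name)

    def handle_request(self):
        # Stage 2: the actual (possibly long-running) work
        return {"ran": self.name}
```

With this split, the server can call the validation stage synchronously and return an error response right away, instead of surfacing it later via poll.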

beckjake (Contributor, Author):

> the elapsed time in ps shows up as negative -- I see values like -14349.432432. Would that also impact garbage collection / task reaping?

Oops, this is a timezone bug! I think it would, yes.
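The usual shape of this bug class, sketched below: mixing naive local timestamps with UTC timestamps makes elapsed = now - started come out wrong (off by the UTC offset, hence values like -14349.43 seconds). Normalizing both sides to aware UTC avoids it. This is illustrative code, not the PR's actual fix.

```python
from datetime import datetime, timezone

def elapsed_seconds(started_at):
    """Elapsed wall time since `started_at`, robust to naive UTC timestamps."""
    now = datetime.now(timezone.utc)  # aware UTC "now"
    if started_at.tzinfo is None:
        # assume naive timestamps were recorded in UTC (e.g. via datetime.utcnow())
        started_at = started_at.replace(tzinfo=timezone.utc)
    return (now - started_at).total_seconds()
```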

beckjake (Contributor, Author) commented Sep 11, 2019:

> I'm seeing a 500 error when I hit the poll method for one particular project. I'm not sure exactly what's going wrong, but I can investigate further. This happens for run_project tasks, but not run tasks. I think this has something to do with control characters in the log lines.

I can't reproduce this issue. I made a project that emits the same warning by adding a dummy dbt_project.yml line, and then did dbt rpc, started the task with run_project, and hit the poll endpoint repeatedly. Is there more to this?

drewbanin (Contributor):

> To fix this, dbt rpc tasks would have to be implemented in two stages: one that takes the arguments and does some level of validation, and one that does the actual work

Ok! Let's do that in this PR

@beckjake beckjake force-pushed the feature/async-rpc branch 2 times, most recently from 17da8ac to af5e227 Compare September 11, 2019 17:25
drewbanin (Contributor) left a comment:

This LGTM once the tests are passing! The new validation code works great :)

Jacob Beck added 10 commits September 12, 2019 14:17:

  • Add __len__, __iter__, and __getitem__ to ExecutionResult/FreshnessResult (tests and some internal callers are now happy without modifying them)
  • Also disable RPC tests on windows (managing windows idiosyncrasies is just too much)
  • Pass werkzeug logging through
  • Suppress all non-dbt logs in the RPC response
  • Filter out all non-dbt/werkzeug logs below WARNING (dbt and dbt rpc)
  • Lower all remaining non-dbt/werkzeug logs to DEBUG (dbt and dbt rpc), attaching old_level in extras if we do
  • Handle some more protocol exceptions gracefully
  • If the rpc server gets a non-serializable response component, call str() on it
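The last commit above ("call str() on it") maps naturally onto json.dumps' `default` hook; a sketch of the idea (the function name is invented):

```python
import json
from datetime import datetime, timezone

def serialize_response(payload):
    """Serialize a response dict, falling back to str() for odd components."""
    # `default` is only consulted for objects json can't encode natively
    return json.dumps(payload, default=str)

payload = {"finished_at": datetime(2019, 9, 13, tzinfo=timezone.utc), "status": "success"}
encoded = serialize_response(payload)
```

Anything json already understands is untouched; only otherwise-unserializable values get stringified.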
@beckjake beckjake merged commit 5283002 into dev/louisa-may-alcott Sep 13, 2019
@beckjake beckjake deleted the feature/async-rpc branch September 13, 2019 13:58
Successfully merging this pull request may close: RPC Server: Make endpoints asynchronous (#1706)

2 participants