-
Notifications
You must be signed in to change notification settings - Fork 37
Conversation
143f644
to
bb7376a
Compare
Quick review notes:
|
I'm wondering if you wouldn't also want to generate a unique ID for each execution to tie start and end events together. It would probably simplify data analysis, but also help to find started but unfinished runs. |
Good idea! Can add it. |
75eae71
to
c32920a
Compare
@elshize would you mind taking this branch for a spin please? Otherwise I just need to add tests and this PR is good to go. |
@skrawcz I updated Hamilton version in my projet and ran my pipeline, worked fine, and got these messages:
|
e3ea2d9
to
4b2e1d6
Compare
`async` is a deprecated python library that hasn't been updated since 2014.
We do not want to show case `raw_execute` anywhere. People should be using `execute()` exclusively. So fixing this example to show how to get a dictionary back, rather than using `raw_execute`. Also added documentation to show you can instantiate it in a function, or outside in the module.
4b2e1d6
to
464956c
Compare
96f3af3
to
a33119f
Compare
a33119f
to
4316e01
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits but looks good
After this change, by default, when using Hamilton, it will collect anonymous usage data to help us improve Hamilton and know where to apply development efforts. We capture two events: one when a driver object is instantiated, and one when the `execute()` call on the driver completes. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to: * Operating System and Python version * A persistent UUID to indentify the session, stored in ~/.hamilton.conf. * Error stack trace limited to Hamilton code, if one occurs. * Information on what features you're using from Hamilton: decorators, adapters, result builders. * How Hamilton is being used: number of final nodes in DAG, number of modules, size of objects passed to `execute()`. If you do not wish to participate, one can opt-out with one of the following methods: 1. Set it to false programmatically in your code before creating a Hamilton driver: ```python from hamilton import telemetry telemetry.disable_telemetry() ``` 2. Set the key `telemetry_enabled` to `false` in ~/.hamilton.conf under the `DEFAULT` section: ``` [DEFAULT] telemetry_enabled = True ``` 3. Set HAMILTON_TELEMETRY_ENABLED=false as an environment variable. Either setting it for your shell session: ```bash export HAMILTON_TELEMETRY_ENABLED=false ``` or passing it as part of the run command: ```bash HAMILTON_TELEMETRY_ENABLED=false python NAME_OF_MY_DRIVER.py ``` Otherwise, this commit is a large one, it: * adds a telemetry.py that handles the schema, sending logic, and related logic for capturing telemetry about hamilton usage. Note: we stop capturing after 1000 checks for is_telemetry_enabled to handle the case someone is doing something in bulk; we likely don’t care too much pass 1000 invocation. It also creates a thread that sends the telemetry; this should work in all contexts. We did not want to pull in any other python dependences, so that’s why we’re using urllib. * makes the two Drivers (regular, and async) orchestrate the logic to capture telemetry. So we will only capture telemetry if people are using the standard drivers. Rather than instrumentation graph, I think driver is the better place for it, since that’s where all the context is. * we add some global state to capture decorator usage and expose it via the graph object. This felt like the most natural way to do it. * adds tests and adjusts things to ensure telemetry is disabled for unit tests/circleci. Note: the sanitize error test depends on paths -- so circleci is the best place to ensure it works. We should fix this if it becomes an issue. * adds documentation on how to opt-out. —— Former commits that are being squashed: Adds async adapter telemetry unit test To ensure that the changes to the async driver work as expected. (+12 squashed commits) Squashed commits: [4f25e41] Adds unit tests for telemetry addition This fixes up a few functions and refactors them to be more easily unit testable. It also ensures that by default, telemetry is disabled for unit tests and circleci. [36e5a7e] Fixing doc strings [b0d4c4d] Refactors decorator counter methodology Now it's a decorator on the __call__ function. That way we decouple the logic for telemetry needs -- without it explicitly living within the NodeTransformLifecycle class. I mean it's still coupled, it's just we can now change that functionality more clearly. [57e209b] Adjust telemetry documentation and functions In response to PR comments. Adds some helper functions to make them easier to unit test. I put them in `telemetry.py` because they're static, and only relevant for telemetry, so it didn't seem too bad to put there... [1bda6a6] Fixes up imports to enable running driver.py as a script Legacy requirement. Just propagating it. [6f0c7b0] Adds telemetry tracking ability to async driver The async driver needs to have special casing to ensure it can also emit telemetry in an async friendly way. So added it to handle sending constructor and execute tracking that should not impact, for example, running within a fastapi webserver. [0bac34a] Wraps sending telemetry request in own thread For performance reasons we should spawn a thread to ensure we don't slow down an app's performance. [cba7bc3] Simplifies sanitize_error logic Removes unnecessary code, and makes the variable names a little easier to follow. [34e574e] Wraps sanitize_error in try except Since we don't want this code to cause a cryptic error message for the end user, so we wrap it in a try except. [f1d44b9] Adds usage and data privacy section to main README So that people know what we're doing and how to opt-out of it. [5ac73a9] Fixes to adjust pending changes to main [2a84121] Refactors and adds functionality This commit will be squashed in to the final, but it does the following: 1. Hooks up posthog to capture telemetry. They have a free tier that should be sufficient for our needs. 2. Refactors code into functions to enable better testing (TODO). 3. Adds logic to sanitize an error. We don't pull the name, just where in the hamilton code it runs from. This should suffice in helping us understand where people are encountering errors. 4. Adds logic to not capture custom code with respect to decorators and adapters. 5. Adds three ways to disable telemetry and documents it in the module. (+1 squashed commit) Squashed commits: [bb7376a] WIP sketch of telemetry This is just a rough sketch. It shows one way we might implement things. I.e. have it all be in the driver. So if someone is using their own custom driver, we would not get telemetry. AFAIK most people use the current driver. TODO: - actually check whether telemetry gathering is enabled - hook it up to something like posthog - test, test, test
For 3.7+ we can make it no depend on a sleep. For 3.6 I believe the easiest is to make it sleep.
4316e01
to
7951aec
Compare
We also should capture driver function invocation. This will be useful to know what functionality from the driver is being utilized, in addition to execution. E.g. this will help shed light on whether people are using the DAG and tags together for example. Useful data to know whether we should enhance capabilities there.
Async test was breaking because of the extra telemetry invocation. Adding mock context manager ensures that it is disabled for that one invocation. Otherwise permanently fixing the sanitize error issue by regex replacing the line number, since that will be changing depending on the number of lines in telemetry.py and we don't want people to have to update this test because of that.
This adds telemetry capture to Hamilton.
After this change, by default, when using Hamilton, it will collect anonymous usage data to help us improve Hamilton and know where to apply development efforts.
We capture two events: one when a driver object is instantiated, and one when the
execute()
call on the driver completes.No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:
execute()
.If you do not wish to participate, one can opt-out with one of the following methods:
telemetry_enabled
tofalse
in ~/.hamilton.conf under theDEFAULT
section:export HAMILTON_TELEMETRY_ENABLED=false
Changes
How I tested this
Tested this locally, and added unit tests.
Notes
Checklist