This repository has been archived by the owner on Jul 3, 2023. It is now read-only.
Adds capturing telemetry to Hamilton
After this change, Hamilton collects anonymous usage data by default, to help us improve Hamilton and know where to apply development efforts.

We capture two events: one when a driver object is instantiated, and one when the `execute()` call on the driver completes. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:

* Operating system and Python version.
* A persistent UUID to identify the session, stored in ~/.hamilton.conf.
* Error stack trace limited to Hamilton code, if one occurs.
* Information on what features you're using from Hamilton: decorators, adapters, result builders.
* How Hamilton is being used: number of final nodes in DAG, number of modules, size of objects passed to `execute()`.

If you do not wish to participate, you can opt out with one of the following methods:

1. Set it to false programmatically in your code before creating a Hamilton driver:
    ```python
    from hamilton import telemetry
    telemetry.disable_telemetry()
    ```
2. Set the key `telemetry_enabled` to `false` in ~/.hamilton.conf under the `DEFAULT` section:
    ```
    [DEFAULT]
    telemetry_enabled = False
    ```
3. Set HAMILTON_TELEMETRY_ENABLED=false as an environment variable. Either set it for your shell session:
    ```bash
    export HAMILTON_TELEMETRY_ENABLED=false
    ```
    or pass it as part of the run command:
    ```bash
    HAMILTON_TELEMETRY_ENABLED=false python NAME_OF_MY_DRIVER.py
    ```

Otherwise, this commit is a large one. It:

* adds a telemetry.py that handles the schema, sending logic, and related logic for capturing telemetry about Hamilton usage. Note: we stop capturing after 1000 checks of is_telemetry_enabled to handle the case where someone is doing something in bulk; we likely don’t care too much past 1000 invocations. It also creates a thread that sends the telemetry; this should work in all contexts. We did not want to pull in any other Python dependencies, which is why we’re using urllib.
* makes the two drivers (regular and async) orchestrate the logic to capture telemetry, so we will only capture telemetry if people are using the standard drivers. Rather than instrumenting the graph, I think the driver is the better place for it, since that’s where all the context is.
* adds some global state to capture decorator usage and exposes it via the graph object. This felt like the most natural way to do it.
* adds tests and adjusts things to ensure telemetry is disabled for unit tests/circleci. Note: the sanitize error test depends on paths, so circleci is the best place to ensure it works. We should fix this if it becomes an issue.
* adds documentation on how to opt out.

---

Former commits that are being squashed:

Adds async adapter telemetry unit test

To ensure that the changes to the async driver work as expected.

(+12 squashed commits)
Squashed commits:

[4f25e41] Adds unit tests for telemetry addition

This fixes up a few functions and refactors them to be more easily unit testable. It also ensures that by default, telemetry is disabled for unit tests and circleci.

[36e5a7e] Fixing doc strings

[b0d4c4d] Refactors decorator counter methodology

Now it's a decorator on the `__call__` function. That way we decouple the logic for telemetry needs without it explicitly living within the NodeTransformLifecycle class. It is still coupled, but we can now change that functionality more clearly.

[57e209b] Adjust telemetry documentation and functions

In response to PR comments. Adds some helper functions to make them easier to unit test. I put them in `telemetry.py` because they're static and only relevant for telemetry, so it didn't seem too bad to put them there.

[1bda6a6] Fixes up imports to enable running driver.py as a script

Legacy requirement. Just propagating it.

[6f0c7b0] Adds telemetry tracking ability to async driver

The async driver needs special casing to ensure it can also emit telemetry in an async-friendly way.
So this adds handling for sending constructor and execute tracking in a way that should not impact, for example, running within a FastAPI web server.

[0bac34a] Wraps sending telemetry request in own thread

For performance reasons we should spawn a thread to ensure we don't slow down an app's performance.

[cba7bc3] Simplifies sanitize_error logic

Removes unnecessary code, and makes the variable names a little easier to follow.

[34e574e] Wraps sanitize_error in try except

We don't want this code to cause a cryptic error message for the end user, so we wrap it in a try except.

[f1d44b9] Adds usage and data privacy section to main README

So that people know what we're doing and how to opt out of it.

[5ac73a9] Fixes to adjust pending changes to main

[2a84121] Refactors and adds functionality

This commit will be squashed into the final one, but it does the following:

1. Hooks up posthog to capture telemetry. They have a free tier that should be sufficient for our needs.
2. Refactors code into functions to enable better testing (TODO).
3. Adds logic to sanitize an error. We don't pull the name, just where in the Hamilton code it runs from. This should suffice in helping us understand where people are encountering errors.
4. Adds logic to not capture custom code with respect to decorators and adapters.
5. Adds three ways to disable telemetry and documents them in the module.

(+1 squashed commit)
Squashed commits:

[bb7376a] WIP sketch of telemetry

This is just a rough sketch. It shows one way we might implement things, i.e. have it all be in the driver. So if someone is using their own custom driver, we would not get telemetry. AFAIK most people use the current driver.

TODO:
- actually check whether telemetry gathering is enabled
- hook it up to something like posthog
- test, test, test
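
For readers who want a feel for how the three opt-out methods and the 1000-check cap described in this message could fit together, here is a minimal sketch. It is not the code in `telemetry.py`: the names `resolve_telemetry_enabled`, `TelemetryGate`, and `MAX_CHECKS` are hypothetical, and both the precedence (environment variable before config file) and the exact cap boundary are assumptions made for illustration. Only the environment variable name, the config key, and the `DEFAULT` section come from the message itself.

```python
import configparser
import os

# Names taken from the commit message.
ENV_VAR = "HAMILTON_TELEMETRY_ENABLED"
CONFIG_KEY = "telemetry_enabled"
# Hypothetical constant for the "stop after 1000 checks" rule.
MAX_CHECKS = 1000


def resolve_telemetry_enabled(config_path, environ=None):
    """Return False if the env var or the config file opts out, else True.

    Assumption: the environment variable is consulted before the config file.
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get(ENV_VAR)
    if env_value is not None:
        return env_value.strip().lower() != "false"
    parser = configparser.ConfigParser()
    try:
        parser.read(config_path)  # a missing file is silently skipped
        # The DEFAULT section always exists; a missing key means "enabled".
        return parser.getboolean("DEFAULT", CONFIG_KEY, fallback=True)
    except (ValueError, configparser.Error):
        # Unparseable file or value: fall back to the default (enabled).
        return True


class TelemetryGate:
    """Hypothetical holder for the programmatic switch and the check cap."""

    def __init__(self, config_path, environ=None):
        self._enabled = resolve_telemetry_enabled(config_path, environ)
        self._checks = 0

    def disable(self):
        # Mirrors the idea of a programmatic opt-out call.
        self._enabled = False

    def is_enabled(self):
        # Assumption: the first MAX_CHECKS checks pass through, later ones
        # always report False so bulk usage stops emitting events.
        self._checks += 1
        if self._checks > MAX_CHECKS:
            return False
        return self._enabled


if __name__ == "__main__":
    gate = TelemetryGate(os.path.expanduser("~/.hamilton.conf"))
    print("telemetry enabled:", gate.is_enabled())
```

Passing `environ` and `config_path` explicitly keeps the sketch easy to unit test without touching the real shell environment or home directory, which is in the spirit of the refactoring for testability mentioned above.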