-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heads-up on future of telemetry data layout in BigQuery #53
Comments
@jklukas Thanks for the notes, but I'm surprised by the way this is being communicated since this seems like a non-trivial project and the time-frame (3 weeks!) is a bit small to make sure things will continue to work. Is @rafrombrc and @jasonthomas aware about those changes?
Will those changes be communicated via the usual channels (e.g. fx-data-dev etc)? As maintainers of the Redash fork it's usually not our task to talk to query owners about upcoming data changes. Since your changes will have a user experience effect I would recommend to err on the side of over-communication to reduce the level of stress such subtle changes may introduce.
The preview functionality via tabledata.list isn't being used right now, I think. Can you clarify which API endpoints are considered "preview"? Here's the query runner being used: https://github.com/getredash/redash/blob/master/redash/query_runner/big_query.py FWIW we do have the ability to override functionality in an own query runner (in fact we're working on one in #38). Regarding the use of
FWIW, I'm trying to understand why there are separate views when only one is user-facing? Who is using the non-user-facing views? Can we prevent the non-user-facing views to appear in the API results by using permissions instead of having to hard-code schema information in a consumer of the views and doing the filtering on the client side? I'm asking since the implementation complexity to filter out specific tables from the schema browser isn't super high, but it's there and we've been burned with a non-trivial maintenance burden with Redash customizations. IOW I'm reluctant to implement something that can be easier be implemented on the API layer. E.g. do you expect STMO users to sometimes use non-user-facing views (_live, _stable and _derived)? |
@jezdez - I wanted to make sure we started a conversation about these things, but I hope I haven't induced too much panic. In general, we are expecting that these changes will be basically invisible to users, but I want to make sure we're not missing any details. The ideas here were advertised as proposals and shared with the overall data engineering team in the Monday meetings for feedback and discussion. @jasonthomas has been actively engaged in that feedback and @rafrombrc should also be at least marginally aware.
Indeed. And they'll be documented in docs.t.m.o. It's my expectation that we will keep derived-datasets and shared-prod projects compatible by duplicating views in both places, so users of Redash should never be aware of the data underneath moving. Redash itself shouldn't need any code changes in that sense; we will handle everything via backend views and then eventually by changing the BigQuery source configuration to point to the new location.
Yes, that point no longer applies for clustered tables. We have asked Google to update that documentation to make that distinction, and we have some relevant notes on query optimization in DTMO.
It's generally the underlying tables that are non-user-facing, and the SQL views are what we present to users.
There will probably be some cases where advanced users will want to query a _stable table for a particular version of a ping, or they might want to directly query _live tables for use cases that want to see near-realtime data rather than completed days. Now that I think more about it, the existing toggle in redash for hiding tables with a version suffix is likely sufficient here as well. The views we present in the unsuffixed datasets will mostly not have a version attached, but the tables in _live, _stable, and _derived will always have versions. So hiding tables/views with a version suffix will effectively hide everything in _live, _stable, and _derived. |
Looking at STMO now, I'm not seeing anything that looks like the "preview" that's available in BigQuery. I think I am misremembering a conversation about that functionality being added to STMO. I believe you are correct that it's the tabledata.list API that populates the "Preview" section of the BigQuery UI. So, to get back to your original concern:
I don't think I see anything that needs to change in STMO code to support these backend changes, and I apologize again for causing concern. I do want to better socialize the upcoming changes to see if there are more thoughts about how STMO can provide an even better experience for users interacting with BigQuery. |
Hey there... mostly I think you're right, @jklukas, that the preview functionality isn't in use. However, @emtwo has been working on code to generate sample data for each field to show up in the schema drawer, and I can't recall whether that was making use of the preview functionality. Marina, can you let us know if you think these changes will impact your data sample work? |
Unfortunately, we do use It's used to gather samples for tables, and I confirmed that we don't get a response for views. Instead we get this error Since we can use |
Yes, issuing a query like |
For doc days today, I put together a draft of dtmo updates to reflect the upcoming prod structure and give some more guidance about how to navigate the layout: mozilla/data-docs#302 That's a formalization of what wanted to convey in the original issue here. |
@jklukas since the deployment of this change: mozilla/redash@32c910a we now fallback to a One such reason is due to attempting to list a view, as I mentioned above. I have found a couple of views, e.g. There are, however, several views that still don't have data samples. This could potentially be caused by a couple of things:
|
Great point; most of these tables will need a filter on A complete solution might look like this:
|
Thanks for pointing out the I am not super familiar with the data though, and I wonder if filtering on the most recent date (yesterday) would be sufficient for all tables? Are there any old tables that have stopped being updated but we might still need samples for? What do you think @jklukas? |
It should be a minority of tables that aren't getting new data daily, but your are definitely right that this is an edge case that we'll need to handle. Because we're doing a LIMIT 1 here, we can be pretty loose about the partition filter. The following would still be efficient:
So that query effectively sidesteps the I was tempted to recommend grabbing the most recent row by putting an If we decided it was desirable to get a recent record, we could list partitions in the table in order to be sure that we're choosing the most recent non-empty date. |
Cool! I like your idea with Thank you! |
I wanted to summarize implications from recent and upcoming changes to the layout of telemetry data in BQ as it relates to STMO.
First, we are starting to migrate all data into the moz-fx-data-shared-prod project from the moz-fx-data-derived-datasets project. shared-prod will be the combined home for both incoming data from the telemetry pipeline and for derived datasets. As we transition derived data, we will plan to present views of that data in both derived-datasets and in shared-prod, so that STMO users can still access the tables as they move. Eventually (by end of August?), we should be able to switch STMO's BigQuery sources to point to shared-prod with no change to queries.
Second, we are planning to present all user-facing data as views rather than having users hit tables directly. This gives a number of benefits in terms of allowing reconfiguration of backend tables without changing the user interface and conversely allowing us to evolve the user interface without having to rewrite backend tables. This will mostly be transparent to users, since we are replacing tables with views that have an identical schema. One issue we're aware of, however, is that views don't have the same preview functionality as tables (see #52). As we move all tables to being clustered, however, it is now efficient to run
SELECT * LIMIT 10
-type queries to investigate content; this may be sufficient that loss of table preview is not terrible.Third, the overall data layout is becoming more complex. For each document namespace, we now have four separate BigQuery datasets. For telemetry, there are tables in telemetry_live, telemetry_stable, and telemetry_derived; then there are user-facing views in
telemetry
(unsuffixed). We will be producing documentation to explain the full function of each of these datasets, but for the vast majority of use cases, we will be expecting users to query only the views presented in the unsuffixed datasets (primarilytelemetry
). It may be useful to have an option in STMO to hide all suffixed datasets (those ending in _live, _stable, or _derived) so that users see only the views that are intended for end-user queries.The text was updated successfully, but these errors were encountered: