Support multiple zone directories in single bucket, additional PubSub mechanism to provide near-realtime file detection and pipelining #85
base: master
Conversation
Hey @gpacuilla, thanks for the contribution! Can you please remove all of the semi-colons added by your linter and make a new commit?
On my initial review, the PR you've provided seems too specific to your use case, and generates additional complexity for the majority of deployments. That said, reading your PR comments, I do agree that detecting subdirectories is an unmet need in the current integration. However, I strongly disagree that using a deadline time is "janky", as it is a computationally efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone. Why do you feel it is a better solution to buffer every filename into a PubSub queue? It seems to me that a better objective would be to find a way to query all relevant directories in the bucket.
Hey, thanks for taking a look! Apologies if the language used caused offense, which I did not intend. We've used this project for some time with great success, and we really do appreciate all the work done here.

At first I did try getting a list of the "prefixes" (sub-directories) of the bucket. That worked OK and could return an array of the subdirs, so it does seem possible to iterate over the subdirs with the same methodology. However, to clarify what I meant by "janky": the current code seems inflexible and makes some assumptions which I think could be wrong in certain situations.

I do agree that this adds additional complexity compared to the current approach. Granted, all of this talk of backfilling data is not perfect, especially when using a partitioned-by-ingestion-time BigQuery table. I hope my reasoning here makes sense to you, and I appreciate your further consideration.
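The prefix-listing approach mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the Node.js `@google-cloud/storage` client; the function names and bucket layout are illustrative, not the repository's actual code:

```javascript
// Sketch: list the top-level "subdirectories" (prefixes) of a GCS bucket.
// With a delimiter set and auto-pagination disabled, the raw API response
// carries the common prefixes alongside the matched files.
async function listZonePrefixes(bucketName) {
  // lazy require so the pure helper below stays testable without GCP deps
  const { Storage } = require('@google-cloud/storage');
  const bucket = new Storage().bucket(bucketName);
  const [, , apiResponse] = await bucket.getFiles({
    delimiter: '/',
    autoPaginate: false,
  });
  return apiResponse.prefixes || [];
}

// Pure helper (hypothetical): turn prefixes like "zone-a.example.com/"
// into bare zone directory names.
function prefixesToZones(prefixes) {
  return prefixes.map((p) => p.replace(/\/$/, ''));
}

module.exports = { listZonePrefixes, prefixesToZones };
```

The returned prefixes could then be iterated with the existing per-directory load logic.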
> However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.

My two cents: since there was no documentation of why the deadline time was used the way it was, it did seem "janky" to me, in that it would have taken a lot of digging into how all the Google libraries work to reverse-engineer why it was done that way. I built a bit of a subdirectory crawler myself before I saw this PR, but I'd certainly say the additional architectural complexity would be justified for me, because a queue is easier to follow than the deadline-time logic without additional documentation.
I acknowledge that the rationale was not documented. In addition, I agree there are some architectural advantages to using a queue instead of a scheduler. However, this was the solution we settled on, and we don't have the resources available to consider rearchitecting it right now. You are welcome to continue working on or using a fork, but I don't want to make any assurances that we will integrate this proposal.
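For readers following the deadline-time discussion, the windowed selection being debated can be sketched roughly as follows. Every name and the windowing rule here are assumptions for illustration, not the repository's actual implementation:

```javascript
// Hypothetical sketch of deadline-window selection: on each scheduled run,
// only files created within (lastDeadline, deadline] are loaded. Each file
// is therefore picked up by exactly one run, and each run issues at most
// one BigQuery load job, which is how a fixed schedule can bound load-job
// quota usage while still covering every file.
function filesInWindow(files, lastDeadline, deadline) {
  return files.filter((f) => {
    const created = Date.parse(f.timeCreated);
    return created > lastDeadline && created <= deadline;
  });
}

module.exports = { filesInWindow };
```

The trade-off against a PubSub queue is latency (up to one scheduling interval) versus the extra topic, subscription, and function to operate.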
We had also recently run into the issue that was fixed in #70. However, the refactoring of the code as part of that broke our particular setup, which is: multiple Cloudflare zones writing logs into separate subdirectories of a single bucket.

Given the refactoring in #70, we would have had to either split the logs into different buckets or deploy multiple functions per subdirectory. However, doing either of those things would have re-introduced the BigQuery quota issue, since we have enough zones that we would have exceeded the limit even on a per-minute, per-zone schedule.
What I ended up doing here was re-introducing the "old" methodology of having a Cloud Function triggered by the GCS storage bucket `finalize` event, but we instead dump the detected filename into an additional PubSub topic so we can then separately pull all the newly detected files and load them into BigQuery immediately. This also removes the (in my opinion, janky) `deadlineDate`/`deadlineDt` logic that was also introduced.
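The finalize-triggered publisher described above might look like this minimal sketch. It assumes the Node.js `@google-cloud/pubsub` client and the background-function signature for GCS events; the topic name comes from this PR, everything else is illustrative:

```javascript
// Sketch of a GCS-triggered Cloud Function that publishes each newly
// finalized object's name to a PubSub topic, so a separate consumer can
// batch-load the files into BigQuery.
const TOPIC = 'cf_logs_files'; // topic name from this PR

// Pure helper (hypothetical): build the message payload for one object.
function buildMessage(file) {
  return { bucket: file.bucket, name: file.name };
}

// Background-function entrypoint: `file` is the GCS object metadata
// delivered with the `finalize` event.
async function runNewFiles(file) {
  // lazy require so buildMessage stays testable offline
  const { PubSub } = require('@google-cloud/pubsub');
  const topic = new PubSub().topic(TOPIC);
  await topic.publishMessage({ json: buildMessage(file) });
  console.log(`queued ${file.name} for loading`);
}

module.exports = { buildMessage, runNewFiles };
```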
This code may not be perfect for all use cases, but I wanted to open this PR as an example for others; perhaps with some tweaks we could make this the standard setup and merge it into master?

Overall, I think the "lookback" deadline-time logic is not ideal, and a more realtime "pipeline" using separate Cloud Functions and PubSub seems more logical to me.
Basic overview of differences:

- Two PubSub topics: `cf_logs_files` and `every_minute`
- New `cf_logs_file` handling: a `runNewFiles` entrypoint which publishes to the `cf_logs_files` topic, triggered on-demand by the GCS object `finalize` event
- `runLoadJob` entrypoint which is triggered by the `every_minute` PubSub topic (same as before), but also now subscribes to and consumes `cf_logs_files`, getting a list of file names, subsequently mapping them to `Storage.File` objects, and doing batch loads via `BigQuery.createLoadJob`
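The consumer side of that flow might be sketched like this, assuming the Node.js `@google-cloud/storage` and `@google-cloud/bigquery` clients; the bucket, dataset, table, and batch-size values are placeholders, not the PR's actual configuration:

```javascript
// Pure helper (hypothetical): split file names into batches so that one
// load job per batch keeps us under BigQuery's load-job quota.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Sketch: map queued file names to Storage.File objects and batch-load
// them into BigQuery via createLoadJob.
async function runLoadJob(fileNames) {
  // lazy requires so the chunk helper stays testable offline
  const { Storage } = require('@google-cloud/storage');
  const { BigQuery } = require('@google-cloud/bigquery');
  const bucket = new Storage().bucket('my-logs-bucket'); // placeholder
  const table = new BigQuery().dataset('cloudflare').table('logs'); // placeholder
  for (const batch of chunk(fileNames, 1000)) {
    const files = batch.map((name) => bucket.file(name));
    // createLoadJob accepts Storage.File objects as sources
    const [job] = await table.createLoadJob(files, {
      sourceFormat: 'NEWLINE_DELIMITED_JSON',
    });
    await job.promise(); // wait for the load job to complete
  }
}

module.exports = { chunk, runLoadJob };
```

Batching many files into a single load job is what lets the queue-based design stay within the same BigQuery quota limits as the scheduled approach.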
Also:

- Removed the `writeLog` function, as it seemed unnecessarily redundant given normal Cloud Function stdout/stderr console logging