Make MAXIMUM_SEED_SIZE_MIB configurable #7125
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR. CLA has not been signed by users: @noppaz
da6c118 to 880b035
I added a PR against your PR for changes that I'd recommend. Added change for compatibility and merging our changes.
* Make MAXIMUM_SEED_SIZE configurable
* Updated MAXIMUM_SEED_SIZE_NAME
* Added changie
* Update core/dbt/constants.py
  Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
* Adding suggested change
* Fixed comparison
* Added comment
* Added constants change and removed changie
---------
Co-authored-by: Doug Beatty <44704949+dbeatty10@users.noreply.github.com>
This looks good to me!
Thanks for the PR @noppaz! A few comments after a first read through the code.
core/dbt/constants.py (outdated)
MAXIMUM_SEED_SIZE_NAME = "1MB"

def get_max_seed_size():
    mx = os.getenv("DBT_MAXIMUM_SEED_SIZE", "1")
I'd prefer that this config's name include the unit: DBT_MAXIMUM_SEED_SIZE_MIB
Rather than accessing an env var directly here, this should really be implemented as a "global config" (available in our Flags module). I don't think it would make very much sense to pass as a CLI flag (--maximum-seed-size), but I could very much see someone wanting to set this in "user config":
# profiles.yml -- for now
config:
  maximum_seed_size_mb: 5
That means:
- Implement it as a param (= CLI flag + env var)
- Add it to UserConfig
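A rough sketch of how such a setting might be resolved from the three sources mentioned above. This is a standalone illustration, not dbt's Flags module: `resolve_max_seed_size_mib`, its parameters, and the precedence order (CLI value, then env var, then user config, then default) are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_SEED_SIZE_MIB = 1  # the 1 MiB default discussed in this PR


def resolve_max_seed_size_mib(
    cli_value: Optional[int] = None,
    environ: Mapping[str, str] = os.environ,
    user_config: Optional[Mapping[str, object]] = None,
) -> int:
    """Hypothetical resolver; the precedence order here is an assumption."""
    # 1. An explicit CLI value wins if one was passed.
    if cli_value is not None:
        return int(cli_value)
    # 2. Otherwise fall back to the environment variable.
    env_value = environ.get("DBT_MAXIMUM_SEED_SIZE_MIB")
    if env_value is not None:
        return int(env_value)
    # 3. Otherwise use the "config:" block from profiles.yml, if present.
    if user_config and "maximum_seed_size_mb" in user_config:
        return int(user_config["maximum_seed_size_mb"])
    # 4. Nothing set: keep the existing default.
    return DEFAULT_MAX_SEED_SIZE_MIB
```

For example, `resolve_max_seed_size_mib(environ={}, user_config={"maximum_seed_size_mb": 5})` would pick up the value from the user config.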
I have addressed this in fa836bb. It required some additional moving of the usage of the global flag as it is not a constant anymore and the constants.py file is referenced much earlier in the import cycle.
core/dbt/constants.py (outdated)
DEFAULT_MAXIMUM_SEED_SIZE = 1 * 1024 * 1024
MAXIMUM_SEED_SIZE = get_max_seed_size() * DEFAULT_MAXIMUM_SEED_SIZE
I don't think we need both of these constants defined. MAXIMUM_SEED_SIZE should just be get_max_seed_size() * 1024 * 1024. The get_max_seed_size() function will return the default value of 1 if the user does not set/override.
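A minimal sketch of the single-constant version suggested here, assuming the env var is renamed to DBT_MAXIMUM_SEED_SIZE_MIB as requested earlier in the review. It only illustrates the arithmetic and is not the code that ended up in the branch.

```python
import os


def get_max_seed_size() -> int:
    """Return the limit in MiB; falls back to 1 when the variable is unset."""
    return int(os.getenv("DBT_MAXIMUM_SEED_SIZE_MIB", "1"))


# Single constant in bytes, with no separate DEFAULT_MAXIMUM_SEED_SIZE.
MAXIMUM_SEED_SIZE = get_max_seed_size() * 1024 * 1024
```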
Thank you for checking @jtcohen6!
One reason for keeping DEFAULT_MAXIMUM_SEED_SIZE would be to preserve the logic of load_seed_source_file() in read_files.py. Since the new incremental logic does not convert the file from bytes, we opted to keep computing the checksum with the "old" method for files smaller than 1 MiB to ensure backward compatibility with current manifests. The checksum of a file encoded as anything other than UTF-8 differs depending on whether it gets UTF-8 encoded or is read as raw bytes.
The reason for not converting from bytes to a UTF-8 string is performance and somewhat cleaner code, but if we want to remove DEFAULT_MAXIMUM_SEED_SIZE, maybe we should remove the elif and only compute the seed file hash incrementally instead. I see two options then:
- Read the file as a string and ensure UTF-8 encoding before hashing
- Continue reading the raw bytes and accept that some projects with seed files that are not UTF-8 might see a diff in their manifest deferral when upgrading to this version of dbt-core.
What are your thoughts on this?
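To illustrate the checksum difference behind the two options, here is a small standalone example. SHA-256 and the Latin-1 sample content are chosen for illustration only, and the decode step is a stand-in for however the file would be converted to a string; none of this is taken from the dbt-core code.

```python
import hashlib

# A seed-like file stored on disk as Latin-1 rather than UTF-8.
text = "id,name\n1,café\n"
raw_bytes = text.encode("latin-1")

# Option 1 style: decode to a string, then hash its UTF-8 encoding.
checksum_utf8 = hashlib.sha256(raw_bytes.decode("latin-1").encode("utf-8")).hexdigest()

# Option 2 style: hash exactly the bytes that are on disk.
checksum_raw = hashlib.sha256(raw_bytes).hexdigest()

# For pure-ASCII content both approaches hash identical bytes.
ascii_bytes = b"id,name\n1,cafe\n"
ascii_same = (
    hashlib.sha256(ascii_bytes).hexdigest()
    == hashlib.sha256(ascii_bytes.decode("ascii").encode("utf-8")).hexdigest()
)
```

Here `checksum_utf8` and `checksum_raw` differ because "é" is one byte in Latin-1 and two bytes in UTF-8, while `ascii_same` is true.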
I've implemented this suggestion with removal of DEFAULT_MAXIMUM_SEED_SIZE and went with alternative 1 above. The changes can be seen in 53021d9.
Will close and re-open to try to trigger the CI checks.
@dbeatty10 thanks for the tests. I've made adjustments and I'm able to get them passing locally now, so I think another round of tests and eyes on the code would be the next step.
I see that the tests are failing on Windows only now. I think the issue is related to this comment. However, this sheds some light on something that seems odd to me: the expected checksum is different on Windows and Linux/macOS, as can be seen in the tests from c2e2958. Should we not generate the same checksum so manifests can be correctly compared across platforms?
@noppaz @dbeatty10 Currently our CI/CD doesn't work because we have one seed bigger than 1MiB, so we would need this. Is there any intention of finishing it up? Thank you very much!
Hey @JavierLopezT, with no intention of sounding bitter, I gave up on this work due to low interest from dbt Labs. Would be happy to finish it up if they'd be up for it. The problem remains for myself as well.
resolves #7117
resolves #7124
Description
Increasing the maximum size of seed files whose content is hashed for state comparison enables greater use of deferred runs when seed file contents change. Making the limit configurable through an environment variable lets users override the default of 1 MiB.
Furthermore, to support reading larger files in memory-constrained environments, a new method reads the file and computes its checksum incrementally. Because this is performed on raw bytes for better performance, we continue to use the previous UTF-8 content method for small files so that existing state is not invalidated for seed files that are not stored as UTF-8.
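The incremental approach could look roughly like the sketch below. It is not the method added in this PR: `hash_stream_incrementally`, the chunk size, and the use of SHA-256 are assumptions for illustration. The point is that only one chunk is held in memory at a time, and the result matches hashing the whole content in one call.

```python
import hashlib
import io
from typing import BinaryIO


def hash_stream_incrementally(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical helper: hash a binary stream chunk by chunk."""
    digest = hashlib.sha256()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage with an in-memory stream; a real seed file would be opened with open(path, "rb").
data = b"id,value\n" + b"1,abc\n" * 1000
incremental = hash_stream_incrementally(io.BytesIO(data), chunk_size=64)
```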
Checklist
changie new to create a changelog entry