Adds an ArrayDirectory class to manage all URIs within the array directory #2909

Merged
Shelnutt2 merged 1 commit into dev from sp/array_directory
Feb 25, 2022

Conversation

stavrospapadopoulos
Member

This PR adds an ArrayDirectory class to manage all URIs within the array directory. This introduces several performance improvements, especially by removing redundant URI listings, parallelizing URI listings, etc. It also paves the way for better format versioning, especially when we need to shuffle files around in the array directory for better performance in the future (there is an upcoming PR for that).
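
To illustrate the "removing redundant URI listings" idea, here is a minimal sketch, not the code in this PR: the directory is listed once at construction, and later queries are answered from the cached result. `CachedDirectory`, `Lister`, `with_prefix` and the entry names are all hypothetical and only stand in for the real class and storage calls.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one storage listing call
// (a single round trip to a filesystem or object store).
using Lister = std::function<std::vector<std::string>(const std::string&)>;

// Hypothetical class: lists the directory once and serves
// every later query from the cached entries.
class CachedDirectory {
 public:
  CachedDirectory(std::string root, const Lister& ls)
      : root_(std::move(root))
      , entries_(ls(root_)) {  // the only listing call
  }

  const std::string& uri() const {
    return root_;
  }

  // Filters the cached listing; issues no further storage calls.
  std::vector<std::string> with_prefix(const std::string& prefix) const {
    std::vector<std::string> out;
    for (const auto& name : entries_) {
      if (name.compare(0, prefix.size(), prefix) == 0)
        out.push_back(name);
    }
    return out;
  }

 private:
  std::string root_;                  // declared first, so it is initialized first
  std::vector<std::string> entries_;  // snapshot of the single listing
};
```

Callers that previously listed the same directory separately for fragments, schemas and metadata would, under this assumption, share one snapshot instead.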

Notes:

  • The PR makes VFS::ls a noop for POSIX and HDFS when the listed directory does not exist instead of throwing an error, matching the functionality of the object stores.
  • The PR removes partial vacuuming, as that leads to incorrect behavior with time traveling. Vacuuming will be refactored soon in an upcoming PR as well, so there should be no issues here.
  • The PR adds a unit test file for ArrayDirectory, but it does not contain any unit tests yet. At the moment the class mostly incorporates existing code (moved from StorageManager and optimized), so if anything is wrong with it, the existing tests will break (as it affects loading fragments, schemas, metadata, etc). Moreover, this class will be enhanced in an upcoming PR that will move all URI creation from the writer and array schema classes into ArrayDirectory. Therefore, we will add proper unit tests in that PR.
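
The `VFS::ls` change in the first note can be sketched roughly as follows. This is an illustration built on `std::filesystem`, not the TileDB implementation; `ls_or_empty` is a hypothetical name. The point is only that a missing directory yields an empty listing rather than an error, which is what a prefix listing on an object store would return.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper (not VFS::ls): lists the entries directly under `dir`.
// A missing directory is treated as a no-op and yields an empty result.
std::vector<std::string> ls_or_empty(const fs::path& dir) {
  std::vector<std::string> uris;
  std::error_code ec;

  // Non-throwing check: false for missing paths and for non-directories.
  if (!fs::is_directory(dir, ec))
    return uris;

  // Non-throwing iteration: stop quietly if the directory becomes unreadable.
  fs::directory_iterator it(dir, ec);
  const fs::directory_iterator end;
  while (!ec && it != end) {
    uris.push_back(it->path().string());
    it.increment(ec);
  }

  std::sort(uris.begin(), uris.end());  // deterministic order for callers
  return uris;
}
```

With this behavior, callers can list optional subdirectories unconditionally instead of checking for existence first.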

TYPE: IMPROVEMENT
DESC: Adds an ArrayDirectory class to manage all URIs within the array directory.

…tory. This introduces several performance improvements, especially around redundant URI listings, parallelizing URI listings, etc. Also makes VFS::ls a noop for POSIX and HDFS when the listed directory does not exist instead of throwing an error, matching the functionality of the object stores. Finally, it removes partial vacuuming, as that leads to incorrect behavior with time traveling.
/* API */
/* ********************************* */

const URI& ArrayDirectory::array_uri() const {
Contributor

Potentially rename to uri()

@Shelnutt2 Shelnutt2 merged commit d575c8a into dev Feb 25, 2022
@Shelnutt2 Shelnutt2 deleted the sp/array_directory branch February 25, 2022 12:29