Missing glob support when reading files #223
Comments
@mtsargent you are not supposed to access part files directly. This is general Hadoop ecosystem usage: part files are internal files created by distributed tasks. You should never access them directly; instead, use the parent path ("cos://some-bucket.myCos/somefile.csv"), and then the globber is supported, of course.
Fair point about part files, but would you anticipate the stocator globber to work with non-part files? Suppose I try to use this to read in multiple files:
Would you expect this to read in all of the following from my COS bucket?
While also ignoring other files. Example:
I suppose I can just set up this scenario and test it out.
@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.
@mtsargent However, it's not clear how ranges in [x-y] should work; it is important to know whether they are numeric or literal. For example, with [aaxy-xyba], what would you expect to get? There might be thousands of objects, so how would they be identified? Or do you need only numeric ranges, so that [1-100] would be 1, 2, ..., 99, 100?
I think each expression in brackets corresponds to only a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html. At the very least, I can set up this test the next time I am at my work computer. I will update this issue either way (and can close it if matching works as expected).
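To illustrate the single-character reading of bracket expressions, here is a minimal Java sketch that uses the JDK's `java.nio.file.PathMatcher` with the `glob:` syntax. This is not Stocator code: the class name, the `globMatches` helper, and the file names are hypothetical, and it assumes the names are valid single-segment paths on the default filesystem.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical sketch: shows that one bracket expression matches one character.
class BracketGlobSketch {

    // Returns true if the whole name matches the glob pattern.
    static boolean globMatches(String pattern, String name) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // [1-3] stands for exactly one character in the range '1'..'3'.
        System.out.println(globMatches("file[1-3].csv", "file2.csv"));
        // Two digits need two bracket expressions (or another construct).
        System.out.println(globMatches("file[1-3].csv", "file22.csv"));
        System.out.println(globMatches("file[0-9][0-9].csv", "file22.csv"));
        // Braces list whole alternatives, which can cover multi-digit names.
        System.out.println(globMatches("file{1,2,10}.csv", "file10.csv"));
    }
}
```

Under this reading, a numeric range such as 1 to 100 would have to be written with several bracket expressions or a brace list rather than a single `[1-100]`.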
@mtsargent Thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least, I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java Will you be able to extend the code to support [] as well? It would be great if you could work on it.
This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.
When reading multiple files at once with Spark, I would expect to use wildcards/other general glob patterns (similar to the answer https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343). Example repeated here for simplicity:
When using Stocator, attempting to read files in this way fails:
val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")
This failure happens even when there are files I would expect to match that pattern like:
The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class:
stocator/src/main/java/com/ibm/stocator/fs/common/ObjectStoreFlatGlobFilter.java
Lines 128 to 134 in c18f37b
The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.
The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:
I am not familiar enough with the rest of the Stocator codebase to know if adding in this type of matching breaks other parts of the code drastically.
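As a rough illustration of the java.nio idea mentioned above, the following Java sketch filters a list of object names with a `PathMatcher` built from a glob pattern. It is only a sketch under stated assumptions, not the code from this issue and not Stocator's implementation: the class name, the `filterByGlob` helper, and the sample object names are hypothetical. Note that in java.nio globs `*` does not cross `/` boundaries, so full object keys containing `/` would need extra care.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: glob filtering of object names via java.nio.
class ObjectNameGlobFilterSketch {

    // Keeps only the names that match the glob pattern in full.
    static List<String> filterByGlob(String pattern, List<String> names) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return names.stream()
            .filter(name -> matcher.matches(Paths.get(name)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up names standing in for objects under somefile.csv/
        List<String> names = Arrays.asList(
            "part-00000-abc.csv",
            "part-00001-abc.csv",
            "part-00002-abc.csv",
            "_SUCCESS");
        // [0-1] matches a single character, '0' or '1'; * matches the rest.
        System.out.println(filterByGlob("part-0000[0-1]*", names));
    }
}
```

Whether a matcher like this could replace the existing wildcard check in ObjectStoreFlatGlobFilter without side effects is an open question that would need testing against the rest of the codebase.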