Missing glob support when reading files #223

Open
mtsargent opened this issue Oct 25, 2019 · 7 comments

Comments

@mtsargent

When reading multiple files at once with Spark, I would expect to be able to use wildcards and other general glob patterns, similar to this answer: https://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd/24036343. The example is repeated here for convenience:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

When using Stocator, attempting to read files in this way fails:
val junkcsv = spark.sqlContext.read.option("header", "true").load("cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*")

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*;

This failure happens even when there are files I would expect to match that pattern like:

cos://some-bucket.myCos/somefile.csv/part-00000.csv
cos://some-bucket.myCos/somefile.csv/part-00001.csv

The lack of glob support seems to be coming from the ObjectStoreFlatGlobFilter class:

if (name != null && name.startsWith("part-")) {
  LOG.trace("accept on parent {}, path pattern {}",
      path.getParent().toString(), pathPattern);
  match = FilenameUtils.wildcardMatch(path.getParent().toString() + "/", pathPattern);
} else {
  match = FilenameUtils.wildcardMatch(pathStr, pathPattern);
}

The only type of matching attempted is a simple wildcard match, rather than an actual attempt at globbing.

The java.nio package may be able to support this type of matching. I have not yet built a custom version of Stocator, but the following matching code seems promising:

PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
Path newPath = FileSystems.getDefault().getPath(pathStr);

match = pm.matches(newPath);

I am not familiar enough with the rest of the Stocator codebase to know if adding in this type of matching breaks other parts of the code drastically.
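
To make the idea above easier to try outside of Stocator, here is a minimal standalone sketch that wraps the same java.nio calls in a helper and runs them against the paths from this issue. The class name `NioGlobSketch` and the method `globMatches` are hypothetical names used only for illustration; they are not part of Stocator. The sketch assumes a Unix-like default file system, where `getPath` collapses the `//` after `cos:` into a single `/` (which is why the pattern is normalized the same way). On Windows, the `:` in the path would likely be rejected.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class NioGlobSketch {

    // Hypothetical helper (not Stocator code): applies a java.nio glob
    // to a path string, using the same calls as the snippet above.
    static boolean globMatches(String pathPattern, String pathStr) {
        PathMatcher pm = FileSystems.getDefault()
            .getPathMatcher("glob:" + pathPattern.replaceAll("//", "/"));
        Path newPath = FileSystems.getDefault().getPath(pathStr);
        return pm.matches(newPath);
    }

    public static void main(String[] args) {
        String pattern = "cos://some-bucket.myCos/somefile.csv/part-0000[0-1]*";
        String base = "cos://some-bucket.myCos/somefile.csv/";
        String[] names = {"part-00000.csv", "part-00001.csv", "part-00002.csv"};
        for (String name : names) {
            // Print whether each object name is accepted by the glob.
            System.out.println(name + " -> " + globMatches(pattern, base + name));
        }
    }
}
```

This only exercises the matcher itself. It says nothing about how the result would interact with the special handling of `part-` names in ObjectStoreFlatGlobFilter.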

@gilv
Contributor

gilv commented Oct 26, 2019

@mtsargent you are not supposed to access part files directly. This is general Hadoop ecosystem usage: parts are internal files created by distributed tasks. You should never access parts directly; instead, use ("cos://some-bucket.myCos/somefile.csv"), and then globbing is supported, of course.

@mtsargent
Author

Fair point about part files, but would you anticipate the stocator globber to work with non-part files?

Suppose I try to use this to read in multiple files:

"cos://some-bucket.myCos/file-00[0-2]*"

Would you expect this to read in all of the following from my COS bucket?

file-000.txt
file-001.txt
file-002.txt

And would you expect it to ignore other files, for example:

file-003.txt
file-004.txt

I suppose I can just set up this scenario and test it out.
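
As a reference point for that test, here is a small sketch of what standard glob semantics would select for this scenario, using java.nio's `PathMatcher` on bare file names. It does not touch Stocator or COS at all; the class `BracketGlobScenario` and its `select` method are hypothetical names for illustration, and the result only shows what I would expect a glob-aware reader to pick.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BracketGlobScenario {

    // Hypothetical helper: returns the names accepted by a java.nio glob,
    // preserving input order. This is not Stocator's globber.
    static List<String> select(String glob, List<String> names) {
        PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (pm.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> objects = Arrays.asList(
            "file-000.txt", "file-001.txt", "file-002.txt",
            "file-003.txt", "file-004.txt");
        // [0-2] matches exactly one character in the range 0..2.
        System.out.println(select("file-00[0-2]*", objects));
    }
}
```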

@gilv
Contributor

gilv commented Oct 26, 2019

@mtsargent I expect exactly what you wrote. If this doesn't work, then it's a bug in Stocator and needs to be fixed, of course.

@gilv
Contributor

gilv commented Oct 26, 2019

@mtsargent however, it's not clear how ranges in [x-y] should work. It's important to know whether a range is numeric or literal. For example, with [aaxy-xyba], what would you expect to match? There might be thousands of objects, so how would we identify them? Or do you need only numeric ranges, where [1-100] would mean 1, 2, ..., 99, 100?

@mtsargent
Author

I think each expression in brackets only corresponds to a single character. The syntax I am familiar with is described here: http://man7.org/linux/man-pages/man7/glob.7.html.
[aaxy-xyba] would be the same as a single character match out of [abxy], and [1-100] would be a single character match the same as writing [01] or [0-1].
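
The single-character reading of [1-100] can be checked with a short sketch against java.nio's glob implementation, which I am using here only as a convenient stand-in for glob(7); the two are not guaranteed to agree in every corner case. The class `SingleCharBracket` and its `matches` method are hypothetical names for illustration. The sketch covers [1-100] only and makes no claim about how [aaxy-xyba] is handled.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class SingleCharBracket {

    // Hypothetical helper: true if the whole name is accepted by the glob.
    static boolean matches(String glob, String name) {
        PathMatcher pm = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return pm.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        // [1-100] is read as: the range 1..1, then the characters 0 and 0,
        // so it stands for exactly one character, either 0 or 1.
        for (String name : new String[] {"0", "1", "2", "50", "100"}) {
            System.out.println(name + " -> " + matches("[1-100]", name));
        }
    }
}
```

If the bracket were a numeric range, "50" and "100" would be accepted; under the single-character reading only "0" and "1" are.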

At the very least, I can set up this test next time I am around my work computer. I can update this issue one way or the other (and can close the issue if matching works as expected).

@gilv
Contributor

gilv commented Oct 27, 2019

@mtsargent thanks. I think we support {} right now and [] is not supported, but I need to double-check. At least I don't see unit tests for [], only for {}: https://github.com/CODAIT/stocator/blob/master/src/test/java/com/ibm/stocator/fs/cos/systemtests/TestCOSGlobberBracketStocator.java

Would you be able to extend the code to also support []? It would be great if you could work on it.

@mtsargent
Author

This may be something I can try to take on. It likely wouldn't be for a few weeks at the earliest.
