Does geni support reading directly from an HDFS path? #228

Closed
aaelony-catasys opened this issue Oct 4, 2020 · 8 comments

Comments

@aaelony-catasys

Does geni support reading directly from an HDFS path?

Is there something akin to the following?

(def df (read-from-hdfs "/some/path/on/hdfs/to/a/subdir/"))

... where /some/path/on/hdfs/to/a/subdir/ is a path on HDFS that contains many files?

Thanks in advance.

@aaelony-catasys
Author

Actually, perhaps

(def df (read-csv! "hdfs://some/path/on/hdfs/to/a/subdir/one-of-the-files.csv"))

is what I need, if I can resolve an error message of:

Execution error (UnknownHostException) at org.apache.hadoop.security.SecurityUtil/buildTokenService (SecurityUtil.java:378).

Nice project!
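
A note on the URL above: in `hdfs://some/path/...`, the segment right after `hdfs://` is parsed as the NameNode host, so `some` would be looked up as a hostname. That is one plausible cause of the `UnknownHostException`, though it was not confirmed in this thread. A minimal sketch using only `java.net.URI`; the helper name `host-and-path`, the host `namenode`, and the port `8020` are hypothetical placeholders, not values from this issue:

```clojure
;; Sketch: how a URI string splits into host and path.
;; Plain JVM interop, so it runs in any Clojure REPL without geni.
(import 'java.net.URI)

(defn host-and-path
  "Returns the host and path components of a URI string."
  [s]
  (let [uri (URI. s)]
    {:host (.getHost uri) :path (.getPath uri)}))

;; Two slashes: the first segment is taken as the host, not as part of the path.
(host-and-path "hdfs://some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")
;; => {:host "some", :path "/path/on/hdfs/to/a/subdir/one-of-the-files.csv"}

;; Three slashes: the host is empty, so Hadoop should fall back to the
;; cluster's default filesystem (fs.defaultFS) if one is configured.
(host-and-path "hdfs:///some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")
;; => {:host nil, :path "/some/path/on/hdfs/to/a/subdir/one-of-the-files.csv"}

;; Explicit host and port ("namenode" and 8020 are placeholders).
(host-and-path "hdfs://namenode:8020/some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")
;; => {:host "namenode", :path "/some/path/on/hdfs/to/a/subdir/one-of-the-files.csv"}
```

With either of the last two URL forms, the `read-csv!` call itself would presumably stay as written in the comment above.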

@behrica
Contributor

behrica commented Oct 4, 2020

I got it working for "wasb://" URLs to read from Azure Blob Storage.

It took quite a bit of digging to work out which jars to add and which configuration options to pass.
I suppose that working with HDFS might be similar.

@anthony-khong
Member

Hi @aaelony-catasys, thank you for raising the issue! I believe this should be possible. Geni just handles Spark objects and calls Spark methods. If you see this article, it seems doable. I'm not sure whether you are running into this issue, though.

@behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄

@aaelony-catasys
Author

Actually, I think the issue is Kerberos-related. I'm trying to work out how to properly kinit from within geni.
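
One possible approach, sketched under assumptions: Hadoop's `UserGroupInformation` class can log in from a keytab inside the JVM, which plays roughly the role of `kinit` for the current process. The function name `kerberos-login!`, the principal, and the keytab path below are hypothetical placeholders. Whether this alone is enough depends on the cluster setup (for example, a reachable `krb5.conf`), and it was not confirmed in this thread.

```clojure
;; Sketch only: keytab login through Hadoop's own API, called via Clojure interop.
;; Assumes hadoop-common is on the classpath (it normally comes with Spark).
(import '[org.apache.hadoop.conf Configuration]
        '[org.apache.hadoop.security UserGroupInformation])

(defn kerberos-login!
  "Logs the current JVM process in to Kerberos from a keytab file.
   principal and keytab-path are supplied by the caller."
  [principal keytab-path]
  (let [conf (doto (Configuration.)
               ;; Tell Hadoop's security layer to use Kerberos instead of simple auth.
               (.set "hadoop.security.authentication" "kerberos"))]
    (UserGroupInformation/setConfiguration conf)
    (UserGroupInformation/loginUserFromKeytab principal keytab-path)))

;; Hypothetical usage, to be run before the first HDFS read:
;; (kerberos-login! "user@EXAMPLE.COM" "/path/to/user.keytab")
```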

@anthony-khong
Member

Hi @aaelony-catasys, are you still having issues with reading from an HDFS path?

@aaelony-catasys
Author

Hi @anthony-khong, it is a Kerberos issue that I haven't had the chance to look into in depth. I did find a few URLs to research here and here, but I don't know anything about Kerberos, so it might be a while before I can go the geni route.

@aaelony-catasys
Author

Hi @anthony-khong, it will take me some time to get up to speed on Kerberos, Spark, Docker ports, etc., and how they interrelate. Unfortunately, I don't have spare cycles to devote to this in the near term. You might wish to close this ticket for the time being.

Best regards

@gnarroway

Just making a note that I got this working today in the REPL against a kerberized HDFS, so it is definitely possible. I’ll compile some notes after I make sure everything works as a deployed job too.
