Make the SasFileParser() API Public #51

Closed
thesuperzapper opened this issue Jul 23, 2019 · 11 comments

Comments

@thesuperzapper

Hi Guys, I am the maintainer of the spark-sas7bdat package, which is used by many to read large SAS7BDAT files with Apache Spark across many servers.

The issue is that, because we need to start reading at arbitrary offsets of the SAS files (so each of the servers only reads a piece of the file), we need to call some of the protected methods/constructors in Parso.

Currently we use this crazy hack (PrivateMethodExposer.scala) to break into the protected methods and constructors.
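
For readers unfamiliar with this kind of workaround, the following is a minimal Java sketch of forcing access to non-public members via reflection. It is not the code from PrivateMethodExposer.scala, and `HiddenParser` and `ReflectionExposer` are hypothetical stand-ins invented for illustration; they are not Parso classes.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

// Hypothetical stand-in for a library class whose constructor and method are not public.
class HiddenParser {
    private final String source;
    private int rowsRead = 0;

    private HiddenParser(String source) {
        this.source = source;
    }

    private String readNext() {
        rowsRead++;
        return source + "#" + rowsRead;
    }
}

// Hypothetical helper that forces access via reflection (an assumption, not a Parso API).
final class ReflectionExposer {
    private ReflectionExposer() {}

    // Calls a non-public constructor with the given parameter types and arguments.
    static Object construct(Class<?> type, Class<?>[] paramTypes, Object... args) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor(paramTypes);
            ctor.setAccessible(true); // bypass the access check
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not construct " + type.getName(), e);
        }
    }

    // Calls a non-public, zero-argument method declared on the target's own class.
    static Object invoke(Object target, String methodName) {
        try {
            Method method = target.getClass().getDeclaredMethod(methodName);
            method.setAccessible(true); // bypass the access check
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Object parser = construct(HiddenParser.class, new Class<?>[] { String.class }, "file.sas7bdat");
        System.out.println(invoke(parser, "readNext"));
        System.out.println(invoke(parser, "readNext"));
    }
}
```

Code like this depends on member names and signatures that the library is free to change, which is why a public API is preferable.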

I would love it if you could publicly expose the following things:

  1. The constructor for com.epam.parso.impl.SasFileParser
  2. The getSasFileProperties() method of SasFileParser
  3. The readNext() method of SasFileParser
  4. The readNextPage() method of SasFileParser
  5. The constants: TIME_FORMAT_STRINGS, DATE_FORMAT_STRINGS, and EPSILON

Here is the actual class we use in conjunction with the PrivateMethodExposer.scala, so we can use com.epam.parso.impl.SasFileParser: ParsoWrapper.scala

@printsev
Contributor

Hi Mathew,

Thanks for sharing your issue. Could you please answer the following questions so we can find the best way to help you?

  1. Would it be possible for you to use SasFileReaderImpl?
  2. Why do you need readNextPage() to be public, given that it is an internal method for file parsing?

Thanks

@thesuperzapper
Author

@printsev
This is the process we currently use to distribute the reading of sas files across multiple servers:

  1. Each worker is provided a rough "start point" and "end point" as byte offsets from the beginning of the file. (These are coordinated by a master node, and each range spans roughly "File Size"/"Num workers" bytes.)
  2. Each worker initialises a com.epam.parso.impl.SasFileParser.
    1. With the input stream starting from byte 0, so that it can read the metadata in the header.
  3. Each worker moves its "start point" and "end point" backwards such that they sit on the closest preceding page end.
  4. Each worker seeks the SasFileParser's input stream to the "start point".
    1. After this we call readNextPage() so that the internals of Parso get reset to the new offset.
  5. Each worker reads new rows until the input stream reaches the "end point" offset.

Here is the code, which currently uses hacks to expose the private methods described above.
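
As a rough illustration of the offset arithmetic in steps 1 and 3 only, here is a minimal Java sketch. It assumes the file consists of a fixed-length header followed by fixed-length pages, with both lengths already read from the file metadata. The class `SplitAligner` and its method names are hypothetical and are not part of Parso or spark-sas7bdat.

```java
// Hypothetical helper for illustration; not part of Parso or spark-sas7bdat.
final class SplitAligner {
    private SplitAligner() {}

    // Moves an offset back to the closest preceding page boundary.
    // Offsets inside the header are moved to the start of the first page.
    static long alignToPrecedingPage(long offset, long headerLength, long pageLength) {
        if (pageLength <= 0) {
            throw new IllegalArgumentException("pageLength must be positive");
        }
        if (offset <= headerLength) {
            return headerLength;
        }
        return offset - ((offset - headerLength) % pageLength);
    }

    // Returns {start, end} for one worker: a rough fileSize/numWorkers range,
    // with both ends moved back onto page boundaries.
    static long[] alignedSplit(int worker, int numWorkers, long fileSize,
                               long headerLength, long pageLength) {
        long chunk = fileSize / numWorkers;
        long roughStart = worker * chunk;
        // The last worker takes whatever remains up to the end of the file.
        long roughEnd = (worker == numWorkers - 1) ? fileSize : roughStart + chunk;
        return new long[] {
            alignToPrecedingPage(roughStart, headerLength, pageLength),
            alignToPrecedingPage(roughEnd, headerLength, pageLength)
        };
    }

    public static void main(String[] args) {
        long headerLength = 1024;
        long pageLength = 4096;
        long fileSize = headerLength + 10 * pageLength; // header plus 10 pages
        for (int worker = 0; worker < 3; worker++) {
            long[] split = alignedSplit(worker, 3, fileSize, headerLength, pageLength);
            System.out.println("worker " + worker + ": [" + split[0] + ", " + split[1] + ")");
        }
    }
}
```

Because every boundary is moved in the same direction, one worker's aligned "end point" equals the next worker's aligned "start point", so each page is read by exactly one worker.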

@Yana-Guseva
Collaborator

Hi @thesuperzapper,

Would you be comfortable if we provided public access to the readNextPage() method via the SasFileReaderImpl class? This class already contains public access methods to getSasFileProperties() and readNext(), which you use in your code. The constants TIME_FORMAT_STRINGS, DATE_FORMAT_STRINGS, and EPSILON have been moved to public interfaces (to be released soon). So it looks like we can add a public void readNextPage() method to SasFileReaderImpl, and you can use it instead of creating an instance of SasFileParser directly. Or are there any other reasons why you need an instance of the SasFileParser class? Thank you.

@thesuperzapper
Author

@Yana-Guseva That would probably work.

@thesuperzapper
Author

@Yana-Guseva any progress on this?
We would love to make spark-sas7bdat use this new public API.

@Yana-Guseva
Collaborator

@thesuperzapper All changes related to this issue are now available in the master branch. Please let me know if you run into any problems.

@thesuperzapper
Author

@Yana-Guseva @printsev While these changes are now in the master branch, no release has happened in many months.

When are you planning to cut a release with these changes? (the spark-sas7bdat package needs this change urgently to support Spark 3.0)

@Tagar

Tagar commented Aug 18, 2020

@printsev @Yana-Guseva any chance a new release can be cut from master? thx!!

@printsev
Contributor

Sorry for the delay in answering: vacation time (even though the year is absolutely crazy). I've deployed 2.0.12-SNAPSHOT to the Maven snapshot repository. Hope it's OK for now. After we deal with some failing tests (I believe the problem is not in the application, as they fail even with code from 2016), we can make a 2.0.12 release.

@Tagar

Tagar commented Aug 19, 2020

Thank you Igor

@printsev
Contributor

I've made the 2.0.12 release. Please let me know if it works for you.
