Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Apache Commons CSV for parsing external secondary instances #644

Merged
merged 4 commits into from
Sep 17, 2021

Conversation

lognaturel
Copy link
Member

@lognaturel lognaturel commented Sep 9, 2021

Closes #616

Replaces #628

What has been done to verify that this works as intended?

Added tests for both comma-separated and semicolon-separated external secondary instances that have quotes, missing values, etc. I've also tried this in Collect with https://github.com/lognaturel/collect/tree/jr-616

Why is this the best possible solution? Were any other approaches considered?

Using an external library means we don't have to think through all the cases. commons-csv is broadly used. We had to use v1.4 or prior in order to support a minsdk of 21. If I'd thought of it sooner, I would have suggested we look into opencsv which Collect uses but I don't think it's a big deal that the libs aren't the same.

Because Excel exports CSVs with semicolons for certain locales, I think it's good to support that case. The best way I could come up with to do that is to check whether the header contains a semicolon.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

This should only expand the kind of external secondary instance CSVs that can be successfully parsed. I can't think of a regression risk. Risk would be isolated to parsing external secondary CSVs.

Do we need any specific form for testing your changes? If so, please attach one.

See issue or use any form that uses a secondary instance such as https://github.com/getodk/javarosa/blob/3f1060ce34a2155856d0f819ce42451897e428a4/src/test/resources/org/javarosa/core/model/instance/external-secondary-comma-complex.csv

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No.

@lognaturel lognaturel marked this pull request as ready for review September 13, 2021 20:00
@lognaturel lognaturel requested a review from seadowg September 13, 2021 20:00
}
}

@Before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @Before at the bottom really threw me off. Obviously still works here, but I think for readability it should move up to the top.

try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
String header = reader.readLine();

if (header.contains(";")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this means a CSV like hello; world, blah would have columns hello and world, blah rather than hello; world and blah (which I might have meant). I think that's ok, as parsing these things with more than one delimiter and always getting it right is impossible, but just thinking out loud.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to call this out, sorry! It feels like having punctuation in column headers would be very unusual. I can kind of imagine it with commas maybe? Something cruel like a column header of "First, Last Name"? But I really can't imagine using semicolons in a column header so that's why it felt safest to explicitly test for semicolon. If we really wanted to be fancy we could try to account for quotes in headers but that really felt like overkill. I'd like to go with the limitation you pointed out and adjust if someone has a legitimate usecase for a semicolon in the header (please, no!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which reminds me of something else I meant to mention -- I briefly considered also supporting tabs but again, I think we should wait until someone explicitly asks for it. The C in CSV is comma.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C in CSV is comma.

This.

@@ -184,14 +183,14 @@ public void realInstanceIsResolved_whenFormIsDeserialized_afterPlaceholderInstan

@Test
// Clients would typically catch this exception and try parsing the form again which would succeed by using the placeholder.
public void fileNotFoundException_whenFormIsDeserialized_afterPlaceholderInstanceUsed_andFileStillMissing() throws IOException, DeserializationException {
public void ioException_whenFormIsDeserialized_afterPlaceholderInstanceUsed_andFileStillMissing() throws DeserializationException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just be being slow today, but I'm missing the reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! In versions 1.6+ of the lib some other IOException exception is thrown. It's not in Android 21 so that's why I had to downgrade the library further. Forgot to clean this up. I rebased -- hopefully you won't find it too hard to follow.

import org.junit.Before;
import org.junit.Test;

public class CsvExternalInstanceTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having multiple cases in two separate files made this a little hard to follow. I kept having to jump around to see what was really being tested. Not a big deal, but maybe a tweak would be to just have smaller CSVs inline and written to a file that gets parsed in the tests or just smaller more specific CSV test resources? Maybe I'm missing the point, and we need multiple cases in one file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't disagree with you. I used the existing tests to get to the coverage I wanted as quickly as possible. I split the window vertically to see the test and resource side by side. If you feel strongly about this, can you please push a commit? I doubt we'll need to look at these tests again any time soon so only if it's really important to you to spend the extra time!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went to spend 10 mins doing this but caught up with some caching problem in IntelliJ so will leave it for the moment. I did a quick rename though: I think part of my problems reading the test were probably due to the tabSeprated field, which was the more obvious problem we both missed!

dcbriccetti and others added 2 commits September 15, 2021 13:43
Excel export format for CSV is based on the document locale. For some locales such as French/France, semicolon is used as the delimiter.
Copy link
Member

@seadowg seadowg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to merge if you're happy with my tweak

@lognaturel lognaturel merged commit d7e4616 into getodk:master Sep 17, 2021
@lognaturel lognaturel deleted the issue-616 branch September 17, 2021 20:09
@lognaturel
Copy link
Member Author

Oh dear. Thank you and I’m sorry. 🤪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support quotes, escaped commas, and other CSV formatting in external CSV files
3 participants