Use Apache Commons CSV for parsing external secondary instances #644

lognaturel · 2021-09-09T22:13:48Z

Closes #616

Replaces #628

What has been done to verify that this works as intended?

Added tests for both comma-separated and semicolon-separated external secondary instances that have quotes, missing values, etc. I've also tried this in Collect with https://github.com/lognaturel/collect/tree/jr-616

Why is this the best possible solution? Were any other approaches considered?

Using an external library means we don't have to think through all the cases ourselves. commons-csv is broadly used. We had to use v1.4 or earlier in order to support a minSdk of 21. If I'd thought of it sooner, I would have suggested we look into opencsv, which Collect uses, but I don't think it's a big deal that the libs aren't the same.

Because Excel exports CSVs with semicolons for certain locales, I think it's good to support that case. The best way I could come up with to do that is to check whether the header contains a semicolon.
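For illustration, here is a minimal sketch of the header check described above, using only the JDK. `DelimiterSniffer` and its method names are hypothetical and are not the code in this PR; treating an empty file as comma-delimited is also an assumption of the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration only; not the class changed in this PR.
final class DelimiterSniffer {
    // Picks ';' when the header line contains a semicolon, ',' otherwise.
    // A null header (empty file) falls back to ',' (an assumption of this sketch).
    static char fromHeader(String header) {
        return header != null && header.contains(";") ? ';' : ',';
    }

    // Reads only the first line of the CSV text and sniffs the delimiter from it.
    static char fromCsvText(String csv) {
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            return fromHeader(reader.readLine());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The detected character would then be handed to whatever parser is in use, so only the header line decides how the rest of the file is split.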

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

This should only expand the kinds of external secondary instance CSVs that can be successfully parsed. I can't think of a regression risk. Any risk would be isolated to parsing external secondary CSVs.

Do we need any specific form for testing your changes? If so, please attach one.

See issue or use any form that uses a secondary instance such as https://github.com/getodk/javarosa/blob/3f1060ce34a2155856d0f819ce42451897e428a4/src/test/resources/org/javarosa/core/model/instance/external-secondary-comma-complex.csv

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No.

seadowg · 2021-09-15T15:16:58Z

src/test/java/org/javarosa/core/model/instance/CsvExternalInstanceTest.java

+        }
+    }
+
+    @Before


The @Before at the bottom really threw me off. Obviously still works here, but I think for readability it should move up to the top.

seadowg · 2021-09-15T15:24:01Z

src/main/java/org/javarosa/core/model/instance/CsvExternalInstance.java

+        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
+            String header = reader.readLine();
+
+            if (header.contains(";")) {


I guess this means a CSV like "hello; world, blah" would have columns "hello" and "world, blah" rather than "hello; world" and "blah" (which I might have meant). I think that's ok, as parsing these things with more than one delimiter and always getting it right is impossible, but just thinking out loud.
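To make that trade-off concrete, here is a small sketch of how a header containing both characters would be split under the "semicolon wins" rule. `NaiveHeaderSplit` is a hypothetical name, and the plain `String.split` below ignores quoting entirely, unlike a real CSV parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; a real CSV parser also handles quoted fields.
final class NaiveHeaderSplit {
    // Splits a header on ';' if it contains one, otherwise on ','.
    static List<String> columns(String header) {
        String delimiter = header.contains(";") ? ";" : ",";
        List<String> result = new ArrayList<>();
        // Limit -1 keeps trailing empty columns instead of dropping them.
        for (String part : header.split(Pattern.quote(delimiter), -1)) {
            result.add(part.trim());
        }
        return result;
    }
}
```

With this rule, `columns("hello; world, blah")` yields the two columns `hello` and `world, blah`, which is the limitation described in the comment above.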

I meant to call this out, sorry! It feels like having punctuation in column headers would be very unusual. I can kind of imagine it with commas maybe? Something cruel like a column header of "First, Last Name"? But I really can't imagine using semicolons in a column header so that's why it felt safest to explicitly test for semicolon. If we really wanted to be fancy we could try to account for quotes in headers but that really felt like overkill. I'd like to go with the limitation you pointed out and adjust if someone has a legitimate usecase for a semicolon in the header (please, no!).

Which reminds me of something else I meant to mention -- I briefly considered also supporting tabs but again, I think we should wait until someone explicitly asks for it. The C in CSV is comma.

> The C in CSV is comma.

This.

seadowg · 2021-09-15T15:29:01Z

src/test/java/org/javarosa/xform/parse/ExternalSecondaryInstanceParseTest.java

@@ -184,14 +183,14 @@ public void realInstanceIsResolved_whenFormIsDeserialized_afterPlaceholderInstan

    @Test
    // Clients would typically catch this exception and try parsing the form again which would succeed by using the placeholder.
-    public void fileNotFoundException_whenFormIsDeserialized_afterPlaceholderInstanceUsed_andFileStillMissing() throws IOException, DeserializationException {
+    public void ioException_whenFormIsDeserialized_afterPlaceholderInstanceUsed_andFileStillMissing() throws DeserializationException {


I might just be being slow today, but I'm missing the reason for this change?

Good catch! In versions 1.6+ of the lib, some other IOException is thrown. That exception isn't available on Android API 21, which is why I had to downgrade the library further. I forgot to clean this up. I rebased -- hopefully you won't find it too hard to follow.

seadowg · 2021-09-15T15:33:35Z

src/test/java/org/javarosa/core/model/instance/CsvExternalInstanceTest.java

+import org.junit.Before;
+import org.junit.Test;
+
+public class CsvExternalInstanceTest {


I think having multiple cases in two separate files made this a little hard to follow. I kept having to jump around to see what was really being tested. Not a big deal, but maybe a tweak would be to just have smaller CSVs inline and written to a file that gets parsed in the tests or just smaller more specific CSV test resources? Maybe I'm missing the point, and we need multiple cases in one file.
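As a rough sketch of the "inline CSV written to a file" idea, a test could build its fixture like this. `InlineCsvFixture` and `writeTempCsv` are hypothetical names, not helpers that exist in JavaRosa, and the file prefix is arbitrary.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test helper; not part of the JavaRosa test suite.
final class InlineCsvFixture {
    // Writes the given lines to a temporary .csv file and returns its path.
    static Path writeTempCsv(String... lines) {
        try {
            Path file = Files.createTempFile("external-secondary", ".csv");
            file.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.write(file, String.join("\n", lines).getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each test could then declare just the rows it cares about, for example `writeTempCsv("name;label", "a;\"x; y\"")`, and pass the resulting path to the code under test.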

Don't disagree with you. I used the existing tests to get to the coverage I wanted as quickly as possible. I split the window vertically to see the test and resource side by side. If you feel strongly about this, can you please push a commit? I doubt we'll need to look at these tests again any time soon so only if it's really important to you to spend the extra time!

I went to spend 10 mins doing this but got caught up with some caching problem in IntelliJ, so I'll leave it for the moment. I did a quick rename though: I think part of my problem reading the test was probably due to the tabSeprated field, which was the more obvious problem we both missed!

seadowg

Good to merge if you're happy with my tweak

lognaturel · 2021-09-17T20:10:30Z

Oh dear. Thank you and I’m sorry. 🤪

Create a test to reveal parsing defects

b74d9b4

lognaturel mentioned this pull request Sep 9, 2021

616 CSV parsing improvements-WIP #628

Closed

lognaturel marked this pull request as ready for review September 13, 2021 20:00

lognaturel requested a review from seadowg September 13, 2021 20:00

seadowg requested changes Sep 15, 2021

lognaturel force-pushed the issue-616 branch from 3f1060c to c2d206c Compare September 15, 2021 20:35

dcbriccetti and others added 2 commits September 15, 2021 13:43

Replace hand-written CSV parser with Apache Commons CSV

fae6c7c

Support semicolon as delimiter

d176628

Excel export format for CSV is based on the document locale. For some locales such as French/France, semicolon is used as the delimiter.

lognaturel force-pushed the issue-616 branch from c2d206c to d176628 Compare September 15, 2021 20:44

lognaturel requested a review from seadowg September 15, 2021 20:48

Correct field name

829d1dd

seadowg force-pushed the issue-616 branch from 5e9fa5c to 829d1dd Compare September 16, 2021 09:10

seadowg approved these changes Sep 16, 2021

lognaturel merged commit d7e4616 into getodk:master Sep 17, 2021

lognaturel deleted the issue-616 branch September 17, 2021 20:09

lognaturel mentioned this pull request Oct 4, 2021

Update JR to 3.3.0 snapshot and update test for re-calculation in field list getodk/collect#4845

Merged
