Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Differentiate between empty string and null value #81

Open
StefRe opened this issue Apr 4, 2021 · 5 comments · May be fixed by #125
Open

Feature request: Differentiate between empty string and null value #81

StefRe opened this issue Apr 4, 2021 · 5 comments · May be fixed by #125
Labels
enhancement New feature feature request New feature request

Comments

@StefRe
Copy link

StefRe commented Apr 4, 2021

readAllWithHeader yields a List<Map<String,String>> and hence empty columns are being read as empty strings, so that we get "" for both col1 and col2 in the following example:

"col1","col2"
"",

I'd really like to get null for col2 here (this of course only makes sense if all strings are quoted, otherwise it wouldn't be clear how to interpret empty columns). I understand that you can't change the result to List<Map<String,String?>> now, but maybe you could add a nullCode option for reading as it already exists for writing. The default value is an empty string "" (=current behavior). I could then simply do

val nullCode = "NULL"
val rows = csvReader(nullCode=nullCode).readAllWithHeader(inputStream)
    .map { row -> row.mapValues { col -> if (col.value == nullCode) null else col.value } }

At first glance it seems that it only requires to change
https://github.com/doyaaaaaken/kotlin-csv/blob/c23a51b98fbbcb1348b3928d561f1f5e11fba965/src/commonMain/kotlin/com/github/doyaaaaaken/kotlincsv/parser/ParseStateMachine.kt#L36-L48
to

                    delimiter -> {
                        field.append(nullCode)
                        flushField()
                        state = ParseState.DELIMITER
                    }
                    '\n', '\u2028', '\u2029', '\u0085' -> {
                        field.append(nullCode)
                        flushField()
                        state = ParseState.END
                    }
                    '\r' -> {
                        if (nextCh == '\n') pos += 1
                        field.append(nullCode)
                        flushField()
                        state = ParseState.END
                    }

and the same for https://github.com/doyaaaaaken/kotlin-csv/blob/c23a51b98fbbcb1348b3928d561f1f5e11fba965/src/commonMain/kotlin/com/github/doyaaaaaken/kotlincsv/parser/ParseStateMachine.kt#L87-L99

but I didn't check it thoroughly.

@doyaaaaaken
Copy link
Collaborator

@StefRe
Thank you for your feedback!!
I'll think about this on weekend, thanks.

@doyaaaaaken doyaaaaaken added the feature request New feature request label Apr 9, 2021
@doyaaaaaken
Copy link
Collaborator

@StefRe
Thank you for your input.
In your case, you want to regard "" as empty string and `` as null value.

It's maybe possible to have a case like the one you encountered.
However, I looked at the RFC 4180 and considered carefully, then I've felt that your case is a special case in terms of the CSV specification.

I'm very sorry, but I'll wait to deal with it until I get same feedback from others.

@StefRe
Copy link
Author

StefRe commented Apr 9, 2021

Thanks for your reply.
RFC 4180 just describes the file format with empty fields being explicitly allowed:

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

However, it doesn't say anything about how to interpret the data, specifically it doesn't mention if an empty field should be an empty string or null.

For my use case (a mock ResultSet backed by csv data) it is essential to differentiate between an empty string "" and null.

OpenCSV ("com.opencsv:opencsv:5.4") provides fine-grained configuration of what the parser should consider null:

EMPTY_SEPARATORS - two sequential separators are null.
EMPTY_QUOTES - two sequential quotes are null
BOTH - both are null
NEITHER - default. Both are considered empty string.

and allows to do exactly what I need:

import com.opencsv.CSVReaderHeaderAware
import com.opencsv.CSVReaderHeaderAwareBuilder
import com.opencsv.enums.CSVReaderNullFieldIndicator

val csv = """
    "empty col","null col"
    "",
""".trimIndent()

fun main() {
    csv.byteInputStream().bufferedReader().use { br ->
        (CSVReaderHeaderAwareBuilder(br)
            .withFieldAsNull(CSVReaderNullFieldIndicator.EMPTY_SEPARATORS)
            .build() as CSVReaderHeaderAware)
            .use { cr ->
                val csvData = cr.readMap()
                assert(csvData["empty col"] == "")
                assert(csvData["null col"] == null)
            }
    }
}

So I can switch to OpenCSV but it's a bit heavy-weight as it does far more than I actually need so I'd prefer to stick with your lib for Kotlin projects.

@doyaaaaaken
Copy link
Collaborator

@StefRe
Well, thank you for the good suggestions.
I'm convinced of your opinion.
Especially that configuration of OpenCSV seems to be also applicable for kotlin-csv.
I'd like to consider this topic as a feature request and try to implement it.

@doyaaaaaken doyaaaaaken added the enhancement New feature label Apr 11, 2021
@lucasgust
Copy link

I really want this feature...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature feature request New feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants