readDelimiter variant for Regex as delimiter #746

Open
dave08 opened this issue Jun 20, 2024 · 5 comments
Labels
csv CSV / delim related issues enhancement New feature or request files reading/writing from/to files

Comments

dave08 commented Jun 20, 2024

Since this function exists specifically to read delimited text, it might be useful to have an overload that takes a Regex as the delimiter. This would help with command-line output tables, which are usually space-separated but sometimes contain a single space inside a column value, so I need to use "\s\s+" to read them correctly.

koperagen (Collaborator) commented

Hi. The library we're using now only offers String and Char options for the delimiter. Is your file a CSV/TSV, or just a plain txt file with some special format you want to parse?

Author

dave08 commented Jun 20, 2024

Say I have (output from kubectl get namespaces):

NAME                     STATUS   AGE      LABELS
argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
beta                     Active   4y235d   kubernetes.io/metadata.name=beta

Then I have multiple spaces as delimiters...

In some command line outputs, I have two words in one column:

NAME                                                                     CLUSTER        CDS        LDS        EDS        RDS          ECDS         ISTIOD                             VERSION
foo-5fcd67944f-2t97k.dev                                           Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-18-7-dbcdbb5f4-nth9n      1.18.7
foo-6f8bf4c9b9-qrwf9.prod                                          Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-16-7-6d46d45875-gxtzw     1.16.7

Note the NOT SENT value: that's where a regex helps. The columns aren't separated by tabs but by runs of spaces, and a single space can appear inside a value.
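As a rough illustration of the regex approach, here is a minimal sketch using only the Kotlin standard library. The helper name `parseAlignedTable` is hypothetical and is not part of the DataFrame API; the sample text is an abbreviated version of the second table above. It assumes the input has at least one non-blank line and that no cell is empty.

```kotlin
// Hypothetical helper, not part of the DataFrame API: splits whitespace-aligned
// CLI output into a header and rows, using a Regex as the delimiter.
fun parseAlignedTable(
    text: String,
    delimiter: Regex = Regex("""\s{2,}"""), // two or more whitespace characters
): Pair<List<String>, List<List<String>>> {
    val lines = text.lines().filter { it.isNotBlank() }
    // A single space inside a value (e.g. "NOT SENT") does not match the delimiter.
    val header = lines.first().trim().split(delimiter)
    val rows = lines.drop(1).map { it.trim().split(delimiter) }
    return header to rows
}

fun main() {
    val output = """
        NAME       CDS        ECDS         VERSION
        foo.dev    SYNCED     NOT SENT     1.18.7
        foo.prod   SYNCED     NOT SENT     1.16.7
    """.trimIndent()

    val (header, rows) = parseAlignedTable(output)
    println(header)  // [NAME, CDS, ECDS, VERSION]
    println(rows[0]) // [foo.dev, SYNCED, NOT SENT, 1.18.7]
}
```

This breaks down when a column is empty in some row, because the neighbouring runs of spaces merge into one delimiter match; handling that case would need column offsets taken from the header line.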

Also, how would you parse Markdown tables (or similar)? Unless the library trims all those extra spaces... but I guess with Markdown there might be more complications than just the delimiter.

koperagen (Collaborator) commented

Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown:

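import org.jetbrains.kotlinx.dataframe.api.*

// Strip the outer pipes, split on the inner ones, and trim each cell.
fun String.markdownCells() = trim('|').split("|").map { it.trim() }

val s = """
| Month    | Savings |
| -------- | ------- |
| January  | $250    |
| February | $80     |
| March    | $420    |""".trimIndent()

val lines = s.lineSequence()
// Skip the header and separator rows, load the remaining lines as a single "value" column,
// then split that column into cells named after the header row.
lines.drop(2).toList().toDataFrame()
    .split { value }.by { it.markdownCells() }
    .into(lines.first().markdownCells())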

Author

dave08 commented Jun 20, 2024

I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...

I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?
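To make the idea concrete, here is one possible shape for such a line-based DSL, written with the standard library only. Every name in it (`LineTableSpec`, `skipLine`, `cells`, `readTable`, `ParsedTable`) is hypothetical and does not exist in the DataFrame library; it is a sketch of what a caller might write, not a proposal for the actual implementation. It assumes the first kept line is the header.

```kotlin
// Hypothetical API sketch: none of these names exist in the DataFrame library.
class LineTableSpec {
    // Return true for lines that should be ignored (index counts non-blank lines).
    var skipLine: (Int, String) -> Boolean = { _, _ -> false }
    // Turns one line into its cell values; used for the header and for data rows.
    var cells: (String) -> List<String> = { it.split(",") }
}

data class ParsedTable(val header: List<String>, val rows: List<List<String>>)

fun readTable(text: String, configure: LineTableSpec.() -> Unit): ParsedTable {
    val spec = LineTableSpec().apply(configure)
    val kept = text.lines()
        .filter { it.isNotBlank() }
        .filterIndexed { index, line -> !spec.skipLine(index, line) }
    // The first kept line supplies the titles, the rest supply the values.
    return ParsedTable(spec.cells(kept.first()), kept.drop(1).map(spec.cells))
}

fun main() {
    val markdown = """
        | Month    | Savings |
        | -------- | ------- |
        | January  | $250    |
        | February | $80     |
    """.trimIndent()

    val table = readTable(markdown) {
        skipLine = { index, _ -> index == 1 } // the "| --- | --- |" separator row
        cells = { line -> line.trim().trim('|').split("|").map { it.trim() } }
    }
    println(table.header) // [Month, Savings]
    println(table.rows)   // [[January, $250], [February, $80]]
}
```

The same `readTable` call would cover the space-aligned CLI tables by setting `cells = { it.trim().split(Regex("""\s{2,}""")) }`, which keeps parsing in a single step from the caller's point of view.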

koperagen added the enhancement New feature or request label Jun 20, 2024
koperagen (Collaborator) commented

Please share the API you'd like to see, or examples of the usage you have in mind. Maybe something like this could be added.

zaleslaw modified the milestones: 0.14.0, Backlog Jul 19, 2024
zaleslaw added the files reading/writing from/to files label Jul 19, 2024
Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
Jolanrensen self-assigned this Aug 20, 2024
Jolanrensen mentioned this issue Aug 20, 2024
19 tasks
4 participants