Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly parse shared strings in the streaming implementation #178

Merged
merged 2 commits into from
Jun 6, 2024

Conversation

sergesku
Copy link
Contributor

@sergesku sergesku commented Jun 5, 2024

At the moment Codec.Xlsx.Parser.Stream.parseSharedStrings implementation incorrectly parses shared strings values:
it creates a separate values for each formatted piece of text. This creates skewed data in the result.

Here is an example of failing test:

textA1 = "Text at A1"
firstClauseB1 = "First clause at B1;"
firstClauseB2 = "First clause at B2;"
secondClauseB1 = "Second clause at B1"
secondClauseB2 = "Second clause at B2"

cellRich :: Text -> Text -> CellValue
cellRich firstClause secondClause = CellRich
  [ RichTextRun
      { _richTextRunProperties = SomeProperties
      , _richTextRunText = firstClause
      }
  , RichTextRun
      { _richTextRunProperties = SomeOtherProperties
      , _richTextRunText = secondClause
      }
  ]

richWorkbook :: Xlsx
richWorkbook = def & atSheet "Sheet1" ?~ toWs
  [ ((RowIndex 1, ColumnIndex 1), cellValue ?~ CellText textA1 $ def)
  , ((RowIndex 2, ColumnIndex 1), cellValue ?~ cellRich firstClauseB1 secondClauseB1 $ def)
  , ((RowIndex 2, ColumnIndex 2), cellValue ?~ cellRich firstClauseB2 secondClauseB2 $ def)
  ]

        Expected:
        fromList
          ["First clause at B1;Second clause at B1",
           "First clause at B2;Second clause at B2", "Text at A1"]
        
        Difference:
        2,3c2,3
        <   ["First clause at B1;Second clause at B1",
        <    "First clause at B2;Second clause at B2", "Text at A1"]
        ---
        >   ["First clause at B1;", "First clause at B2;",
        >    "Second clause at B1", "Second clause at B2", "Text at A1"]

This PR fixes this behavior. Now we parse a shared string as a single cell value.

P.S. We still create CellValue as CellText, not as CellRich.

@qrilka
Copy link
Owner

qrilka commented Jun 5, 2024

Thanks @sergesku
I'll take a look into it soon.

@qrilka
Copy link
Owner

qrilka commented Jun 5, 2024

P.S. We still create CellValue as CellText, not as CellRich.

Do you mean that for the streamed parser?
Would you mind adding that as a know TODO? Probably to the function you changed
The PR looks good

@sergesku
Copy link
Contributor Author

sergesku commented Jun 6, 2024

Do you mean that for the streamed parser?

Yes. Thus, if we have a cell with a Rich Text value in an a file, at the exit from the stream parser we have a regular CellText value.

@qrilka qrilka merged commit 87f74bb into qrilka:master Jun 6, 2024
5 checks passed
@qrilka
Copy link
Owner

qrilka commented Jun 6, 2024

Thanks @sergesku for your contribution
I plan to put this on Hackage after #177 will land on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants