
Fix tables with empty first cell #380

Draft. joouha wants to merge 2 commits into master.

joouha commented Mar 23, 2022

Hello

This PR fixes an issue where an empty first cell in a table produces a markdown table that does not parse properly.

For example:

<table>
  <tr><th></th><th>b</th></tr>
  <tr><td>c</td><td>d</td></tr>
</table>

results in the following output:

| b
---|---
c | d

which is not a valid markdown table, because the header row has one cell while the delimiter row has two, so it renders as plain text:

| b
---|---
c | d
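
As a rough illustration of why this output is rejected, the sketch below counts the cells in a header row and in a delimiter row and compares them, since a pipe table is only recognized when the two counts match. The helper names `count_cells` and `header_matches_delimiter` are hypothetical and not part of html2text, and escaped pipes are deliberately not handled.

```python
def count_cells(row: str) -> int:
    """Count the cells in one pipe-table row (escaped pipes are not handled)."""
    row = row.strip()
    # One optional leading and one optional trailing pipe do not add cells.
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|"):
        row = row[:-1]
    return len(row.split("|"))


def header_matches_delimiter(header: str, delimiter: str) -> bool:
    """Return True when both rows have the same number of cells."""
    return count_cells(header) == count_cells(delimiter)


print(header_matches_delimiter("| b", "---|---"))    # False: 1 cell vs 2
print(header_matches_delimiter("| | b", "---|---"))  # True: 2 cells vs 2
```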


With this fix, the output is:

| | b
---|---
c | d

which renders correctly:

|   | b |
|---|---|
| c | d |

joouha (Author) commented Mar 27, 2022

If anyone else is experiencing the same issue, the following regex substitution can be used as a workaround:

import re
from html2text import HTML2Text

data = "<table><tr><td></td><td>a</td></tr></table>"

result = HTML2Text().handle(data)

print(result)

# | a
# ---|---

# Workaround: prepend an empty header cell (pattern kept as posted).
result = re.sub(
    r"^((\|\s+[^\|]+\s+)((\|\s+[^\|]+\s+|:?-+:?\|)(\|\s+[^\|]+\s+|:?-+:?\|))*:?-+:?\|:?-+:?\s*$)",
    r"|  \1",
    result,
    flags=re.MULTILINE,
)

print(result)

# |  | a
# ---|---

It adds an empty cell at the start of the table if the number of header cells does not match the number of columns in the table.
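
For readers who find the regex hard to follow, the same idea can be sketched line by line. This is an assumption-laden alternative, not the code in this PR: `pad_missing_first_cell` and its inner helpers are hypothetical names, escaped pipes are ignored, and a header is only patched when it is exactly one cell short of the delimiter row beneath it.

```python
import re


def pad_missing_first_cell(markdown: str) -> str:
    """Prepend an empty cell to header rows that are one cell short."""

    def cells(row: str) -> list:
        # Split a row into stripped cells, ignoring one outer pipe on each side.
        row = row.strip()
        if row.startswith("|"):
            row = row[1:]
        if row.endswith("|"):
            row = row[:-1]
        return [cell.strip() for cell in row.split("|")]

    def is_delimiter(row: str) -> bool:
        # A delimiter row has several cells that all look like ---, :--- or ---:
        parts = cells(row)
        return len(parts) > 1 and all(re.fullmatch(r":?-+:?", p) for p in parts)

    lines = markdown.split("\n")
    for i in range(len(lines) - 1):
        header, delimiter = lines[i], lines[i + 1]
        if not is_delimiter(delimiter) or is_delimiter(header) or "|" not in header:
            continue
        if len(cells(header)) == len(cells(delimiter)) - 1:
            lines[i] = "| " + header.lstrip()
    return "\n".join(lines)


print(pad_missing_first_cell("| b\n---|---\nc | d"))
# | | b
# ---|---
# c | d
```

A table whose header already has the right number of cells is returned unchanged, so the function can be applied to the whole converted document.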

Alir3z4 (Owner) commented Jan 16, 2024

Can you please update the code with tests?

If the first cell of a table is empty, no entry was created for it in the
markdown table. This resulted in markdown tables which do not parse properly.
joouha force-pushed the fix/table-first-row branch from dff8aa3 to 41f88f5 on January 17, 2024 11:13
Alir3z4 (Owner) commented Jan 17, 2024

@joouha Thanks for adding the tests and the patch.
The code looks OK, but the CI run failed.

You can run the tests locally with tox and check the results yourself.

joouha marked this pull request as draft on January 17, 2024 13:28
eigen2017 commented

I encountered the same issue, for example:
---|---|---

The correct expression should be:
|---|---|---|

eigen2017 commented

I encountered the same issue, for example: ---|---|---

The correct expression should be: |---|---|---|

Setting html2text.config.PAD_TABLES = True can solve this.
