Fix common issues encountered when exporting table annotations #78

Merged
merged 2 commits into from
Sep 13, 2024

Conversation

BotAndyGao
Contributor

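Fix common issues encountered when exporting table annotations

Software version where the issues reproduce

  • PPOCRLabel v2.1.8

Resources for reproducing the issues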

56

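Issue 1: Wrong colspan and rowspan when exporting table annotations

  • Problem description
    When table annotations are exported, the layout of the annotated table is read from Excel, and the condition used to add colspan and rowspan to merged columns or merged rows is wrong.
    For example, take a table with 5 rows and 5 columns in which columns 2 and 3 of the first row are merged and columns 4 and 5 are merged. For a two-column merge ec - sc equals 1, so the check ec - sc > 1 fails and no colspan is written; as a result the first row of the html_list built by the for loop is incorrect.
Actual (incorrect) result: "<tr>", "<td>", "</td>", "<td", ">", "</td>", "<td>", "</td>", "<td", ">", "</td>", "<td>", "</td>", "</tr>"

Expected result: "<tr>", "<td>", "</td>", "<td", " colspan=\"2\"", ">", "</td>", "<td", " colspan=\"2\"", ">", "</td>", "</tr>"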
Current code:
html_list[sr][sc] = ""
if ec - sc > 1:
    html_list[sr][sc] += " colspan={}".format(ec - sc)
if er - sr > 1:
    html_list[sr][sc] += " rowspan={}".format(er - sr)

Modified code:
html_list[sr][sc] = ""
if ec - sc > 0:  # Only add colspan if the column span is more than 1
    html_list[sr][sc] += ' colspan={}'.format(ec - sc + 1)
if er - sr > 0:  # Only add rowspan if the row span is more than 1
    html_list[sr][sc] += ' rowspan={}'.format(er - sr + 1)
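  • Verification
    After the change, annotations were exported for 60 table images. All of them met the model training requirements, and model training completed.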
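
The off-by-one behind Issue 1 can be reproduced in isolation. The following is a minimal sketch, not the PPOCRLabel code itself: the helper names old_span_attrs and new_span_attrs are hypothetical, and it assumes (as the fix implies) that (sr, sc) and (er, ec) are the 0-based, inclusive start and end coordinates of a merged cell.

```python
def old_span_attrs(sr, sc, er, ec):
    """Condition before the fix: treats the end index as exclusive."""
    attrs = ""
    if ec - sc > 1:
        attrs += " colspan={}".format(ec - sc)
    if er - sr > 1:
        attrs += " rowspan={}".format(er - sr)
    return attrs


def new_span_attrs(sr, sc, er, ec):
    """Condition after the fix: the end index is inclusive, so the span is
    (end - start + 1) and any positive difference means a merged cell."""
    attrs = ""
    if ec - sc > 0:
        attrs += " colspan={}".format(ec - sc + 1)
    if er - sr > 0:
        attrs += " rowspan={}".format(er - sr + 1)
    return attrs


# First row of the 5x5 example: columns 2-3 are merged (0-based columns 1-2).
print(repr(old_span_attrs(0, 1, 0, 2)))  # attribute is dropped
print(repr(new_span_attrs(0, 1, 0, 2)))  # attribute is emitted
```

With the old condition a two-column merge yields an empty attribute string, which matches the bare "<td", ">" tokens in the incorrect result above.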

Issue 2: HTML tag compliance of the gt attribute in the exported gt file

  • Problem description
    In the rebuild_html_from_ppstructure_label method, the colspan and rowspan values in the newly generated HTML do not meet the standard for displaying the HTML directly. The colspan and rowspan values must be numbers, but the HTML content is produced in convert_token, where both values are emitted as strings (presumably for model training).
Actual (incorrect) result: <tbody><tr><td></td><td colspan="2">本集团</td><td colspan="2">本银行</td></tr><tr><td></td><td>2023年</td><td>2022年</td><td>2023年</td><td>2022年</td></tr><tr><td></td><td>12月31日</td><td>12月31日</td><td>12月31日</td><td>12月31日</td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td>股权投资</td><td>6,489</td><td>7,131</td><td>6,081</td><td>6,726</td></tr></tbody>

Expected result: <tbody><tr><td></td><td colspan=2>本集团</td><td colspan=2>本银行</td></tr><tr><td></td><td>2023年</td><td>2022年</td><td>2023年</td><td>2022年</td></tr><tr><td></td><td>12月31日</td><td>12月31日</td><td>12月31日</td><td>12月31日</td></tr><tr><td></td><td></td><td></td><td></td><td></td></tr><tr><td>股权投资</td><td>6,489</td><td>7,131</td><td>6,081</td><td>6,726</td></tr></tbody>
Current code:
    html_code = "".join(html_code)
    html_code = "<html><body><table>{}</table></body></html>".format(html_code)

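Modified code:
    # Assumes `re` is imported at module level.
    html_code = "".join(html_code)
    # Strip the quotes around numeric colspan/rowspan values.
    html_code = re.sub(r'(colspan|rowspan)="(\d+)"', r'\1=\2', html_code)
    html_code = "<html><body><table>{}</table></body></html>".format(html_code)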
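  • Verification
    Annotations were exported for 60 images. The gt attribute from gt.txt was copied into a file with the .html extension and opened directly; the reconstructed table displayed in the browser, and all 60 were correct.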
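
The substitution used for Issue 2 can be tried on its own. This is a small sketch using only the standard-library re module; the wrapper name unquote_spans and the sample row are illustrative and do not appear in the PR.

```python
import re


def unquote_spans(html_code):
    """Remove the double quotes around numeric colspan/rowspan values,
    leaving every other attribute untouched."""
    return re.sub(r'(colspan|rowspan)="(\d+)"', r'\1=\2', html_code)


row = '<tr><td></td><td colspan="2">A</td><td rowspan="3">B</td></tr>'
print(unquote_spans(row))
```

The pattern only matches when the quoted value consists entirely of digits, so other quoted attributes pass through unchanged.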

2. Fix HTML tag compliance of the gt attribute in the exported gt file
Collaborator

@GreatV GreatV left a comment


Please install pre-commit and run pre-commit run --all-files to fix the code style.

@GreatV
Collaborator

GreatV commented Sep 13, 2024

Thanks for the contribution. Also, please add a description of these changes to the README, in both the Chinese and English versions.

Collaborator

@GreatV GreatV left a comment


LGTM

@GreatV GreatV merged commit f9a6190 into PFCCLab:main Sep 13, 2024
1 check passed

2 participants