Can the resources and time needed to build .table.bin files be optimized? #583

Closed
yfdyh000 opened this issue Nov 18, 2022 · 2 comments · Fixed by #661

Comments

@yfdyh000

Is your feature request related to a problem? Please describe.
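Some derivative input methods (such as 鸿雁, Hongyan) use very large dictionaries.
However, converting .dict.yaml to .table.bin currently takes a lot of memory and time, and can cause programs such as WeaselDeployer in 小狼毫 (Weasel) to crash when some limit is exceeded. Hongyan instead ships precompiled files and unpacks them into the user's Rime/build directory. I find this inelegant, and it may cause problems later (for example, when the table.bin version is upgraded).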

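Take the dictionary from https://forum.freemdict.com/t/topic/15303/134 as an example: a 90 MB .dict.yaml file with 3 million entry lines takes more than 1 GB of memory to build (deploy) on Windows, and memory usage grows slowly and steadily throughout the build.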

Describe the solution you'd like
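Judging from the relevant code, the current logic appears to read and parse the yaml file, put the data into in-memory objects, and finally serialize them into the output file, without considering performance for large dictionaries. Logically, a large file should be processed as a data stream with results written out as they are produced, rather than keeping all intermediate results in memory. I have not yet looked into the table.bin format definition, but the dict.yaml format does not seem complicated, so it probably should not be fully parsed before anything is written. If the bin file needs to record totals or similar information in its header, that space can be reserved up front, the count kept while processing, and the header rewritten once the file is finished.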
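To make the "reserve the header, count while streaming, patch at the end" idea concrete, here is a minimal sketch. It does not reflect the real table.bin layout (which I have not studied, and which may need sorted or indexed data that cannot be written in a single pass). The `DEMO` magic, the header layout, the record format, and the `stream_convert` name are all hypothetical, and the input handling assumes a dict.yaml whose YAML header ends with a `...` line followed by tab-separated `text<TAB>code` entries.

```python
import io
import struct

# Hypothetical header: 4-byte magic + 32-bit entry count (NOT the real table.bin layout).
HEADER = struct.Struct("<4sI")


def stream_convert(src, dst):
    """Stream 'text<TAB>code' entries from src into length-prefixed records in dst.

    src is an iterable of text lines; dst is a seekable binary file object.
    Only one line is held in memory at a time. Returns the number of entries written.
    """
    header_pos = dst.tell()
    dst.write(HEADER.pack(b"DEMO", 0))  # reserve the header with a placeholder count

    count = 0
    in_body = False
    for line in src:
        line = line.rstrip("\r\n")
        if not in_body:
            # Skip the YAML header; entries are assumed to start after the '...' line.
            in_body = line == "..."
            continue
        if not line or line.startswith("#"):
            continue  # blank line or comment
        text, _, code = line.partition("\t")
        if not code:
            continue  # this sketch ignores entries without a code column
        t = text.encode("utf-8")
        c = code.encode("utf-8")
        dst.write(struct.pack("<HH", len(t), len(c)))  # record: two lengths, then bytes
        dst.write(t)
        dst.write(c)
        count += 1

    # Go back and patch the real count into the reserved header.
    end_pos = dst.tell()
    dst.seek(header_pos)
    dst.write(HEADER.pack(b"DEMO", count))
    dst.seek(end_pos)
    return count


if __name__ == "__main__":
    sample = io.StringIO("---\nname: demo\n...\n\n你好\tni hao\n世界\tshi jie\n")
    out = io.BytesIO()
    print(stream_convert(sample, out), "entries,", len(out.getvalue()), "bytes")
```

With a real file, `src` would be the opened .dict.yaml and `dst` a file opened in `"wb"` mode; memory use then stays roughly constant regardless of the number of entries, at the cost of only supporting formats that can be written in input order.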

Describe alternatives you've considered
If this optimization is feasible, it might also help with the memory limits on mobile platforms mentioned in #121 (comment).

Additional context
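It would be even better if the table.bin format could be made smaller while keeping the same performance; the size inflation caused by conversion seems considerable. Also, would high-performance compression (such as LZ4) be possible?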

@clijiac

clijiac commented Jan 27, 2023

My current simple workaround is to compile a 64-bit rime_deployer.exe myself and use it separately to deploy large dictionaries.

@lotem
Member

lotem commented Jun 12, 2023

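I think that rather than doing these optimizations, it would be better to switch to offline deployment with a standalone conversion tool. The algorithm service program that serves the input method should be as simple as possible.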
