Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io.ReadAll impact the OpenReader function memory usage #2086

Closed
ijustyce opened this issue Feb 6, 2025 · 4 comments
Closed

io.ReadAll impact the OpenReader function memory usage #2086

ijustyce opened this issue Feb 6, 2025 · 4 comments
Labels
duplicate This issue or pull request already exists

Comments

@ijustyce
Copy link

ijustyce commented Feb 6, 2025

OpenReader 源码如下:

func OpenReader(r io.Reader, opts ...Options) (*File, error) {
	b, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	f := newFile()
	f.options = f.getOptions(opts...)
	if err = f.checkOpenReaderOptions(); err != nil {
		return nil, err
	}
	if bytes.Contains(b, oleIdentifier) {
		if b, err = Decrypt(b, f.options); err != nil {
			return nil, ErrWorkbookFileFormat
		}
	}
	zr, err := zip.NewReader(bytes.NewReader(b), int64(len(b)))
	if err != nil {
		if len(f.options.Password) > 0 {
			return nil, ErrWorkbookPassword
		}
		return nil, err
	}
	file, sheetCount, err := f.ReadZipReader(zr)
	if err != nil {
		return nil, err
	}
	f.SheetCount = sheetCount
	for k, v := range file {
		f.Pkg.Store(k, v)
	}
	if f.CalcChain, err = f.calcChainReader(); err != nil {
		return f, err
	}
	if f.sheetMap, err = f.getSheetMap(); err != nil {
		return f, err
	}
	if f.Styles, err = f.stylesReader(); err != nil {
		return f, err
	}
	f.Theme, err = f.themeReader()
	return f, err
}

其中 io.ReadAll 会导致将所有数据加载到内存,这和流式读取的初衷违背;这里 io.ReadAll 的结果 b 的使用相关代码是:

if bytes.Contains(b, oleIdentifier) {
	if b, err = Decrypt(b, f.options); err != nil {
		return nil, ErrWorkbookFileFormat
	}
}
zr, err := zip.NewReader(bytes.NewReader(b), int64(len(b)))

建议将这串代码直接改为:

zr, err := zip.OpenReader(name)

并将 file, sheetCount, err := f.ReadZipReader(zr) 改为 file, sheetCount, err := f.ReadZipReader(&zr.Reader) 即可实现真正的流式读取。
我在本地 replace 后测试,修改后,读取 7.8G 的 excel 文件,内存占用不到 10M,而修改前 32G 内存占用。

但并不确定如下代码的用途:

if bytes.Contains(b, oleIdentifier) {
	if b, err = Decrypt(b, f.options); err != nil {
		return nil, ErrWorkbookFileFormat
	}
}

该代码看起来是兼容老版本的 Excel,但具体我并不清楚,在我的测试中,Excel 文件后缀为:xlsx,且读取的数据和实际数据一致。

所以,我的建议是,能否按上面的修改来确保流式读取是真正的流式读取,而不是先 io.ReadAll 全部读取到内存,然后才流式读取,这无意义。如果上面我不确定用途的代码,有实际含义,那么能否在流式读取中兼容它?如果不兼容,那么能否提供不兼容部分格式的流式读取呢?

类似的 issue 有:https://github.com/qax-os/excelize/issues/1775,但该 issue 已经关闭。

ijustyce pushed a commit to ijustyce/excelize that referenced this issue Feb 6, 2025
@ijustyce
Copy link
Author

ijustyce commented Feb 6, 2025

进一步阅读源码发现上面我提到的代码:

if bytes.Contains(b, oleIdentifier) {
	if b, err = Decrypt(b, f.options); err != nil {
		return nil, ErrWorkbookFileFormat
	}
}

的用途应该是解密 加密的 Excel 2003,oleIdentifier 指的是 CFB 格式,即 Excel 2003,如果 Excel 本身没加密,或采用 Excel 2007 进行加密的,那么这串代码应该可以删除。

@xuri
Copy link
Member

xuri commented Feb 8, 2025

Thanks for your suggestion, would you like to create pull request for this?

@ijustyce
Copy link
Author

ijustyce commented Feb 9, 2025

Thanks for your suggestion, would you like to create pull request for this?

#2087

@xuri xuri added the duplicate This issue or pull request already exists label Feb 12, 2025
@xuri
Copy link
Member

xuri commented Feb 12, 2025

Thanks for your PR. I've left some comments for your PR. Due to duplicated with issue #1775, I closed this.

@xuri xuri closed this as completed Feb 12, 2025
@xuri xuri changed the title 针对 OpenReader 调用了 io.ReadAll 导致读取超大 excel 时内存占用非常大的优化建议 io.ReadAll impact the OpenReader function memory usage Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

2 participants