Read HTML string and generated PDF file in chunks. #949

joshuapinter · 2020-10-26T01:51:05Z

When the HTML content and generated PDF files get too large (how large is "too large" depends on your system, OS, config and available resources), trying to read all of the content into memory can lead to Errno::EINVAL errors like Invalid argument @ io_fread and Invalid argument @ io_write.

Initially, I thought this was a wkhtmltopdf issue so I posted an Issue there instead and quickly discovered the error was happening in wicked_pdf prior to running the wkhtmltopdf command.

Instead of reading this content entirely into memory, this content should be read in chunks to save memory usage.

After some benchmarking, a chunk size of 1MB was picked (1024 * 1024). Here are the benchmarks comparing different methods and chunk sizes for different content sizes:

13027836 bytes
13.03 MBs
                           user     system      total        real
write:                 0.000767   0.004443   0.005210 (  0.005312)
each_char:             5.756789   0.032231   5.789020 (  5.797378)
each_byte:             8.997680   0.067377   9.065057 (  9.179755)
StringIO 1 KB:         0.004029   0.006966   0.010995 (  0.011648)
StringIO 1 MB:         0.016100   0.007118   0.023218 (  0.023509)
StringIO 10 MB:        0.003347   0.006924   0.010271 (  0.010334)
StringIO 100 MB:       0.000456   0.003758   0.004214 (  0.007080)
StringIO 1 GB:         0.000468   0.003787   0.004255 (  0.005037)

706583272 bytes
0.71 GBs
                           user     system      total        real
write:                 0.001035   0.285726   0.286761 (  0.324529)
each_char:           362.444086   1.820033 364.264119 (365.362415)
each_byte:           548.788409   3.254867 552.043276 (553.390843)
StringIO 1 KB:         0.310588   0.331768   0.642356 (  0.697581)
StringIO 1 MB:         0.302101   0.325285   0.627386 (  0.671933)
StringIO 10 MB:        0.254845   0.294017   0.548862 (  0.895430)
StringIO 100 MB:       0.471879   0.429933   0.901812 (  1.181456)
StringIO 1 GB:         0.000471   0.260011   0.260482 (  0.653977)

5577825775 bytes
5.58 GBs
                            user     system       total         real
write:                     ERROR      ERROR       ERROR        ERROR
each_char:           2926.215017  38.658114 2964.873131 (3008.319599)
each_byte:           4305.082576  35.090730 4340.173306 (4363.944091)
StringIO 1 KB:          4.145908   3.962275    8.108183 (   9.490059)
StringIO 1 MB:          3.741062   2.779802    6.520864 (   7.423770)
StringIO 10 MB:         2.916272   2.553926    5.470198 (   6.271349)
StringIO 100 MB:        4.262794   3.007702    7.270496 (  10.986725)
StringIO 1 GB:          2.063459   4.572225    6.635684 (   9.212933)

You can see with the 5.58 GB content size, using write didn't even complete. Instead, I received a Errno::EINVAL error.

This allows significantly larger PDFs to be generated.

Additionally, instead of just throwing a cryptic Errno::EINVAL Invalid argument @ io_fread error, I added a rescue that logs an error with a helpful description indicating if the HTML content or PDF file is too large.

When the HTML content and generated PDF files get quite large (how large is too large depends on your system, OS, config and available resources), trying to read all of the content into memory can lead to `Errno::EINVAL` errors like `Invalid argument @ io_fread` and `Invalid argument @ io_write`. Instead of reading this content entirely into memory, this content should be read in chunks to save memory usage. After some benchmarking, a chunk size of 1MB was picked (`1024 * 1024`). Here are the benchmarks comparing different methods and chunk sizes for different content sizes: ``` 13027836 bytes 13.03 MBs user system total real write: 0.000767 0.004443 0.005210 ( 0.005312) each_char: 5.756789 0.032231 5.789020 ( 5.797378) each_byte: 8.997680 0.067377 9.065057 ( 9.179755) StringIO 1 KB: 0.004029 0.006966 0.010995 ( 0.011648) StringIO 1 MB: 0.016100 0.007118 0.023218 ( 0.023509) StringIO 10 MB: 0.003347 0.006924 0.010271 ( 0.010334) StringIO 100 MB: 0.000456 0.003758 0.004214 ( 0.007080) StringIO 1 GB: 0.000468 0.003787 0.004255 ( 0.005037) 706583272 bytes 0.71 GBs user system total real write: 0.001035 0.285726 0.286761 ( 0.324529) each_char: 362.444086 1.820033 364.264119 (365.362415) each_byte: 548.788409 3.254867 552.043276 (553.390843) StringIO 1 KB: 0.310588 0.331768 0.642356 ( 0.697581) StringIO 1 MB: 0.302101 0.325285 0.627386 ( 0.671933) StringIO 10 MB: 0.254845 0.294017 0.548862 ( 0.895430) StringIO 100 MB: 0.471879 0.429933 0.901812 ( 1.181456) StringIO 1 GB: 0.000471 0.260011 0.260482 ( 0.653977) 5577825775 bytes 5.58 GBs user system total real write: ERROR ERROR ERROR ERROR each_char: 2926.215017 38.658114 2964.873131 (3008.319599) each_byte: 4305.082576 35.090730 4340.173306 (4363.944091) StringIO 1 KB: 4.145908 3.962275 8.108183 ( 9.490059) StringIO 1 MB: 3.741062 2.779802 6.520864 ( 7.423770) StringIO 10 MB: 2.916272 2.553926 5.470198 ( 6.271349) StringIO 100 MB: 4.262794 3.007702 7.270496 ( 10.986725) StringIO 1 GB: 2.063459 4.572225 6.635684 ( 9.212933) ``` You can see with the 5.58 GB content size, using `write` didn't even complete. Instead, I received a `Errno::EINVAL` error. This allows significantly large PDFs to be generated. Additionally, instead of just throwing a cryptic `Errno::EINVAL Invalid argument @ io_fread` error, I added a `rescue` that logs an error with a helpful description indicating if the HTML content or PDF file is too large.

unixmonkey

Very nice. Thank you for this, and all the details in your PR, including benchmarks.

joshuapinter · 2020-10-30T16:44:28Z

Thank you thank you. 🙏

Roy-Mao · 2020-11-12T09:40:09Z

Thanks for this excellent work!

joshuapinter · 2020-11-12T17:03:15Z

@Roy-Mao You're very welcome! Is this something that you ran into as well?

joshuapinter force-pushed the read_content_in_chunks branch from e994a4e to 7685921 Compare October 26, 2020 02:12

unixmonkey added the hacktoberfest-accepted label Oct 30, 2020

unixmonkey approved these changes Oct 30, 2020

View reviewed changes

unixmonkey merged commit 8c007a7 into mileszs:master Oct 30, 2020

joshuapinter deleted the read_content_in_chunks branch October 30, 2020 16:44

unixmonkey mentioned this pull request Oct 31, 2020

Extract reading and writing files in chunks #951

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read HTML string and generated PDF file in chunks. #949

Read HTML string and generated PDF file in chunks. #949

joshuapinter commented Oct 26, 2020 •

edited

Loading

unixmonkey left a comment

joshuapinter commented Oct 30, 2020

Roy-Mao commented Nov 12, 2020

joshuapinter commented Nov 12, 2020

Read HTML string and generated PDF file in chunks. #949

Read HTML string and generated PDF file in chunks. #949

Conversation

joshuapinter commented Oct 26, 2020 • edited Loading

unixmonkey left a comment

Choose a reason for hiding this comment

joshuapinter commented Oct 30, 2020

Roy-Mao commented Nov 12, 2020

joshuapinter commented Nov 12, 2020

joshuapinter commented Oct 26, 2020 •

edited

Loading