
Extremely slow optimizer performance when including large array of strings #39352

Open
joshtriplett opened this issue Jan 27, 2017 · 10 comments

Comments

@joshtriplett
Member

joshtriplett commented Jan 27, 2017

Consider the following test program:

fn main() {
    let words = include!("num_array.rs");
    let mut l = 0;
    for word in words.into_iter() {
        l += word.len();
    }
    println!("{}", l);
}

num_array.rs contains an array of 250k strings:

$ (echo '['; seq 1 250000 | sed 's/.*/"&",/' ; echo ']') > num_array.rs
$ head -n 5 num_array.rs 
[
"1",
"2",
"3",
"4",
$ tail -n 5 num_array.rs 
"249997",
"249998",
"249999",
"250000",
]

Compiling this in debug mode took about 45 seconds; long, and potentially a good test case for compiler profiling, but not completely ridiculous.

Compiling this in release mode showed no signs of finishing after 45 minutes. stracing rustc showed two threads, one blocked in a futex and the other repeatedly allocating and freeing a memory buffer:

mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0

By way of comparison, an analogous C program compiled with GCC takes 4.6s to compile without optimization, or 5.6s with optimization. Python parses and runs an analogous program in 1.2s. So, 45s seems excessive for an unoptimized compile, and 45m+ seems wildly excessive for an optimized compile.

Complete test case (ready to cargo run or cargo run --release): testcase.tar.gz

@brson brson added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-optimization T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-compiletime Issue: Problems and improvements with respect to compile times. labels Jan 27, 2017
@brson
Contributor

brson commented Jan 27, 2017

Great writeup. Thanks!

@nagisa
Member

nagisa commented Jan 27, 2017

Partial timings for a non-optimised build:

time: 0.000; rss: 52MB	parsing
time: 0.693; rss: 159MB	expansion
time: 3.174; rss: 233MB	item-bodies checking
time: 0.164; rss: 244MB	const checking
time: 0.181; rss: 263MB	MIR dump
time: 0.453; rss: 263MB	borrow checking
time: 2.685; rss: 622MB	translation
  time: 9.955; rss: 533MB	codegen passes [0]
time: 10.560; rss: 533MB	LLVM passes
time: 3.200; rss: 144MB	linking
    Finished dev [unoptimized + debuginfo] target(s) in 22.49 secs

So the guess w.r.t. parsing is wrong. Timings for the optimised build are pretty much the same, with the exception of the LLVM passes:

  time: 2.194; rss: 549MB	llvm function passes [0]
  time: ∞; rss: ∞	        llvm module passes [0]
  time: ∞; rss: ∞	        codegen passes [0]
  time: ∞; rss: ∞	        codegen passes [0]

so this is basically yet another super-linear case in the Super-Linear Optimisation Engine.


A notable difference in the IR between clang (which takes no time at all) and rustc is the linkage type of the string constants: clang specifies private unnamed_addr whereas rustc specifies internal.

The main function contains 250k instances of

  %749955 = getelementptr inbounds [250000 x %str_slice], [250000 x %str_slice]* %words, i32 0, i32 249985
  %749956 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 0
  store i8* getelementptr inbounds ([6 x i8], [6 x i8]* @str.1321, i32 0, i32 0), i8** %749956
  %749957 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 1
  store i64 6, i64* %749957

generated to produce the on-stack array, whereas clang produces a constant and memcpys it instead. That is almost guaranteed to be the reason here.

@nagisa
Member

nagisa commented Jan 28, 2017

Potentially fixed by having rvalue static promotion implemented.
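
To illustrate what promotion would buy here, a hand-written sketch (not the proposed compiler change itself): if the array literal can live in static read-only data, main no longer needs 250k individual stores to rebuild it on the stack.

// Hand-written sketch of what rvalue static promotion would achieve for this
// test case (truncated; imagine 250k entries).
static WORDS: [&str; 4] = ["1", "2", "3", "4"];

fn main() {
    let mut l = 0;
    for word in WORDS.iter() {
        l += word.len();
    }
    println!("{}", l);
}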

@nagisa
Member

nagisa commented Jan 28, 2017

So, even if we fixed this particular case of a completely static array, there's still the same issue when the array is not fully static:

[ ..., "100000", some_var, "100001", ...]

I constructed a program like the above in C and made clang compile it without optimisations. It still hasn't finished.

EDIT: it has finished (keep in mind, this is not an optimised build):

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  2190.9531 ( 99.9%)   0.3600 ( 77.1%)  2191.3131 ( 99.9%)  2192.0474 ( 99.9%)  X86 DAG->DAG Instruction Selection

Which sounds about right for where I'd expect the time to be spent.
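
For reference, a Rust analogue of that not-fully-static case (a hypothetical sketch; only the C version was actually tested here) would be:

fn main() {
    // One non-constant element is enough to prevent promoting the whole array
    // to a static constant, so it still has to be built element by element at
    // runtime. (Truncated; imagine 250k entries.)
    let some_var = std::env::args().next().unwrap_or_default();
    let words = ["99999", "100000", some_var.as_str(), "100001", "100002"];
    let l: usize = words.iter().map(|w| w.len()).sum();
    println!("{}", l);
}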

@joshtriplett
Member Author

@nagisa Interesting. I can reproduce that result with clang with or without optimization, but gcc has no problem with such a program when not optimizing. When optimizing, it too doesn't show any signs of finishing.

The test code I used, for reference:

#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char *array[] = {
#include "num_array.h"
        argv[0],
    };
    size_t sum = 0;
    for (int i = 0; i < sizeof(array)/sizeof(*array); i++) {
        sum += strlen(array[i]);
    }
    printf("%zu\n", sum);
    return 0;
}

To generate num_array.h, use seq 1 250000 | sed 's/.*/"&",/' > num_array.h .

@joshtriplett
Member Author

I reported this test case as a GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79266

@semarie
Contributor

semarie commented Feb 17, 2017

I would just point out that the embedded LLVM version is missing two commits (present in 3.9.1) that could be related to this bug:

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 27, 2017
@nox
Contributor

nox commented Apr 2, 2018

Cc @rust-lang/wg-compiler-performance


@avl

avl commented May 30, 2020

For people coming here with a similar problem, looking for a workaround, I've noticed that include_bytes does not slow down compiles as much.

I had a case where I needed to include 100000 strings in a program. Initially I made a static array of them, but this was very slow to compile. I changed it to use two include_bytes instead, one with an array of binary u32 (offset, length) pairs, and one with the actual (UTF-8-encoded) texts.

This compiled very quickly.

Edit: include_bytes is apparently not super-efficient either, but not as bad as a large array (see: #65818 ).
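
A minimal sketch of that workaround (the file names, the little-endian u32 layout, and the get_str helper are assumptions for illustration, not taken from the original code):

use std::convert::TryInto;

// strings.bin: all strings concatenated as UTF-8.
// offsets.bin: one little-endian u32 (offset, length) pair per string.
// Both files are assumed to be generated ahead of time, e.g. by a build script.
static TEXT: &[u8] = include_bytes!("strings.bin");
static INDEX: &[u8] = include_bytes!("offsets.bin");

// Decode the i-th (offset, length) pair and slice out the corresponding string.
fn get_str(i: usize) -> &'static str {
    let entry = &INDEX[i * 8..i * 8 + 8];
    let off = u32::from_le_bytes(entry[..4].try_into().unwrap()) as usize;
    let len = u32::from_le_bytes(entry[4..].try_into().unwrap()) as usize;
    std::str::from_utf8(&TEXT[off..off + len]).expect("strings.bin must be valid UTF-8")
}

fn main() {
    let count = INDEX.len() / 8;
    let total: usize = (0..count).map(|i| get_str(i).len()).sum();
    println!("{}", total);
}

The point of this layout is that the compiler only sees two opaque byte blobs rather than an array with 100000 elements to initialize, which matches the observation that it compiles very quickly.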

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023