
Extremely slow optimizer performance when including large array of strings #39352

Open
joshtriplett opened this issue Jan 27, 2017 · 10 comments

Comments

@joshtriplett
Member

joshtriplett commented Jan 27, 2017

Consider the following test program:

fn main() {
    let words = include!("num_array.rs");
    let mut l = 0;
    for word in words.into_iter() {
        l += word.len();
    }
    println!("{}", l);
}

num_array.rs contains an array of 250k strings:

$ (echo '['; seq 1 250000 | sed 's/.*/"&",/' ; echo ']') > num_array.rs
$ head -n 5 num_array.rs 
[
"1",
"2",
"3",
"4",
$ tail -n 5 num_array.rs 
"249997",
"249998",
"249999",
"250000",
]

Compiling this in debug mode took about 45 seconds; long, and potentially a good test case for compiler profiling, but not completely ridiculous.

Compiling this in release mode showed no signs of finishing after 45 minutes. stracing rustc showed two threads, one blocked in a futex and the other repeatedly allocating and freeing a memory buffer:

mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
mmap(0x7fb665a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb637400000
munmap(0x7fb637400000, 2097152)         = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb665600000, 4194304, MADV_DONTNEED) = 0
madvise(0x7fb63e600000, 12582912, MADV_DONTNEED) = 0
madvise(0x7fb661000000, 25165824, MADV_DONTNEED) = 0
madvise(0x7fb66531a000, 659456, MADV_DONTNEED) = 0
madvise(0x7fb6653da000, 139264, MADV_DONTNEED) = 0
madvise(0x7fb66540c000, 1576960, MADV_DONTNEED) = 0
madvise(0x7fb665a0c000, 2043904, MADV_DONTNEED) = 0
madvise(0x7fb667a00000, 10485760, MADV_DONTNEED) = 0
madvise(0x7fb646000000, 100663296, MADV_DONTNEED) = 0

By way of comparison, an analogous C program compiled with GCC takes 4.6s to compile without optimization, or 5.6s with optimization. Python parses and runs an analogous program in 1.2s. So, 45s seems excessive for an unoptimized compile, and 45m+ seems wildly excessive for an optimized compile.

Complete test case (ready to cargo run or cargo run --release): testcase.tar.gz

@brson brson added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-optimization T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-compiletime Issue: Problems and improvements with respect to compile times. labels Jan 27, 2017
@brson
Contributor

brson commented Jan 27, 2017

Great writeup. Thanks!

@nagisa
Member

nagisa commented Jan 27, 2017

Partial timings for a non-optimised build:

time: 0.000; rss: 52MB	parsing
time: 0.693; rss: 159MB	expansion
time: 3.174; rss: 233MB	item-bodies checking
time: 0.164; rss: 244MB	const checking
time: 0.181; rss: 263MB	MIR dump
time: 0.453; rss: 263MB	borrow checking
time: 2.685; rss: 622MB	translation
  time: 9.955; rss: 533MB	codegen passes [0]
time: 10.560; rss: 533MB	LLVM passes
time: 3.200; rss: 144MB	linking
    Finished dev [unoptimized + debuginfo] target(s) in 22.49 secs

So the guess w.r.t. parsing is wrong. Timings for the optimised build are pretty much the same, with the exception of the LLVM passes:

  time: 2.194; rss: 549MB	llvm function passes [0]
  time: ∞; rss: ∞	        llvm module passes [0]
  time: ∞; rss: ∞	        codegen passes [0]
  time: ∞; rss: ∞	        codegen passes [0]

so this is basically yet another super-linear case in the Super-Linear Optimisation Engine.


A notable difference in the IR between clang (which takes no time at all) and rustc is the linkage type of the string constants: clang specifies private unnamed_addr whereas rustc specifies internal.

The main function contains 250k instances of

  %749955 = getelementptr inbounds [250000 x %str_slice], [250000 x %str_slice]* %words, i32 0, i32 249985
  %749956 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 0
  store i8* getelementptr inbounds ([6 x i8], [6 x i8]* @str.1321, i32 0, i32 0), i8** %749956
  %749957 = getelementptr inbounds %str_slice, %str_slice* %749955, i32 0, i32 1
  store i64 6, i64* %749957

generated to produce the on-stack array, whereas clang produces a constant and memcpys it instead. That is almost guaranteed to be the reason here.

@nagisa
Member

nagisa commented Jan 28, 2017

Potentially fixed by having rvalue static promotion implemented.
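
To illustrate what promotion would buy here, a hand-written sketch (not the proposed compiler change itself): if the array literal can live in static read-only data, main no longer needs 250k individual stores to rebuild it on the stack.

// Hand-written sketch of what rvalue static promotion would achieve for this
// test case (truncated; imagine 250k entries).
static WORDS: [&str; 4] = ["1", "2", "3", "4"];

fn main() {
    let mut l = 0;
    for word in WORDS.iter() {
        l += word.len();
    }
    println!("{}", l);
}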

@nagisa
Member

nagisa commented Jan 28, 2017

So, even if we fixed this particular case of a completely static array, there's still the same issue when the array is not fully static:

[ ..., "100000", some_var, "100001", ...]

I constructed a program like the above in C and made clang compile it without optimisations. It still hasn't finished.

EDIT: it has finished (keep in mind, this is not an optimised build):

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  2190.9531 ( 99.9%)   0.3600 ( 77.1%)  2191.3131 ( 99.9%)  2192.0474 ( 99.9%)  X86 DAG->DAG Instruction Selection

Which sounds about right for where I'd expect the time to be spent.
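
For reference, a Rust analogue of that not-fully-static case (a hypothetical sketch; only the C version was actually tested here) would be:

fn main() {
    // One non-constant element is enough to prevent promoting the whole array
    // to a static constant, so it still has to be built element by element at
    // runtime. (Truncated; imagine 250k entries.)
    let some_var = std::env::args().next().unwrap_or_default();
    let words = ["99999", "100000", some_var.as_str(), "100001", "100002"];
    let l: usize = words.iter().map(|w| w.len()).sum();
    println!("{}", l);
}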

@joshtriplett
Member Author

@nagisa Interesting. I can reproduce that result with clang with or without optimization, but gcc has no problem with such a program when not optimizing. When optimizing, it too doesn't show any signs of finishing.

The test code I used, for reference:

#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char *array[] = {
#include "num_array.h"
        argv[0],
    };
    size_t sum = 0;
    for (int i = 0; i < sizeof(array)/sizeof(*array); i++) {
        sum += strlen(array[i]);
    }
    printf("%zu\n", sum);
    return 0;
}

To generate num_array.h, use seq 1 250000 | sed 's/.*/"&",/' > num_array.h .

@joshtriplett
Member Author

I reported this test case as a GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79266

@semarie
Contributor

semarie commented Feb 17, 2017

I would just point out that the embedded LLVM version is missing two commits (present in 3.9.1) that could be related to this bug:

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 27, 2017
@nox
Contributor

nox commented Apr 2, 2018

Cc @rust-lang/wg-compiler-performance


@avl

avl commented May 30, 2020

For people coming here with a similar problem, looking for a workaround, I've noticed that include_bytes does not slow down compiles as much.

I had a case where I needed to include 100000 strings in a program. Initially I made a static array of them, but this was very slow to compile. I changed it to use two include_bytes instead, one with an array of binary u32 (offset, length) pairs, and one with the actual (UTF-8-encoded) texts.

This compiled very quickly.

Edit: include_bytes is apparently not super-efficient either, but not as bad as a large array (see: #65818 ).
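
A minimal sketch of that workaround (the file names, the little-endian u32 layout, and the get_str helper are assumptions for illustration, not taken from the original code):

use std::convert::TryInto;

// strings.bin: all strings concatenated as UTF-8.
// offsets.bin: one little-endian u32 (offset, length) pair per string.
// Both files are assumed to be generated ahead of time, e.g. by a build script.
static TEXT: &[u8] = include_bytes!("strings.bin");
static INDEX: &[u8] = include_bytes!("offsets.bin");

// Decode the i-th (offset, length) pair and slice out the corresponding string.
fn get_str(i: usize) -> &'static str {
    let entry = &INDEX[i * 8..i * 8 + 8];
    let off = u32::from_le_bytes(entry[..4].try_into().unwrap()) as usize;
    let len = u32::from_le_bytes(entry[4..].try_into().unwrap()) as usize;
    std::str::from_utf8(&TEXT[off..off + len]).expect("strings.bin must be valid UTF-8")
}

fn main() {
    let count = INDEX.len() / 8;
    let total: usize = (0..count).map(|i| get_str(i).len()).sum();
    println!("{}", total);
}

The point of this layout is that the compiler only sees two opaque byte blobs rather than an array with 100000 elements to initialize, which matches the observation that it compiles very quickly.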

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023