
int_vector is substantially slower than std::vector #43

Open
h-2 opened this issue Jul 2, 2018 · 7 comments
Comments

h-2 commented Jul 2, 2018

Even if the entire amount of storage is pre-reserved (so no memory allocations occur), calling push_back on the SDSL vector is substantially slower than doing the same on std::vector (up to 10x slower).

push_back<std::vector, uint8_t, true>      1 ns          1 ns  847837409 alph_size=256 preallocated_mem=1 sizeof=1
push_back<std::vector, uint16_t, true>     1 ns          1 ns  658253559 alph_size=65.536k preallocated_mem=1 sizeof=2
push_back<std::vector, uint32_t, true>     2 ns          2 ns  448232364 alph_size=4.29497G preallocated_mem=1 sizeof=4
push_back<std::vector, uint64_t, true>     2 ns          2 ns  277192593 preallocated_mem=1 sizeof=8
push_back<sdsl_int_vec, uint8_t, true>    10 ns         10 ns   71418369 alph_size=256 preallocated_mem=1 sizeof=1
push_back<sdsl_int_vec, uint16_t, true>    9 ns          9 ns   73339899 alph_size=65.536k preallocated_mem=1 sizeof=2
push_back<sdsl_int_vec, uint32_t, true>   10 ns         10 ns   66514001 alph_size=4.29497G preallocated_mem=1 sizeof=4
push_back<sdsl_int_vec, uint64_t, true>   11 ns         11 ns   57827822 preallocated_mem=1 sizeof=8

(The fourth column is the number of iterations performed in a fixed amount of time.)

I would have thought that, at least for the built-in integer types, the difference shouldn't be this noticeable?

99991 commented Apr 6, 2022

The function end() is at fault here:

iterator end() noexcept { return int_vector_trait<t_width>::end(this, m_data, (m_size / m_width)); }

m_size / m_width is a relatively slow operation: because m_width is a run-time value, the compiler cannot prove that it never changes, so it must emit an actual division instruction instead of optimizing the operation to a right shift (which it would do for a constant power-of-two divisor).

Here is a benchmark using the slow end function:

#include <iostream>
#include <chrono>
#include <sdsl/bit_vectors.hpp>

#include <vector>

using namespace std;
using namespace sdsl;

void run_benchmark(){
    size_t num_runs = 1000 * 1000;

    int_vector<64> values;

    values.reserve(num_runs);

    chrono::steady_clock::time_point begin = chrono::steady_clock::now();

    for (size_t i = 0; i < num_runs; i++){
        values.amortized_resize(i + 1);

        // Uncomment to make it go fast.
        //values.m_width = 64;
        auto end = int_vector_trait<64>::end(&values, values.m_data, (values.m_size / values.m_width));

        *(end - 1) = i;
    }

    chrono::steady_clock::time_point end = chrono::steady_clock::now();

    double mean = chrono::duration_cast<chrono::nanoseconds> (end - begin).count() / (double)num_runs;

    cout << "On average, it took " << mean << " nanoseconds to push_back one value." << endl;
}

int main(){

    // Run benchmark a few times just to be sure.
    for (size_t i = 0; i < 9; i++){
        cout << "Run " << i + 1 << ": ";
        run_benchmark();
    }

    return 0;
}

And here are the results on my computer:

Run 9: On average, it took 8.11565 nanoseconds to push_back one value.

And after uncommenting the line values.m_width = 64;, the compiler is smart enough to make the optimization:

Run 9: On average, it took 1.71234 nanoseconds to push_back one value.

I do not understand why m_width exists. It seems like it is initialized with t_width at the beginning and then never written to again. It might be a good idea to simply delete m_width and replace all occurrences with t_width instead.

h-2 (Author) commented Apr 6, 2022

> I do not understand why m_width exists. It seems like it is initialized with t_width at the beginning and then never written to again. It might be a good idea to simply delete m_width and replace all occurrences with t_width instead.

I haven't double-checked this, but it seems like an easy fix.

@eseiler what do you think?

h-2 (Author) commented Apr 6, 2022

I just checked. The width can be dynamic:

* \tparam t_width Width of the integer. If set to `0` it is variable
* during runtime, otherwise fixed at compile time.

void width(uint8_t new_width) noexcept { int_vector_trait<t_width>::set_width(new_width, m_width); }

static void set_width(uint8_t new_width, int_width_type & width) noexcept
{
    if (t_width == 0)
    {
        if (0 < new_width and new_width <= 64)
            width = new_width;
        else
            width = 64;
    }
}

So removing m_width is not an option.

However, replacing:

iterator end() noexcept { return int_vector_trait<t_width>::end(this, m_data, (m_size / m_width)); }

with the following should work:

iterator end() noexcept
{ 
    if constexpr (t_width == 0) // dynamic width
        return int_vector_trait<t_width>::end(this, m_data, (m_size / m_width));
    else
        return int_vector_trait<t_width>::end(this, m_data, (m_size / t_width));
}

eseiler (Collaborator) commented Apr 19, 2022

Sorry for the late reply, I was sick :(

Yes, we need both, as the bit width may be dynamic.
I'll try to get around to implementing the if constexpr and adding a benchmark (or running one from seqan3).

eseiler (Collaborator) commented Apr 19, 2022

#102

I managed to go from 9.2ns to 2.8ns.

@99991 Are you able to reproduce your times with my PR?

99991 commented Apr 19, 2022

@eseiler I get around 3.5 ns with the linked benchmark and 2 ns after replacing sdsl::int_vector<64> with std::vector<int64_t>.

eseiler (Collaborator) commented Apr 20, 2022

Thanks for the feedback, @99991 !

I tweaked it a bit more. The benchmark now uses random values.
I got rid of some expensive math functions in amortized_resize, which now also has a version for t_width and one for m_width.

I think I'll go ahead and merge #102, as it already improves performance quite a lot.
Having said that, I would be very curious whether you also see the improvements I saw from changing amortized_resize.

Another thing to test would be the growth_factor. There's a comment in the benchmark.
Setting it to 2 often gets me close to 2ns. Leaving it at the default (1.5) is never much faster than 2.5ns.

One thing to note is that the benchmark is a microbenchmark, and even with a moderately sized vector (in the benchmark it's 64 million elements, so it should avoid most cache effects), the numbers fluctuate quite a bit.
