Performance improvements #365
Very nice!
I can confirm the improvements. I shall create a feature branch.
There are tests failing on Linux: https://travis-ci.org/nlohmann/json/builds/177884351
dump jeopardy.json with indent |
@gregmarr When I ran the benchmarks, I did not notice the deviation. However, the numbers differed from run to run. I shall have a second look and post my results here. Right now, I need to understand why the tests fail on Linux...
@gregmarr, this must be experimental variance.
Why errors, though... The changes looked so functionally invariant to me.
Ah, I had assumed that this already did multiple runs and averaged the results to eliminate the variance.
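(A minimal sketch of that multiple-runs idea, as an illustration only; whether benchpress repeats and averages runs is not established in this thread. Timing a workload several times and keeping the fastest run filters out most scheduler noise:)

#include <algorithm>
#include <chrono>
#include <limits>

// time a callable several times and keep the fastest run; the minimum is
// less sensitive to scheduler hiccups than a single measurement or a mean
template <typename F>
double best_of(const int runs, F&& workload)
{
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < runs; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        workload();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}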
I'm not sure if it's causing the failures here, but the reinterpret_cast might be a problem.
The former essentially converts a character from the lexer type to the string type. If the two types are different sizes, the compiler converts from one to the other. The latter treats a pointer to an array of lexer characters as a pointer to an array of string characters. If the two character types are different sizes, then this is a major bug, as the string data is interpreted improperly. Additionally, if the string character type is larger than the lexer character type, then this is an out-of-bounds array access. You probably want to use this overload instead:
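(The snippets this comment refers to are elided in this excerpt. As a hedged reconstruction of the contrast being drawn, with the type name lexer_char_t taken from the thread and the data invented:)

#include <string>

using lexer_char_t = unsigned char; // assumption: a byte-sized lexer character

int main()
{
    const lexer_char_t buf[] = {'t', 'r', 'u', 'e'};
    std::string s = "old contents";

    // hazardous pattern: reinterpreting the pointer is only valid while
    // lexer_char_t and the string's character type have the same size
    s.assign(reinterpret_cast<const char*>(buf), sizeof(buf));

    // safer pattern: the iterator-pair overload converts element by element
    // and stays correct even if the two character types differ in size;
    // assign() also discards the previous contents, so no clear() is needed
    s.assign(buf, buf + sizeof(buf));
}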
Similarly, here:
should probably be
Note that with assign you don't need to clear first.
The tests seem to succeed, but the benchmarks are not that stable:
I won't merge this right now. I think there should be an improvement, but it seems to be time to change the benchmarking tool... Any ideas?
WRT benchmarking:

WRT changes: I'm not saying that there's an active bug, but that it is fraught with danger. Perhaps make use of m_line_buffer_tmp there? Copying into the temporary and then swapping means the source buffer is never mutated while it is being read:

// copy unprocessed characters to line buffer
m_line_buffer_tmp.assign(m_start, m_limit);
std::swap(m_line_buffer, m_line_buffer_tmp);
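(A standalone sketch of that suggestion, reusing the member names from the thread as locals; the pattern sidesteps any question of assigning a string a range of itself:)

#include <string>
#include <utility>

int main()
{
    std::string m_line_buffer = "processed unprocessed";
    std::string m_line_buffer_tmp;

    // pointers into the buffer, as the lexer keeps them
    const char* m_start = m_line_buffer.data() + 10;
    const char* m_limit = m_line_buffer.data() + m_line_buffer.size();

    // copy unprocessed characters to the temporary, then swap buffers
    m_line_buffer_tmp.assign(m_start, m_limit);
    std::swap(m_line_buffer, m_line_buffer_tmp);
}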
I think the intent is that the code under
However, the current test doesn't really cover that properly. m_start can point into m_line_buffer without being at its start.
@gregmarr, the following innocent-looking change causes json_benchmarks to segfault for me, so here be dragons.

--- json.hpp.1 2016-11-23 08:05:24.107402000 -0500
+++ json.hpp 2016-11-23 08:07:30.069845000 -0500
@@ -8733,6 +8733,7 @@
if (m_start != reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()))
{
// copy unprocessed characters to line buffer
+ m_line_buffer.reserve(static_cast<size_t>(n + m_limit - m_start));
m_line_buffer.assign(m_start, m_limit);
m_cursor = m_limit;
}

@nlohmann I'd hope this would do more good than pin-thread-to-cpu, but I have not tried it.
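(My reading of why that reserve() call can crash, not confirmed in the thread: m_start and m_limit may point into m_line_buffer itself, and reserve() may reallocate the buffer, leaving both pointers dangling before assign() reads from them. A distilled reproduction of the hazard:)

#include <string>

int main()
{
    std::string buf = "some buffered json input";
    const char* start = buf.data() + 5;
    const char* limit = buf.data() + buf.size();

    buf.reserve(buf.capacity() * 2); // may reallocate: start/limit now dangle
    buf.assign(start, limit);        // undefined behavior if it reallocated
}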
I'll have a look. It may take some time, because I got a new computer and will need some time to set up...
That change is unnecessary but does point out that the check needs to be improved to detect more cases where the buffer is in use.
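(A hedged sketch of what a broader "buffer in use" test could look like; this is my guess at the shape of the improvement, not the change that was actually committed. A range test detects m_start anywhere inside the buffer, where the equality test in the diff above only detects it at the very front:)

#include <string>

int main()
{
    std::string m_line_buffer = "buffered input";
    const char* m_start = m_line_buffer.data() + 4; // mid-buffer, not at data()

    const char* begin = m_line_buffer.data();
    const char* end = begin + m_line_buffer.size();
    const bool buffer_in_use = (begin <= m_start) && (m_start <= end);

    return buffer_in_use ? 0 : 1; // 0 here: m_start aliases m_line_buffer
}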
I refactored the benchmarking code as follows; I think this yields timings as consistently as we can reasonably expect.

#define BENCHPRESS_CONFIG_MAIN
#include <fstream>
#include <iomanip> // std::setw (used when dumping with indent)
#include <sstream>
#include <benchpress.hpp>
#include <json.hpp>
#include <pthread.h>
#include <thread>
// pin the process to one core at startup to reduce scheduler-induced variance
struct StartUp
{
    StartUp()
    {
#ifndef __llvm__
        cpu_set_t cpuset;
        pthread_t thread;
        thread = pthread_self();
        CPU_ZERO(&cpuset);
        // pin to the last core
        CPU_SET(std::thread::hardware_concurrency() - 1, &cpuset);
        pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
#endif
    }
};
StartUp startup;
enum class EMode { input, output_no_indent, output_with_indent };

static void bench(
    benchpress::context& ctx,
    const std::string& in_path,
    const EMode mode)
{
    // using stringstreams for benchmarking
    // to factor out cold-cache disk access
    std::stringstream istr;
    {
        std::ifstream input_file(in_path);
        istr << input_file.rdbuf();
        input_file.close();

        // read the stream once
        nlohmann::json j;
        j << istr;
        istr.clear(); // clear flags
        istr.seekg(0);
    }

    ctx.reset_timer();
    if (mode == EMode::input) {
        // benchmarking input
        for (size_t i = 0; i < ctx.num_iterations(); ++i)
        {
            istr.clear(); // clear flags
            istr.seekg(0);
            nlohmann::json j;
            j << istr;
        }
    } else {
        // benchmarking output
        nlohmann::json j;
        j << istr;
        std::stringstream ostr;

        ctx.reset_timer();
        for (size_t i = 0; i < ctx.num_iterations(); ++i)
        {
            if (mode == EMode::output_no_indent) {
                ostr << j;
            } else {
                ostr << std::setw(4) << j;
            }
            ostr.str(std::string()); // reset data
        }
    }
}
#define BENCHMARK_I(mode, title, in_path)           \
    BENCHMARK((title), [](benchpress::context* ctx) \
    {                                               \
        bench(*ctx, (in_path), (mode));             \
    })
BENCHMARK_I(EMode::input, "parse jeopardy.json", "benchmarks/files/jeopardy/jeopardy.json")
BENCHMARK_I(EMode::input, "parse canada.json", "benchmarks/files/nativejson-benchmark/canada.json");
BENCHMARK_I(EMode::input, "parse citm_catalog.json", "benchmarks/files/nativejson-benchmark/citm_catalog.json");
BENCHMARK_I(EMode::input, "parse twitter.json", "benchmarks/files/nativejson-benchmark/twitter.json");
BENCHMARK_I(EMode::input, "parse numbers/floats.json", "benchmarks/files/numbers/floats.json");
BENCHMARK_I(EMode::input, "parse numbers/signed_ints.json", "benchmarks/files/numbers/signed_ints.json");
BENCHMARK_I(EMode::input, "parse numbers/unsigned_ints.json", "benchmarks/files/numbers/unsigned_ints.json");
// unsigned_ints can be yanked, as it is largely similar to signed_ints
BENCHMARK_I(EMode::output_no_indent, "dump jeopardy.json", "benchmarks/files/jeopardy/jeopardy.json");
BENCHMARK_I(EMode::output_with_indent, "dump jeopardy.json with indent", "benchmarks/files/jeopardy/jeopardy.json");
BENCHMARK_I(EMode::output_no_indent, "dump numbers/floats.json", "benchmarks/files/numbers/floats.json");
BENCHMARK_I(EMode::output_no_indent, "dump numbers/signed_ints.json", "benchmarks/files/numbers/signed_ints.json"); Observations: after @gregmarr's fixes the performance improvement is still there, except for parse jeopardy.json, whose performance is in-between that of develop branch (slowest) and the head of issue365 branch (my original proposal with the sloppy reinterpret_cast). |
Indeed - the numbers of the modified benchmark program are much more stable. I'll commit an update in a minute. I added the code for the unsigned integers because I was planning to split the integer conversion into a signed and an unsigned case.
Ok, I let some more benchmarks run, and the code in https://github.com/nlohmann/json/tree/feature/issue365 is faster than the one in the development branch. So where should we go from here? Do you still see potential errors?
I'll take a closer look at the code section that looked suspicious.
Please see below.

index 8704134..33afb99 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -8719,23 +8719,47 @@ basic_json_parser_66:
*/
void fill_line_buffer(size_t n = 0)
{
+ assert(m_line_buffer.empty()
+ || m_content == reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()));
+
+ assert(m_line_buffer.empty()
+ || m_limit == m_content + m_line_buffer.size());
+
+ assert(m_content <= m_start);
+ assert(m_start <= m_cursor);
+ assert(m_cursor <= m_limit);
+ assert(m_marker <= m_limit || !m_marker);
+
// number of processed characters (p)
- const auto offset_start = m_start - m_content;
+ const size_t num_processed_chars = static_cast<size_t>(m_start - m_content);
+
// offset for m_marker wrt. to m_start
const auto offset_marker = (m_marker == nullptr) ? 0 : m_marker - m_start;
+
// number of unprocessed characters (u)
const auto offset_cursor = m_cursor - m_start;
// no stream is used or end of file is reached
if (m_stream == nullptr or m_stream->eof())
{
- // skip this part if we are already using the line buffer
- if (m_start != reinterpret_cast<const lexer_char_t*>(m_line_buffer.data()))
+
+#if 1 //https://stackoverflow.com/questions/28142011/can-you-assign-a-substring-of-a-stdstring-to-itself
+
+ // m_start may or may not be pointing into
+ // m_line_buffer at this point. We trust the
+ // standard library to do the right thing.
+ m_line_buffer.assign(m_start, m_limit);
+#else
+ if(m_line_buffer.empty())
{
- // copy unprocessed characters to line buffer
+ // start using the buffer
m_line_buffer.assign(m_start, m_limit);
- m_cursor = m_limit;
+ } else {
+ // delete processed characters from line buffer.
+ // semantically this is the same as the .assign(...) above
+ m_line_buffer.erase(0, num_processed_chars);
}
+#endif
// append n characters to make sure that there is sufficient
// space between m_cursor and m_limit
@@ -8745,7 +8769,8 @@ basic_json_parser_66:
else
{
// delete processed characters from line buffer
- m_line_buffer.erase(0, static_cast<size_t>(offset_start));
+ m_line_buffer.erase(0, num_processed_chars);
+
// read next line from input stream
m_line_buffer_tmp.clear();
std::getline(*m_stream, m_line_buffer_tmp, '\n');
@@ -8756,7 +8781,7 @@ basic_json_parser_66:
}
// set pointers
- m_content = reinterpret_cast<const lexer_char_t*>(m_line_buffer.c_str());
+ m_content = reinterpret_cast<const lexer_char_t*>(m_line_buffer.data());
assert(m_content != nullptr);
m_start = m_content;
m_marker = m_start + offset_marker;
The standard says that the assign from iterators is valid, as it is equivalent to creating a temporary string from the iterators and assigning from that temporary:
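(A compilable demonstration of that guarantee; the string contents are made up:)

#include <cassert>
#include <string>

int main()
{
    std::string s = "processed unprocessed";

    // keep only the tail, analogous to fill_line_buffer() keeping
    // [m_start, m_limit): well-defined even though the range lies inside s,
    // because assign(first, last) behaves as if a temporary were built first
    s.assign(s.begin() + 10, s.end());
    assert(s == "unprocessed");
}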
@gregmarr, I see now. |
One more thing:

diff --git a/src/json.hpp b/src/json.hpp
index 8704134..8ced3e7 100644
--- a/src/json.hpp
+++ b/src/json.hpp
@@ -8843,7 +8843,9 @@ basic_json_parser_66:
auto e = std::find(i, m_cursor - 1, '\\');
if (e != i)
{
- result.append(i, e);
+ for(auto k = i; k < e; k++) {
+ result.push_back(static_cast<typename string_t::value_type>(*k));
+ }
i = e - 1; // -1 because of ++i
}
else

The original intent of using reinterpret_cast was to hint to the compiler: "hey, you can move a contiguous chunk of bytes from here to there instead of shuffling them one byte at a time". Unfortunately, that turned out to be a problem.
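(A standalone illustration of that trade-off, with lexer_char_t again assumed to be a byte-sized type as in the sketch earlier in this thread:)

#include <string>

using lexer_char_t = unsigned char; // assumption, as above

int main()
{
    const lexer_char_t raw[] = {'j', 's', 'o', 'n'};
    std::string bulk;
    std::string per_char;

    // bulk append: one contiguous, memcpy-friendly copy, but only valid
    // while both character types are single bytes
    bulk.append(reinterpret_cast<const char*>(raw), sizeof(raw));

    // per-character copy: well-defined for any combination of character
    // types, and compilers can often vectorize it anyway
    for (const auto c : raw)
    {
        per_char.push_back(static_cast<char>(c));
    }
}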
@TurpentineDistillery I wasn't sure if assign was safe, so I went to the standard to check just before I posted that. :)
Which STL are you testing with? That sounds like it could be a point of variance among libraries.
@gregmarr, very likely the case.
@TurpentineDistillery @gregmarr Thanks for the further insights. I pushed them to the feature branch (see 2773038).
@nlohmann,
@TurpentineDistillery, I am using re2c 0.16. I once started on this myself - maybe you can recycle some code from branch feature/unfolded_string_to_int - but it's been a while since I worked on this.
I looked at the code in that branch, and it looks basically like what I envisioned it would look like if I were to make the changes, so I don't think there's anything else for me to do there.

I reviewed the implementation of the parser. In the doc you expressed hope that an LALR parser would be more efficient than recursive descent. I'm not an authority on parsers, but I'm pretty sure that's not the case here. There's no place in the grammar where the recursive descent may go "off-track" by more than one token, and even that may happen only rarely. This leads me to believe that the implementation of the parser is optimal from a theoretical standpoint.

Elsewhere you said that you think the lexer code is near-optimal, but looking at the expansion of the auto-generated re2c code I can neither confirm nor deny that : ). The re2c folks say "the generated code is as good as a carefully tuned hand-crafted C/C++ lexer", but I dunno...
@TurpentineDistillery Thanks for checking back. I shall merge the branch. If anything else comes up, we can have a fresh look.

There is one reason I would like to have an LALR parser: by using a parser generator such as Bison, the actual hand-written code would be concise and easy to verify compared to the recursive descent. In that sense, I am happy to use re2c, because it is easy to check the regular expressions it is given. However, I haven't found any tool that generates code that is easy to put into a header-only library.

At the same time, my confidence in the current parser is quite high. I had some deeper looks into the code generated by re2c when I tried to improve the code coverage. It is not straightforward, but in the end it works with a reduced DFA, and I haven't seen anything odd there.
Merged: a8522f3.
I propose some minor changes that yield substantial performance improvements.