Allow timestamps to be in epoch ms #850

Closed
felixbarny opened this issue Apr 17, 2018 · 24 comments

@felixbarny (Member) commented Apr 17, 2018

Currently, timestamps in the intake API have to be an ISO-formatted date string. Profiling has shown that this is actually the biggest cause of object allocations when serializing the payloads as JSON. Could we implement support for optionally transmitting the timestamp as a (64-bit) integer in epoch milliseconds?

This applies to Transaction.timestamp and Error.timestamp.
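
To make the proposal concrete, here is a minimal Java sketch contrasting the two representations (the method names are mine, purely illustrative):

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class TimestampFormats {

    // Current intake format: an ISO date string such as "2018-04-17T09:02:40.123Z".
    // Formatting allocates at least the resulting String on every call.
    static String isoTimestamp(long epochMillis) {
        return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
    }

    // Proposed alternative: write the epoch milliseconds as a 64-bit integer.
    // Appending a long to a reused buffer allocates nothing per call.
    static void appendEpochTimestamp(StringBuilder json, long epochMillis) {
        json.append("\"timestamp\":").append(epochMillis);
    }
}
```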

@beniwohli (Contributor) commented:

Wow, that's surprising. I know that parsing ISO datetimes can be a bottleneck (which is why we nailed the format down pretty strictly, so we can parse it with a regex), but creating a datetime string as well? In Python, we more or less just call the libc implementation of strftime, which is probably some of the most tuned code you can find.

To be honest, I'm not a huge fan of accepting multiple datetime formats. Also, an integer epoch doesn't carry any timezone information, so we can't be sure that it's UTC.
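
(@beniwohli is describing the Python agent, but the "nail down the format so a regex suffices" idea translates. A hedged Java sketch of the approach; the pattern here is illustrative, not the actual one:)

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictIsoParser {

    // Only one exact shape is accepted, e.g. "2018-04-17T09:02:40.123Z",
    // so a single regex match replaces a general-purpose datetime parser.
    private static final Pattern ISO_UTC = Pattern.compile(
        "(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2})\\.(\\d{3})Z");

    static long parseEpochMillis(String timestamp) {
        Matcher m = ISO_UTC.matcher(timestamp);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected timestamp format: " + timestamp);
        }
        return LocalDateTime.of(
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
                Integer.parseInt(m.group(5)), Integer.parseInt(m.group(6)),
                Integer.parseInt(m.group(7)) * 1_000_000) // millis -> nanos
            .toInstant(ZoneOffset.UTC).toEpochMilli();
    }
}
```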

@watson (Contributor) commented Apr 17, 2018

@beniwohli we could just make it a really clear requirement that it has to be UTC, but I agree that we increase the risk of getting it wrong if we don't state the timezone explicitly.

@felixbarny (Member, Author) commented:

> Also, an integer epoch doesn't carry any timezone information, so we can't be sure that it's UTC.

An epoch timestamp is always in UTC by definition.

Here is a screenshot of an allocation profile of the Java agent where you can see that one of the biggest causes of allocations is parsing the date string during JSON serialization:
[screenshot: allocation profile with date string handling among the top allocation sites]

@felixbarny (Member, Author) commented Apr 17, 2018

This change would reduce the allocation rate overhead by about 30%, which brings allocation rates down to under 1 KB per request (including the reporting of the transactions).

@beniwohli (Contributor) commented:

> An epoch timestamp is always in UTC by definition.

Yup, you're right, I retract that argument. Sorry about that.

Still, my surprise remains. Compared to all the other things we do, transaction and error timestamps have a relatively low cardinality. Does your test suite do any non-trivial work or is it just serializing more or less empty transactions?

I'd like to avoid complicating the API due to a case of premature optimization, is all.

@felixbarny (Member, Author) commented Apr 17, 2018

> Does your test suite do any non-trivial work or is it just serializing more or less empty transactions?

In this benchmark, I'm measuring recording an HTTP request and a SQL query and reporting the transactions to a mock APM Server over HTTP. The benchmark deliberately does no other work, because the point is to measure only the overhead imposed by the agent. Allocation rates are just one metric of this benchmark, but that simple change would reduce the allocation rate overhead by about 30%.

@felixbarny (Member, Author) commented:

Note that by default, the Java agent only captures stack traces when a span takes longer than 5 ms; otherwise, that would probably be the dominating factor. Also, I implemented ignore_urls in a way that avoids allocations by not using regex pattern matching, which would otherwise allocate a lot of memory as well (a sketch of the idea follows). So there is not much left that allocates memory, and one of those things happens to be the construction of the ISO timestamp String.
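
(A simplified sketch of that idea, not the agent's actual implementation: match prefix/suffix wildcards by walking the characters directly, so no Matcher or substring objects are allocated per check:)

```java
public class UrlIgnoreMatcher {

    // Matches patterns of the form "/foo/*" (prefix) or "*/bar" (suffix)
    // by comparing characters in place, unlike java.util.regex, which
    // allocates a Matcher (and often substrings) per match.
    static boolean isIgnored(CharSequence url, String pattern) {
        if (pattern.endsWith("*")) {
            return regionMatches(url, 0, pattern, 0, pattern.length() - 1);
        }
        if (pattern.startsWith("*")) {
            int offset = url.length() - (pattern.length() - 1);
            return offset >= 0 && regionMatches(url, offset, pattern, 1, pattern.length() - 1);
        }
        return pattern.contentEquals(url);
    }

    private static boolean regionMatches(CharSequence url, int urlOffset,
                                         String pattern, int patternOffset, int length) {
        if (urlOffset + length > url.length()) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (url.charAt(urlOffset + i) != pattern.charAt(patternOffset + i)) {
                return false;
            }
        }
        return true;
    }
}
```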

@beniwohli (Contributor) commented:

My point is that if the agent doesn't do much more than creating an almost empty transaction, not doing those allocations looks like it has a huge impact. But if you measure a more realistic transaction with a couple dozen spans with contexts and frames and whatnot, the impact might suddenly look much smaller.

All I'm saying is that we should be wary of taking such drastic steps as changing the intake API for something that might just be a micro-optimization.

@felixbarny (Member, Author) commented Apr 17, 2018

Keep in mind that we do report all transactions, even when not sampling them. In order for this to work at scale, we have to bring the overhead for creating and reporting a non-sampled transaction down to the bare minimum IMHO.

Also, creating transactions and spans involves close to zero allocations, so it does not really matter how many spans there are in a transaction. Note that in this benchmark, the context is fully populated, with HTTP context in the transaction and DB context in the span, including the SQL statement, so it is somewhat realistic.

@felixbarny (Member, Author) commented:

> Also, creating transactions and spans involves close to zero allocations, so it does not really matter how many spans there are in a transaction.

This is not quite accurate, as monitoring SQL queries currently does allocate some memory (see elastic/apm-agent-java#35).

@roncohen (Contributor) commented:

That is indeed surprising.

Googling a bit, it seems DateTimeFormatter is not very efficient. Could we investigate an alternative implementation?

@felixbarny (Member, Author) commented:

There is a newer date API in Java; the date parser might be more efficient there. But it is only available in Java 8+, and we support Java 7 applications.

Not sure what you mean by event.toISOString().

Using an epoch timestamp in an API seems like quite a normal thing to do, and I'd even consider it more robust than parsing strings. It would also be a backwards-compatible change if we support both in the JSON schema, and it doesn't seem to add a lot of complexity. So I can't see a lot of downsides to this approach.
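
(As a sketch of what backwards-compatible acceptance could look like on the receiving side, assuming a Jackson-style tree model; purely illustrative:)

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.time.Instant;

public class TimestampField {

    // Accept either the existing ISO string or the proposed epoch-ms integer.
    // A JSON schema can express this as {"type": ["string", "integer"]}.
    static Instant readTimestamp(JsonNode node) {
        if (node.isNumber()) {
            return Instant.ofEpochMilli(node.asLong()); // epoch ms is UTC by definition
        }
        return Instant.parse(node.asText()); // e.g. "2018-04-17T09:02:40.123Z"
    }
}
```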

@roncohen (Contributor) commented:

I think it's important to differentiate between whether generating ISO dates is an inherently costly operation or whether the Java formatter is inefficient. I'd like to avoid making changes to the API to work around performance problems in different languages' standard libraries. If it turns out more agents have the same problem, then it's another story.

By the way, this is about formatting, not parsing, as that happens on the APM Server, right? @felixbarny

@watson (Contributor) commented Apr 18, 2018

For reference, here's the difference in speed for generating an ISO string vs. getting the current UTC time in milliseconds in Node.js/V8:

new Date().toISOString() x 1,391,237 ops/sec ±1.58% (80 runs sampled)
Date.now() x 8,644,451 ops/sec ±1.19% (82 runs sampled)
Fastest is Date.now()

The test was run on my slow laptop. While it looks like ms is around 6 times faster than ISO, we have to remember that I'm still able to generate over a million ISO strings per second, so it's not worth optimizing for my sake.
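
(For comparison, a rough Java analogue of the same measurement — a naive loop rather than a proper harness like JMH, so treat the numbers as indicative only:)

```java
import java.time.Instant;

public class TimestampBench {
    public static void main(String[] args) {
        long sink = 0;
        final int iterations = 1_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += Instant.now().toString().length(); // allocates a String per call
        }
        System.out.printf("ISO string: %d ms%n", (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.currentTimeMillis(); // no allocation
        }
        System.out.printf("epoch ms:   %d ms%n", (System.nanoTime() - start) / 1_000_000);

        System.out.println("(ignore: " + sink + ")"); // keep the JIT from eliding the loops
    }
}
```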

@felixbarny (Member, Author) commented:

> I think it's important to differentiate between whether generating ISO dates is an inherently costly operation or whether the Java formatter is inefficient.

Generating an ISO date string is inherently more costly than just using an epoch timestamp, in probably any programming language: producing a raw timestamp requires no heap allocations, in contrast to converting that timestamp to an ISO string representation. Of course, it is debatable whether it actually makes a difference. But just looking at the average latency values does not tell the whole picture:

The problem is that when we are allocating objects, we put additional pressure on the GC which means that the GC will run more frequently with potentially longer pauses. That, in turn, means we don't have a predictable overhead of X ms per request. Due to the added GC pressure, the user requests are more frequently affected by GC. You typically don't see this in average response time values but in the higher percentiles like the 99th or 99.99th percentile latency. If we were just monitoring a batch job where only throughput is important, this would not matter. But latency is very important for a typical web application, as we don't want a portion of users having to wait for a much longer time when the application is monitored by our agents.

That's the reason why I'm constantly arguing for having as few allocations as possible. It's not about having a low average overhead, but rather about having a predictable overhead (even if the average were slightly higher) which does not greatly affect the higher latency percentiles.

@watson (Contributor) commented Apr 20, 2018

Just a few findings that might be relevant for this discussion:

I was just reading up on the User Timing API in browsers as we use some of those timings as the source for marks in the intake API.

As far as I understand it, the User Timing API is based on the Performance Timeline spec by W3C (which actually also is implemented as an experimental API in Node.js).

According to the W3C spec, a mark is an instance of PerformanceEntry, which in turn uses DOMHighResTimeStamp to store both absolute time and time deltas (startTime and duration respectively).

> The DOMHighResTimeStamp type is a double and is used to store a time value. The unit is milliseconds and should be accurate to 5 µs (microseconds).

So here they did decide on using a number to store the timestamp and not a string. Just thought that was interesting.

@felixbarny (Member, Author) commented:

After merging elastic/apm-agent-java#69, this is basically the only source of object allocations during JSON serialization, and there is no way around allocations when formatting an ISO date string.

@felixbarny (Member, Author) commented May 1, 2018

I have now created a dashboard with the benchmark results:
[screenshot: benchmark dashboard comparing allocation rates and throughput before and after the change]

Note the significant drop in allocations (from 473 KB/op to 82 KB/op) when serializing timestamps as an epoch timestamp. The ops/s (score) also increases.

@beniwohli (Contributor) commented:

@felixbarny I'm not doubting that you see a significant impact in an artificial benchmark. What I'm wondering is whether this impact carries over to a more real-world scenario. If there is a measurable impact when running the agent in an app that actually does something, then it might be worth changing the API over it. But if it vanishes, I'm not so sure anymore.

I hope you don't take my hesitance over this API change the wrong way. I mostly want to avoid softening up the API spec for micro-optimizations in this or that agent, as any change will be set more or less in stone.

@axw (Member) commented May 1, 2018

FWIW, I'm not a fan of the datetime string-formatting either, for the same reasons. The only benefit to string formatting I can see is that a human can more easily read the value, but I don't think that's a strong enough reason to impose additional processing/memory/GC overhead.

Having said that, this is not a significant issue in Go, since there's a garbage-free way of formatting timestamps: time.Time.AppendFormat. I'd suggest writing a specialised/optimised formatter for Java, which deals only with the specific format we care about.

If we introduce a binary format, let's please make the timestamp an integer.

@felixbarny (Member, Author) commented:

> If there is a measurable impact when running the agent in an app that actually does something, then it might be worth changing the API over it. But if it vanishes, I'm not so sure anymore.

Sure, if the agent is deployed in an application which is rather wasteful in terms of memory, you probably won't notice the difference. But we don't have control over the applications the agent is used in, and I'm sure our agents will also be used in well-optimized applications. All we have control over is how much overhead the agent adds to the application per request (which is what the benchmark is measuring: just the bare overhead of the agent), and I think it's worth aiming for the lowest overhead possible. For most languages, it is non-trivial to create an ISO-formatted timestamp in a performant and garbage-free way, whereas just writing an epoch timestamp is the simplest thing to do.

> I hope you don't take my hesitance over this API change the wrong way.

I really don't - I appreciate your feedback and I know that API changes are not to be taken lightly. But I think that an epoch timestamp is a better choice in API design anyway and that we would only gain from this change.

@felixbarny (Member, Author) commented:

With elastic/apm-agent-java#77, I wrote a custom formatter which can serialize ISO timestamps in a garbage-free and performant way (sketched below).
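
(A condensed sketch of what such a garbage-free formatter can look like, simplified relative to the actual PR: append the digits of a UTC epoch-ms timestamp directly into a reused StringBuilder, so no intermediate String, Date, or Calendar objects are created:)

```java
public class IsoTimestampWriter {

    private static final long MILLIS_PER_DAY = 24 * 60 * 60 * 1000L;

    // Appends "yyyy-MM-ddTHH:mm:ss.SSSZ" for a UTC epoch-ms timestamp into a
    // caller-provided (and typically reused) StringBuilder.
    static void appendIsoTimestamp(StringBuilder sb, long epochMillis) {
        appendDate(sb, epochMillis / MILLIS_PER_DAY);
        sb.append('T');
        long millisOfDay = epochMillis % MILLIS_PER_DAY;
        appendPadded(sb, millisOfDay / 3_600_000, 2);     // hours
        sb.append(':');
        appendPadded(sb, (millisOfDay / 60_000) % 60, 2); // minutes
        sb.append(':');
        appendPadded(sb, (millisOfDay / 1000) % 60, 2);   // seconds
        sb.append('.');
        appendPadded(sb, millisOfDay % 1000, 3);          // milliseconds
        sb.append('Z');
    }

    // Civil-date-from-days algorithm (Howard Hinnant); valid for timestamps >= epoch.
    private static void appendDate(StringBuilder sb, long daysSinceEpoch) {
        long z = daysSinceEpoch + 719_468;
        long era = z / 146_097;
        long doe = z - era * 146_097;
        long yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
        long doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
        long mp = (5 * doy + 2) / 153;
        long day = doy - (153 * mp + 2) / 5 + 1;
        long month = mp + (mp < 10 ? 3 : -9);
        long year = yoe + era * 400 + (month <= 2 ? 1 : 0);
        appendPadded(sb, year, 4);
        sb.append('-');
        appendPadded(sb, month, 2);
        sb.append('-');
        appendPadded(sb, day, 2);
    }

    // Zero-pads and appends a non-negative value; StringBuilder.append(long)
    // writes the digits into the buffer without creating a String.
    private static void appendPadded(StringBuilder sb, long value, int digits) {
        long scale = 1;
        for (int i = 1; i < digits; i++) {
            scale *= 10;
        }
        for (; scale > 1 && value < scale; scale /= 10) {
            sb.append('0');
        }
        sb.append(value);
    }
}
```

The StringBuilder can be reset and reused across serializations, keeping the whole formatting path allocation-free.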

So I'm OK with leaving the timestamp as is, but would still suggest using epoch timestamps in future APIs like intake v2 or the metrics API.

@roncohen (Contributor) commented Jul 2, 2018

I'm happy you found a way to work around it @felixbarny. I'm going to close this now because it's no longer a problem in the agents. If you run into problems again related to this, feel free to reopen.

@roncohen roncohen closed this as completed Jul 2, 2018
@felixbarny (Member, Author) commented:

To follow up, this has now been implemented via #1413.
