
OutOfMemoryError and NegativeArraySizeException in DiskBasedCache.streamToBytes #12

Closed
joebowbeer opened this issue Feb 14, 2017 · 13 comments

@joebowbeer (Contributor)

DiskBasedCache may fail repeatedly if a cached header is corrupt. The most common failure is an OutOfMemoryError thrown from streamToBytes, caused by attempting to allocate a huge byte array whose bogus length was read from the corrupt header.

Once a header has been corrupted, cache initialization will continue to fail, and the errant header will not be removed.

A less common failure was NegativeArraySizeException, but that was recently patched.

DiskBasedCache needs to be updated to guard against corruption.

Related to:

https://code.google.com/p/android/issues/detail?id=230219
mcxiaoke/android-volley#141 (mirror)
http://stackoverflow.com/a/42196956/901597

joebowbeer commented Feb 20, 2017

More about this bug:

Almost all of the OOME failures happen while reading the string-string map of headers during initialize.

Below is a typical stack trace, from https://code.google.com/p/android/issues/detail?id=230219

Exception java.lang.OutOfMemoryError:
  com.android.volley.toolbox.DiskBasedCache.streamToBytes (DiskBasedCache.java:323)
  com.android.volley.toolbox.DiskBasedCache.readString (DiskBasedCache.java:541)
  com.android.volley.toolbox.DiskBasedCache.readStringStringMap (DiskBasedCache.java:563)
  com.android.volley.toolbox.DiskBasedCache$CacheHeader.readHeader (DiskBasedCache.java:404)
  com.android.volley.toolbox.DiskBasedCache.initialize (DiskBasedCache.java:157)
  com.android.volley.CacheDispatcher.run (CacheDispatcher.java:84)

There are also a few rare instances where OOME is thrown from this same location in DiskBasedCache, but on behalf of a get request. I assume these are thrown when an app writes a bad entry to the cache then later tries to retrieve the entry from the cache during the same execution, before the cache has been re-initialized.

A bad entry may also manifest as a NegativeArraySizeException (NASE) when trying to get the entry. These exceptions are relatively rare and occur after the header map has been read. They result when the excessively large number of bytes already read is subtracted from the file length and the difference is used as an array size; if that value is negative, NASE is thrown. (Theoretically, it's possible for a bad entry to go undetected.)
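
To make that arithmetic concrete, a minimal sketch (illustrative numbers and names, not Volley's exact code):

public class NaseDemo {
    public static void main(String[] args) {
        long fileLength = 1024;                         // actual size of the cache file on disk
        long bytesRead = 120862517;                     // bogus running total after a corrupt length
        int remaining = (int) (fileLength - bytesRead); // negative
        byte[] data = new byte[remaining];              // throws NegativeArraySizeException
        System.out.println(data.length);                // never reached
    }
}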

A nasty property of these exceptions is that they persist. Once the problem occurs, it will occur every time the cache is initialized. This is why some additional catch-all exception handling is needed: preferably per entry, serving to remove the errant entry if an exception is thrown when reading or writing it. Otherwise, something else is needed in order to prevent these bad cache entries from crashing the app. One developer recommends the use of an uncaught exception handler that will clean the entire cache in the event of an uncaught OOME or NASE. See https://github.com/anyercastillo/volley-crash
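
For reference, here is a minimal sketch of that last-resort idea (my own illustration, not the code in the linked repo; the cache directory location is an assumption):

import java.io.File;

// Wraps the default handler and wipes the Volley cache directory when one of
// the known corruption exceptions escapes, so the next launch starts clean.
public class CacheWipingHandler implements Thread.UncaughtExceptionHandler {
    private final File cacheDir; // e.g. new File(context.getCacheDir(), "volley") -- assumption
    private final Thread.UncaughtExceptionHandler delegate;

    public CacheWipingHandler(File cacheDir, Thread.UncaughtExceptionHandler delegate) {
        this.cacheDir = cacheDir;
        this.delegate = delegate;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof OutOfMemoryError || e instanceof NegativeArraySizeException) {
            File[] files = cacheDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    f.delete(); // remove every entry, including the corrupt one
                }
            }
        }
        if (delegate != null) {
            delegate.uncaughtException(t, e); // still crash and report as before
        }
    }
}

// Installed once, e.g. in Application.onCreate():
// Thread.setDefaultUncaughtExceptionHandler(
//         new CacheWipingHandler(cacheDir, Thread.getDefaultUncaughtExceptionHandler()));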

Let's go back to the original bug report in 2013:

https://code.google.com/p/android/issues/detail?id=61406

The reporter attached a bad cache entry to the bug report.

The problem occurs after the "Content-Length: 2574" header is serialized.

This string pair should be serialized as:

0x0e 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Content-Length
0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 2574
0x07 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Expires // beginning of next pair

But in this case it was serialized as:

0x0e 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Content-Length
0x04 0x00 0x00 0x00 0x00 0x00 0x00 0x00 2574
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x32 0x35 0x37 0x34 // .......2574
0x07 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Expires // beginning of next pair

Here's what happens after the "Content-Length" and "2574" strings are read:

  1. The next 8 bytes are read: 0x3200000000000000L.
  2. This long is cast to an int to be used as the byte-length for the next encoded string key.
  3. (int)0x3200000000000000 is 0, so no bytes are read, resulting in a bogus "" key.
  4. The next 8 bytes are read: 0x0000000007343735L.
  5. This long is cast to int to be used as the byte-length for the next encoded string value.
  6. (int)0x0000000007343735 is about 120 million, triggering an allocation attempt that fails.
  7. OOME!
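
A small self-contained sketch replaying those two reads (my illustration; readLong mirrors the little-endian layout in the hex dump, but this is not Volley's code):

import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class CorruptLengthDemo {

    // Little-endian 8-byte read, matching the layout in the hex dump above.
    static long readLong(InputStream is) throws IOException {
        long n = 0;
        for (int i = 0; i < 8; i++) {
            int b = is.read();
            if (b == -1) throw new EOFException();
            n |= ((long) b) << (i * 8);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // Step 1: the stray bytes 0x00 ... 0x32 are read as the next key length.
        long keyLen = readLong(new ByteArrayInputStream(
                new byte[] {0, 0, 0, 0, 0, 0, 0, 0x32}));
        System.out.printf("key length %#x -> (int) %d%n", keyLen, (int) keyLen); // 0

        // Step 4: the bytes 0x35 0x37 0x34 0x07 0x00 ... are read as the value length.
        long valueLen = readLong(new ByteArrayInputStream(
                new byte[] {0x35, 0x37, 0x34, 0x07, 0, 0, 0, 0}));
        System.out.printf("value length %#x -> (int) %d%n", valueLen, (int) valueLen); // 120862517

        // new byte[(int) valueLen] would attempt a ~120 MB allocation -> OOME.
    }
}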

Open questions:

Why are the string byte array lengths serialized as longs?
How are these bad headers written in the first place?

jpd236 commented Feb 22, 2017

That report is quite old, and seems like it might have been addressed by https://android-review.googlesource.com/#/c/99201/. Are there more recent reports of this happening?

joebowbeer commented Feb 22, 2017

The change you refer to was a catch-all on the write-header path. It seems to have had no impact on the thousands of OOME crashes that are happening now.

The first stack trace I referenced in my previous comment is very new, but is essentially identical to the report from 4 years ago.

jpd236 commented Feb 23, 2017

OK, fair - but I think that the example entry above might have been affected by that bug. Hard to tell given how old it is.

I can't help but wonder, from reading things like http://stackoverflow.com/questions/24095909/volley-out-of-memory-error-weird-allocation-attempt/42196956#42196956, whether this might be due to "misuse" of Volley/DiskBasedCache rather than corruption happening out of the control of the app. There are many apps that use Volley and have never seen this issue before, whereas the ones that do see it see it frequently, from what I'm gathering. That does seem to suggest a usage pattern that might be causing the problem.

It is certainly the case that DiskBasedCache is not, and does not purport to be, thread safe (or process safe). If you're trying to write to the same directory location from multiple threads or processes, that could absolutely cause bad things to happen.

Would be good to get more info... something like an example app w/ source that exhibits this problem.

But it's at least true that the cache can be cleared by the user at any time, which is something the app doesn't control, and guarding against at least that is reasonable. I just want to have some confidence that we're hitting the real issue here and not papering over a more fundamental cause than random disk corruption. (100 times per day for an app does not sound like disk corruption, at all).

jpd236 commented Feb 23, 2017

I wonder whether a combination of a FileLock and some sort of thread lock on the cache directory could root out that kind of thing in a more obvious manner.

Note that Volley only accesses the cache in a single thread (CacheDispatcher), so it doesn't have to worry about thread safety. But if you have multiple request queues, or multiple processes, as a few of these reports seem to suggest were the case, then you'll have problems.

@joebowbeer (Contributor, Author)

> OK, fair - but I think that the example entry above might have been affected by that bug. Hard to tell given how old it is.

Given that the stack traces I've seen are all similar, and that no code has been added that I believe really addresses this problem, I assume the old repro case still applies.

> I can't help but wonder, from reading things like http://stackoverflow.com/questions/24095909/volley-out-of-memory-error-weird-allocation-attempt/42196956#42196956, whether this might be due to "misuse" of Volley/DiskBasedCache rather than corruption happening out of the control of the app. There are many apps that use Volley and have never seen this issue before, whereas the ones that do see it see it frequently, from what I'm gathering. That does seem to suggest a usage pattern that might be causing the problem.

Once the problem occurs, it recurs every time the cache is initialized. This tends to amplify the problem in the apps where the problem occurs.

I saw the suspicion about misuse and this occurred to me as well, but I can't find a concrete case.

I think a usage pattern or data pattern is more likely. Perhaps there is a flaw in the DiskBasedCache implementation that is only hit by some apps? For example, the mapping from key to file name is pretty poor. What if two keys map to the same file? Currently, DiskBasedCache has barely any unit tests, and testing for the proper handling of key collision is not one of them.
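
For illustration, such a test might look like the sketch below (JUnit 4; it exploits the fact that "Aa" and "BB" have equal String.hashCode() values, so "AaAa" and "BBBB" collide under the half-and-half hashCode file naming; the asserted behavior is my assumption of what correct handling would be):

import static org.junit.Assert.assertArrayEquals;
import static org.junit.Assert.assertNotNull;

import com.android.volley.Cache;
import com.android.volley.toolbox.DiskBasedCache;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class DiskBasedCacheCollisionTest {
    @Rule public TemporaryFolder temp = new TemporaryFolder();

    @Test
    public void collidingKeysDoNotReturnEachOthersData() throws Exception {
        DiskBasedCache cache = new DiskBasedCache(temp.getRoot());
        cache.initialize();

        Cache.Entry first = new Cache.Entry();
        first.data = new byte[] {1, 2, 3};
        Cache.Entry second = new Cache.Entry();
        second.data = new byte[] {4, 5, 6};

        cache.put("AaAa", first);
        cache.put("BBBB", second); // same backing file as "AaAa"

        // At minimum, the most recent write must read back intact.
        Cache.Entry got = cache.get("BBBB");
        assertNotNull(got);
        assertArrayEquals(new byte[] {4, 5, 6}, got.data);
    }
}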

> It is certainly the case that DiskBasedCache is not, and does not purport to be, thread safe (or process safe). If you're trying to write to the same directory location from multiple threads or processes, that could absolutely cause bad things to happen.

Every overridden method is synchronized, so I think DBC does intend to be thread safe, and I don't see any obvious flaws in that aspect of its implementation. I have no evidence that any of the apps encountering this problem are using multiple processes.

jpd236 commented Feb 23, 2017

> I saw the suspicion about misuse and this occurred to me as well, but I can't find a concrete case... I have no evidence that any of the apps encountering this problem are using multiple processes.

http://stackoverflow.com/a/38106070 "The main application process and a 'SyncAdapter' process following the sync adapter pattern. Only the app with two processes would crash. These two processes would independently initialize the cache."

mcxiaoke/android-volley#61 "you can check your reqeustQueue is Singleton, I has solved the problem that is OOM caused by context leak with every context init a requestQueue."

> For example, the mapping from key to file name is pretty poor. What if two keys map to the same file? Currently, DiskBasedCache has barely any unit tests, and testing for the proper handling of key collision is not one of them.

Even if that were to happen, I don't see how it would lead to corruption in this manner, as opposed to an entry in the cache that had unexpected (but well-formed) data, given the locks you pointed out.

> Every overridden method is synchronized, so I think DBC does intend to be thread safe, and I don't see any obvious flaws in that aspect of its implementation. I have no evidence that any of the apps encountering this problem are using multiple processes.

You're right - sorry; it is indeed the case that multiple NetworkDispatcher threads as well as the CacheDispatcher thread are accessing the cache in parallel. However, the only ensured thread safety is for a single Cache instance. If you're creating more than one instance in a process, or more than one instance across multiple processes, and those both point to the same location on disk, then you'll run into issues like this, and that's exactly what both the cases noted above were doing.

@joebowbeer (Contributor, Author)

It is not clear that the two comments referred to above had any relation to the reporter's problem. In the first case, another answer was marked as correct.

In my experience with this bug, there are not and never were AFAICT multiple processes. However, I agree that multiple processes is one of the easiest root causes to imagine.

Other causes I can imagine:

  1. Incomplete cleanup (e.g., on update or replacement or trim). There are places in the code where the file is deleted before the stream is closed. These look suspicious and I recommend they be fixed (see the snippet after this list).

  2. A bad output buffer implementation, such as a poorly implemented cache in the buffer. Adding unit tests might reveal a problem here.

  3. Interference from another crash? If the app crashes or is terminated for some other reason while an entry is being written, can this leave a bad entry on disk? More rigorous exception handling and recovery may help.
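
Regarding item 1, the suspicious ordering looks roughly like this (simplified, not the exact Volley code):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class DeleteOrdering {
    // Suspicious: the file is deleted while a stream still holds it open.
    static void deleteThenClose(File file) throws IOException {
        FileInputStream fis = new FileInputStream(file);
        try {
            // ... reading fails partway through ...
            file.delete(); // attempted while fis is still open
        } finally {
            fis.close();
        }
    }

    // Safer: close the stream first, then delete, and check the result.
    static void closeThenDelete(File file) throws IOException {
        FileInputStream fis = new FileInputStream(file);
        try {
            // ... reading fails partway through ...
        } finally {
            fis.close();
        }
        if (!file.delete()) {
            // A stale entry left on disk will crash every subsequent initialize().
        }
    }
}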

Once the problem has occurred, by means that remain a mystery after 4 years, the exception that results is surprisingly stable. It manifests either as a string byte array size that is too large, or in rarer cases as a negative remaining data size (now fixed).

Focusing on the location of the exception itself, there is one small change that would help rid the cache of bad entries. These lines in the readString method:

int n = (int) readLong(is);
byte[] b = streamToBytes(is, n);

Should be:

long n = readLong(is);
if (n < 0 || n > fileLength) {
  throw new IOException("string length=" + n + " file length=" + fileLength);
}
byte[] b = streamToBytes(is, (int) n);

Every time a header is read or an entry is fetched from cache, file.length() can serve as an effective upper bound for the serialized lengths. A little refactoring is needed in order to utilize it.
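
A sketch of what that refactoring might look like (helper names mirror DiskBasedCache's private methods, but this is only an illustration, not the actual patch):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BoundedReads {

    // Little-endian 8-byte read, as in the on-disk format shown earlier.
    static long readLong(InputStream is) throws IOException {
        long n = 0;
        for (int i = 0; i < 8; i++) {
            int b = is.read();
            if (b == -1) throw new EOFException();
            n |= ((long) b) << (i * 8);
        }
        return n;
    }

    static byte[] streamToBytes(InputStream is, int length) throws IOException {
        byte[] bytes = new byte[length];
        int pos = 0;
        while (pos < length) {
            int count = is.read(bytes, pos, length - pos);
            if (count == -1) throw new IOException("unexpected EOF");
            pos += count;
        }
        return bytes;
    }

    // The cache file's length bounds every serialized string, so a corrupt
    // length is rejected with a recoverable IOException instead of OOME/NASE.
    static String readString(InputStream is, long fileLength) throws IOException {
        long n = readLong(is);
        if (n < 0 || n > fileLength) {
            throw new IOException("string length=" + n + " file length=" + fileLength);
        }
        return new String(streamToBytes(is, (int) n), "UTF-8");
    }
}

Passing fileLength down from get() and initialize(), where the File object is available, is the small refactoring referred to above.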

jpd236 commented Feb 23, 2017

Multiprocess might be rare, but multithread certainly is not, and I could definitely see bugs where multiple request queues are being created.

Agreed that delete() before close() doesn't seem right and should probably be fixed. The other possibilities feel unlikely to me.

I think I'm fine with adding a length check there as a sanity measure. And of course, I'm fine with more unit tests.

@joebowbeer (Contributor, Author)

In the process of adding more tests to DiskBasedCacheTest, I observe that the NegativeArraySizeException (NASE) handling in the get() method is not actually removing the entry. Apparently this is because the finally handler closes the file input stream only after an attempt has been made to delete the underlying file. If the failure in get() does not actually remove the file, a subsequent initialize() will also fail, crashing the app, because there is no handler for NASE in the initialize() method. It appears that neither NASE nor the more prevalent OOME is handled in either get() or initialize().

@joebowbeer joebowbeer changed the title OutOfMemoryError in DiskBasedCache.streamToBytes OutOfMemoryError and NegativeArraySizeException in DiskBasedCache.streamToBytes Jul 25, 2017
@joebowbeer (Contributor, Author)

Fixed by PR #52

tbohnen commented Aug 1, 2017

Hi, sorry to pollute the thread, but I can't see anywhere else to resolve this. I can't seem to find any documentation on how to actually get the update. I have 1.0.0 and hit the error below:

DiskBasedCache.java line 323 in DiskBasedCache.streamToBytes()
DiskBasedCache.java line 119 in DiskBasedCache.get()
CacheDispatcher.java line 100 in CacheDispatcher.run()

jpd236 commented Aug 1, 2017

Please see #11 (comment) for how to use a snapshot build if you'd like to pick up this fix. We haven't done an official release yet as there are a few more fixes I'd like to get in before we do.
