
Chunk file disappears from cache during sync #24653

Closed
Marginal opened this issue May 16, 2016 · 39 comments · Fixed by #24691

Comments

@Marginal

Marginal commented May 16, 2016

Syncing large files with the 2.1.1 (build 3107) client on OSX against ownCloud 9.0.2 over a slowish 384KB/s link, I'm getting an "expected filesize 5242880 got 0" exception on about 5% of the chunk uploads. I've added some additional logging to OC\Files\View::file_put_contents() (see owncloud.log below), and it appears that Apache is receiving the file chunk contents OK and ownCloud is writing the contents to the filesystem cache OK, but the chunk files are occasionally disappearing from the filesystem before OC_FileChunking::store() returns.

Any pointers on how to debug this gratefully received. On the assumption that the underlying filesystem isn't randomly losing files, my working hypothesis is that the cache is occasionally clearing up chunk files too aggressively while they're still being written.
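For reference, the extra logging that produced the "file_put_contents" entries in the owncloud.log excerpt below looks roughly like this (a sketch of my patch inside OC\Files\View::file_put_contents(), placed right after the write; the exact code may differ):

// $data is the incoming stream, $result the byte count the write reported,
// $path the cache/....part target.
\OCP\Util::writeLog(
    'file_put_contents',
    sprintf(
        'Wrote %d bytes from %s to %s: exists: %d, readable: %d, size: %d',
        $result,
        (string)$data,                   // renders as "Resource id #NNN" for a stream
        $path,
        (int)$this->file_exists($path), // 0 when the chunk has already vanished again
        (int)$this->isReadable($path),
        (int)$this->filesize($path)
    ),
    \OCP\Util::DEBUG
);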

Also, I'm aware of #9832 but that issue seems a mess of various problems, and of #11778 but this seems to be a different problem. Could be related to owncloud/client#4421.

Steps to reproduce

  1. Upload 6 100MB files with the ownCloud sync client over a slowish link.

Expected behaviour

Files sync OK.

Actual behaviour

Client reports "Error download https://xxx/xxx-chunking-xxx-y-z - server replied: Bad Request (expected filesize 5242880 got 0)"

Server configuration

Operating system: Linux

Web server: Apache 2.4.16

Database: MariaDB 10.1.14

PHP version: 5.5.32

ownCloud version: 9.0.2.2

Updated from an older ownCloud or fresh install: Updated incrementally from 5.0.7

Where did you install ownCloud from: https://download.owncloud.org/community/owncloud-9.0.2.zip

Signing status (ownCloud 9.0 and above):

No errors have been found.

List of activated apps:

Files
Gallery
Contacts
Calendar

The content of config/config.php:

<?php
$CONFIG = array (
  'debug' => false,
  'instanceid' => '51b4613a9b05a',
  'passwordsalt' => 'Redacted',
  'datadirectory' => '/home/marginal/owncloud-data',
  'dbtype' => 'mysql',
  'dbdriveroptions' => 
  array (
    1002 => 'SET wait_timeout = 3600',
  ),
  'version' => '9.0.2.2',
  'dbname' => 'marginal_owncloud',
  'dbhost' => 'localhost',
  'dbtableprefix' => 'oc_',
  'dbuser' => 'Redacted',
  'dbpassword' => 'Redacted',
  'installed' => true,
  'forcessl' => true,
  'logfile' => '/home/marginal/owncloud.log',
  'loglevel' => 0,
  'log_rotate_size' => 67108864,
  'maintenance' => false,
  'theme' => '',
  'trusted_domains' => 
  array (
    0 => 'Redacted',
  ),
  'secret' => 'Redacted',
  'share_folder' => '/Shared',
  'forceSSLforSubdomains' => true,
  'trashbin_retention_obligation' => 'auto',
);

Are you using external storage, if yes which one: Local

Are you using encryption: No

Are you using an external user-backend, if yes which one: None

Client configuration

Browser: Firefox 46.0.1

Operating system: OSX 10.10.5

Logs

Web server access log

x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-2 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:08 +0200] "GET /ocs/v1.php/cloud/activity?page=0&pagesize=100&format=json HTTP/1.1" 200 242 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-0 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:30:19 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-1 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:20 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:29 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:08 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-3 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:52 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:09 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-4 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:02 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:39 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-6 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:27 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 374 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:04 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-7 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:34 +0200] "PROPFIND /remote.php/webdav/ HTTP/1.1" 207 416 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:31:09 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-5 HTTP/1.1" 201 - "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"
x.x.x.x - - [17/May/2016:00:32:06 +0200] "PUT /remote.php/webdav/Test/100M-2-chunking-1949736465-20-8 HTTP/1.1" 400 225 "-" "Mozilla/5.0 (Macintosh) mirall/2.1.1"

ownCloud log (data/owncloud.log)

{"reqId":"VzpJ@23JklEAADcNxJsAAAAG","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-2.YQIY9G9r9RWnP7YS.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:06+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-2","user":"xxx"}
{"reqId":"VzpJ@23JklEAADVGWzcAAAAM","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-0.5MiXMy0jYPhWsfQG.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:07+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-0","user":"xxx"}
{"reqId":"VzpJ@23JklEAADRc0bsAAAAD","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-1.JxhlUUjsdiqlImLV.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:07+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-1","user":"xxx"}
{"reqId":"VzpKLG3JklEAADpXFl4AAAAI","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-3.9caCvTHfJ9mRqU9Z.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:31:37+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-3","user":"xxx"}
{"reqId":"VzpKLW3JklEAADbZQCoAAAAC","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-4.g8dA06k5JAF1iUzp.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:00+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-4","user":"xxx"}
{"reqId":"VzpKS23JklEAADhbAwwAAAAJ","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-6.3YH71eRiLUiVjh9C.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:04+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-6","user":"xxx"}
{"reqId":"VzpKZG3JklEAADnhLgIAAAAB","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-7.bmRfooGP900enSbt.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:28+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-7","user":"xxx"}
{"reqId":"VzpKLW3JklEAADcNxJ4AAAAG","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-5.2a6iZC56JUVnROmJ.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-16T22:32:37+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-5","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"PHP","message":"filesize(): stat failed for \/home\/marginal\/owncloud-data\/xxx\/cache\/100M-2-chunking-1949736465-8.adjGuN5SCj567wO6.part at \/home\/marginal\/public_html\/lib\/private\/files\/storage\/local.php#142","level":3,"time":"2016-05-16T22:32:53+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-1949736465-8.adjGuN5SCj567wO6.part: exists: 0, readable: 0, size: 0","level":1,"time":"2016-05-16T22:32:53+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}
{"reqId":"VzpKZm3JklEAACfcAz8AAAAF","remoteAddr":"x.x.x.x","app":"webdav","message":"Exception: {\"Message\":\"HTTP\\\/1.1 400 expected filesize 5242880 got 0\",\"Exception\":\"Sabre\\\\DAV\\\\Exception\\\\BadRequest\",\"Code\":0,\"Trace\":\"#0 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php(103): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->createFileChunked(Resource id #320)\\n#1 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/directory.php(134): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->put(Resource id #320)\\n#2 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(1036): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\Directory->createFile('100M-2-chunking...', Resource id #320)\\n#3 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/CorePlugin.php(523): Sabre\\\\DAV\\\\Server->createFile('Test\\\/100M-2-chu...', Resource id #320, NULL)\\n#4 [internal function]: Sabre\\\\DAV\\\\CorePlugin->httpPut(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#5 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/event\\\/lib\\\/EventEmitterTrait.php(105): call_user_func_array(Array, Array)\\n#6 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(459): Sabre\\\\Event\\\\EventEmitter->emit('method:PUT', Array)\\n#7 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(248): Sabre\\\\DAV\\\\Server->invokeMethod(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#8 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/appinfo\\\/v1\\\/webdav.php(55): Sabre\\\\DAV\\\\Server->exec()\\n#9 \\\/home\\\/marginal\\\/public_html\\\/remote.php(138): require_once('\\\/home\\\/marginal\\\/...')\\n#10 {main}\",\"File\":\"\\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php\",\"Line\":406,\"User\":\"xxx\"}","level":4,"time":"2016-05-16T22:32:54+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-1949736465-20-8","user":"xxx"}

Browser log

n/a

@PVince81
Contributor

clearing up chunk files too aggressively while they're still being written

I'd also suspect that.
Are you able to do an ls -l of the chunk file before it disappears?
The way cleanup works is that it sets the mtime of the chunk 24 hours into the future.

The cleanup job is then triggered at login time (login hook) and discards all chunks whose mtime is < now. Maybe the timestamp/timezone of the server is causing the mtime to be wrong?
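To illustrate, the lifecycle is roughly this (simplified sketch, not the actual OC\Cache\File code):

// $chunkPath and $cacheDir stand in for the real paths under data/<user>/cache.
$ttl = 24 * 60 * 60;

// store: write the chunk, then push its mtime 24 hours into the future
file_put_contents($chunkPath, $data);
touch($chunkPath, time() + $ttl);

// gc (login hook): discard anything whose mtime is already in the past
foreach (scandir($cacheDir) as $file) {
    $path = $cacheDir . '/' . $file;
    if (is_file($path) && filemtime($path) < time()) {
        unlink($path);
    }
}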

@Marginal
Author

Marginal commented May 17, 2016

Are you able to do an ls -l of the chunk file before it disappears?

Yes - the offending chunk files are definitely created on the filesystem, and are visible to an sftp client (I'm running on shared hosting).

Then cleanup job is triggered at login time (login hook) and would discard all chunks which mtime is < now.

I'm looking into this now. While the chunk file is being written to, the mtime ~= now. The mtime is only set 24hrs into the future after the file handle has been closed, in OC\Cache\File::set.

@PVince81
Contributor

The mtime is only set 24hrs into the future after the file handle has been closed, in OC\Cache\File::set.

Hmmmm, indeed. But then there shouldn't be a huge time window between the time the file is written and the moment the touch operation is done.
Unless file_put_contents is slow and already creates the file entry while it's writing it?
Chunks are usually 20 MB big, and if the transfer is slow...

Also the code uses a .part file while it's being written. But the gc code doesn't care about the extension. This is correct because we might want to clear stray part chunks too.

Were you able to observe the part file during the time it is being written, i.e. does it exist at that point?

You might be onto something, keep posting your findings 😄
Will do some local testing with a debugger, this looks fishy.

@PVince81
Contributor

I did some debugging and found out that this file_put_contents call from chunking actually calls streamCopy in a loop, because the chunk content is a file handle, not raw data.
When checking the timestamp of the file on disk I noticed that the mtime of the file changed on every fwrite. So it's a game of cat and mouse until the GC comes to eat the file, if its mtime happens to be older than now.
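So the write path effectively looks like this (simplified sketch; the real code goes through the storage layer and a streamCopy helper):

// The chunk "contents" arrive as a stream ($source), so the storage copies it
// block by block into the .part file ($partFile). Every fwrite() resets the
// file's mtime to "now", clobbering any value that was set into the future.
$target = fopen($partFile, 'w');
while (!feof($source)) {
    fwrite($target, fread($source, 8192)); // 8 KB at a time, paced by the upload speed
}
fclose($target);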

CC @RealRancor @guruz

@PVince81
Contributor

So basically there's a loop that copies the bytes 8 KB by 8 KB, and every fwrite adjusts the timestamp. If one of them takes longer and a login hook is triggered at the same time, it will kill the chunk too early.

PVince81 added this to the 9.0.3-current-maintenance milestone May 17, 2016
@PVince81
Contributor

@Marginal great debugging, thank you!

@PVince81
Contributor

A possible fix would be to not use touch but to do the actual time calculation in the gc method.
Basically, instead of setting the chunk's mtime into the future, we leave it alone and gc files where mtime < now - 24 hours.
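Something along these lines (sketch only, not a finished patch):

// store: just write the chunk, don't touch() its mtime
file_put_contents($chunkPath, $data);

// gc: compute the age itself and only discard chunks untouched for 24 hours
$ttl = 24 * 60 * 60;
foreach (scandir($cacheDir) as $file) {
    $path = $cacheDir . '/' . $file;
    if (is_file($path) && filemtime($path) < time() - $ttl) {
        unlink($path);
    }
}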

@Marginal
Author

Marginal commented May 17, 2016

Yep - I added some logging to OC\Cache\File::gc and I can see this happening:

{"reqId":"VzsdGG3JklEAAA0Z70YAAAAB","remoteAddr":"x.x.x.x","app":"OC\\Files\\View::file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-49666361-2.flUvxizYuEcTCGAx.part: exists: 1, readable: 1, size: 5242880","level":1,"time":"2016-05-17T13:31:42+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-2","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-49666361-4.ZzxwB0pcWur6mTEJ.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-3-chunking-1403625106-1.G5YnsFJG5LIfaI49.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdWm3JklEAAA5505IAAAAN","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part","level":1,"time":"2016-05-17T13:32:11+00:00","method":"PROPPATCH","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"PHP","message":"filesize(): stat failed for \/home\/marginal\/owncloud-data\/xxx\/cache\/100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part at \/home\/marginal\/public_html\/lib\/private\/files\/storage\/local.php#142","level":3,"time":"2016-05-17T13:32:13+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x,"app":"OC\\Files\\View::file_put_contents","message":"Wrote 5242880 bytes from Resource id #320 to cache\/100M-2-chunking-49666361-3.nZoppMlNyBs6FbOa.part: exists: 0, readable: 0, size: 0","level":1,"time":"2016-05-17T13:32:13+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"OC\\Files\\Cache\\Cache::remove","message":"Removing: cache\/100M-2-chunking-49666361-3, id: 66491","level":1,"time":"2016-05-17T13:32:14+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}
{"reqId":"VzsdNG3JklEAABPYdxEAAAAD","remoteAddr":"x.x.x.x","app":"webdav","message":"Exception: {\"Message\":\"HTTP\\\/1.1 400 expected filesize 5242880 got 0\",\"Exception\":\"Sabre\\\\DAV\\\\Exception\\\\BadRequest\",\"Code\":0,\"Trace\":\"#0 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php(103): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->createFileChunked(Resource id #320)\\n#1 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/directory.php(134): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\File->put(Resource id #320)\\n#2 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(1036): OCA\\\\DAV\\\\Connector\\\\Sabre\\\\Directory->createFile('100M-2-chunking...', Resource id #320)\\n#3 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/CorePlugin.php(523): Sabre\\\\DAV\\\\Server->createFile('Test\\\/100M-2-chu...', Resource id #320, NULL)\\n#4 [internal function]: Sabre\\\\DAV\\\\CorePlugin->httpPut(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#5 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/event\\\/lib\\\/EventEmitterTrait.php(105): call_user_func_array(Array, Array)\\n#6 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(459): Sabre\\\\Event\\\\EventEmitter->emit('method:PUT', Array)\\n#7 \\\/home\\\/marginal\\\/public_html\\\/3rdparty\\\/sabre\\\/dav\\\/lib\\\/DAV\\\/Server.php(248): Sabre\\\\DAV\\\\Server->invokeMethod(Object(Sabre\\\\HTTP\\\\Request), Object(Sabre\\\\HTTP\\\\Response))\\n#8 \\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/appinfo\\\/v1\\\/webdav.php(55): Sabre\\\\DAV\\\\Server->exec()\\n#9 \\\/home\\\/marginal\\\/public_html\\\/remote.php(138): require_once('\\\/home\\\/marginal\\\/...')\\n#10 {main}\",\"File\":\"\\\/home\\\/marginal\\\/public_html\\\/apps\\\/dav\\\/lib\\\/connector\\\/sabre\\\/file.php\",\"Line\":406,\"User\":\"xxx\"}","level":4,"time":"2016-05-17T13:32:14+00:00","method":"PUT","url":"\/remote.php\/webdav\/Test\/100M-2-chunking-49666361-20-3","user":"xxx"}

I guess the chunk file should be locked while being written?

@PVince81
Contributor

I just realized that the set function is expecting a ttl here: https://github.com/owncloud/core/blob/v9.0.2/lib/private/cache/file.php#L100
This must be why it was implemented this way. So the approach of moving the ttl away won't work.
However I'm not aware of any code path that uses that $ttl argument.

@Marginal let's examine your idea about locking.
It looks like the code that calls file_put_contents already acquires locks.
But the gc code doesn't, so let's see if we can add lock acquisition there as well.

@PVince81
Contributor

Hmmm, I went through with a debugger and found out that both file_put_contents from OC_FileChunking and unlink from gc acquire locks using the locking provider.

However, from what I remember, the locking provider was limited to working only on the "files/" entries, not the "cache/" entries. If we make it work on the latter as well, it should prevent such situations.

@Marginal
Author

I can see that OC\Files\View::file_put_contents is creating locks, but I'm not seeing any "Could not cleanup locked chunk" debug messages from OC\Cache\File::gc, so they're not being applied.

@PVince81
Contributor

@Marginal this is why: #24672
You might want to try this patch, but I'm not 100% convinced because locking will cause some performance overhead, especially with the database locking provider.

@PVince81
Contributor

PVince81 commented May 17, 2016

I'm not so happy about using locking here because in some environments the DB locking provider is likely to be much slower when uploading many chunks. For every chunk it needs to allocate locks and change them during the course of the file operation.

Also the locking approach isn't backportable to OC < 8.1 where locking wasn't available yet.

Let's list the possible approaches:

  1. Get rid of the $ttl argument in OC\Cache\File::set() and have the ttl value in the gc() method directly.
    • the good: no locking overhead, single ttl value for every cached file
    • the bad: this is private API, so breaking it might be fine, but still... backporting an API change isn't nice
  2. Enable locking for the "cache" folder (POC here: [POC] [stable9] Also lock chunks to prevent concurrent cleanup #24672)
    • the good: locking makes sure that there are no concurrency issues
    • the bad: more performance overhead especially with DB locking, not backportable to OC < 8.1
  3. Add an extra 24 hours to the ttl in the gc function's checks
    • the good: solves the issue with API change, it's backportable to OC < 8.1
    • the bad: the ttl contract isn't respected since a file might stay 24 hours longer than expected

@icewind1991 @schiesbn @nickvergessen @DeepDiver1975 any insights?

@Marginal
Author

Thanks, I'll try #24672.

IMHO your solution 1 seems cleanest - it's weird having mtimes in the future, and even weirder relying on this.

@PVince81
Contributor

@Marginal yep, I agree on the weirdness. This is very very old code so there might have been a reason in the past. I'll make a POC for solution 1 as well so you can try it on your magic env 😄

@PVince81
Contributor

PVince81 commented May 17, 2016

Oh noes 😦 :

It looks like this is part of the public API, which means there could be apps using this and expecting the $ttl value to work.

Another idea I had was to encode the expected mtime value into the file name, but that makes it difficult to find the file again when only the $key is known.

@PVince81
Contributor

Okay, yet another idea: if we assume that the mtime of a finished chunk is always correct but the one of a part chunk might be wrong, then add an extra 24 hours of tolerance for part chunks in the gc() method. This is especially relevant considering that part chunks cannot be found concurrently by get() as they aren't finished and have a unique id.
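Roughly like this (sketch only; the real change would go into OC\Cache\File::gc() and may well end up looking different):

$ttl = 24 * 60 * 60;
foreach (scandir($cacheDir) as $file) {
    $path = $cacheDir . '/' . $file;
    if (!is_file($path)) {
        continue;
    }
    $expire = filemtime($path);
    if (substr($file, -5) === '.part') {
        // a part chunk is still being written, so its mtime may be "now"
        // rather than now + ttl; grant it an extra 24 hours of grace
        $expire += $ttl;
    }
    if ($expire < time()) {
        unlink($path);
    }
}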

@Marginal
Author

It looks like this is part of the public API, which means there could be apps using this and expecting the $ttl value to work.

I guess that explains the "mtimes in the future" implementation.

if we assume that the mtime of a finished chunk is always correct but the one of a part chunk might be wrong, then add an extra 24 hours of tolerance for part chunks in the gc() method.

Kinda ugly to special-case .part files in gc, but if you don't want the overhead of locking the chunk files this might be the only practical solution.

@PVince81
Contributor

Steps to reproduce the original issue

  1. Edit https://github.com/owncloud/core/blob/v9.0.2/lib/private/cache/file.php#L116 and add sleep(5); before the touch() operation
  2. Open cadaver with the current user on the console (every file operation triggers a login due to basic auth)
  3. Set up the desktop client
  4. Upload a big file (~200 MB) with the desktop client
  5. While the upload is running, run ls several times in cadaver to trigger the login hook which will clean up the part file
  6. 💣 desktop client says the upload failed, log says "Expected filesize X got 0"

@PVince81
Contributor

I'd like @icewind1991's wisdom regarding locking overhead to make a decision between solution 2 (enable locking for "cache" folder) and solution 4 (extra expire time in GC for part files). For the backport to 8.0 which doesn't have the locking code, I'd go for 4 anyway.

@Marginal
Author

Marginal commented May 17, 2016

Testing your solution 2 #24672:

{"reqId":"Vzsp-m3JklEAAAbfqkUAAAAB","remoteAddr":"x.x.x.x","app":"OC\\Cache\\File::gc","message":"Discarding: 100M-2-chunking-453979753-18.DsYu8Yn3IYL09Jmp.part","level":1,"time":"2016-05-17T14:26:07+00:00","method":"PROPFIND","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}
{"reqId":"Vzsp-m3JklEAAAbfqkUAAAAB","remoteAddr":"x.x.x.x","app":"core","message":"Exception when running cache gc: $absolutePath must be relative to \"files\"","level":2,"time":"2016-05-17T14:26:07+00:00","method":"PROPFIND","url":"\/remote.php\/caldav\/calendars\/xxx\/","user":"xxx"}

The sync is still proceeding happily, though!

@PVince81
Contributor

Exception when running cache gc: $absolutePath must be relative to "files"

Okay, looks like I might have missed something.
Thanks for testing!

@Marginal
Author

Looks like there are assumptions that locked files only reside under /files in OC\Files\View::getPathRelativeToFiles and ::getUidAndFilename. Perhaps also elsewhere?

@PVince81
Contributor

@icewind1991 popped up on IRC and suggested we just subtract 1 minute in the gc code instead.

But this would assume that an fwrite of 8 KB never takes more than a minute, which is still a risky assumption.

@Marginal you're right, might need special handling for the cache folder 😦

@PVince81
Contributor

Simplified fix with extra time tolerance is here: #24676

@PVince81
Contributor

@Marginal if you say several minutes per chunk: one chunk is usually 20 MB I believe (or was it reduced to 10?). Several minutes, OK, let's say 8 minutes, which means 2.5 MB a minute, ~41 KB a second. So within that one second the fwrite method will be called more than 5 times, which means the mtime of the file will get updated 5 times within a second. Should be enough 😄

@Marginal
Author

Marginal commented May 17, 2016

The default chunk size is now 5 MB for the 2.1 client.

You're correct that the mtime is updated every 8 KB, so 1 hour of leeway should be fine.
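Back-of-the-envelope check with the numbers from this thread (assumed values, just to show the margin):

$chunkSize       = 5 * 1024 * 1024;         // 5 MB, the 2.1 client default
$writeSize       = 8 * 1024;                // the copy loop writes 8 KB per fwrite
$linkSpeed       = 41 * 1024;               // ~41 KB/s, the pessimistic rate above
$secondsPerWrite = $writeSize / $linkSpeed; // ~0.2 s between mtime refreshes
$secondsPerChunk = $chunkSize / $linkSpeed; // ~2 minutes per chunk on such a link
// The mtime is refreshed several times per second while the chunk is written,
// so 1 hour of grace in gc() leaves plenty of headroom.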

@PVince81
Contributor

🍸

Let me know if the new PR works for you and we can move it forward to the fixing pipeline / backports, etc.

@Marginal
Author

#24676 works for me - I uploaded 6 x 100MB files with no "expected filesize 5242880 got 0" errors.

@PVince81
Contributor

@Marginal thanks for testing!

master PR is here: #24691 (for 9.1)
I'll list the backport PRs in that one.

@guruz
Contributor

guruz commented May 18, 2016

Awesome to see this fixed!

@PVince81
Contributor

Yep. The reason we didn't find it earlier is that it's hidden behind a $storage->file_put_contents() call which, at first glance, should be atomic. However, due to the many layers of OC, that call is transformed into a loop that copies blocks, which isn't atomic.

@PVince81
Contributor

I'm going to link all other related tickets to this one for the solution.

For those of you arriving here from other tickets, please try this PR #24691 or one of the related backports #24691 (comment). Alternatively you can just wait for the next maintenance releases 😄

@PVince81
Contributor

@RealRancor if you see people on the forums with the error "Expected filesize X got 0", you might want to direct them to this solution. Thanks 😄

@enoch85
Member

enoch85 commented May 18, 2016

Finally! Thanks @Marginal and @PVince81 🎆 👍

@lock

lock bot commented Aug 4, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Aug 4, 2019