Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OC_Filecache is blowing Up!!!! #16834

Closed
kaoschris opened this issue Aug 22, 2019 · 23 comments
Closed

OC_Filecache is blowing Up!!!! #16834

kaoschris opened this issue Aug 22, 2019 · 23 comments
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap bug needs info stale Ticket or PR with no recent activity

Comments

@kaoschris
Copy link

Steps to reproduce

Import VM into VAPcenter VMware and Setup
Setup an Fileserver (windows) with 1 Million Files
Konfig Nextcloud to use External CIFS Shares from Windows
Import LDAP User from Windows

Expected behaviour

It works like it should but see below
Actual behaviour

there are select request hold open for a long time in nextcloud_db
Server configuration

Nextcloud server version: (see your admin page)
16.02
Server OS (Ubuntu server is default)
Ubuntu 18.04
How did you install the VM? (Scripted install from master OR Released version)
From Master
Network

Do you use DHCP?
Yes
Is port 80 and/or 443 open?
Yes
Logs / Screenshots

https://fileshare.kelobit.de/s/WLfJWbq2B246aNa

@kaoschris kaoschris added 0. Needs triage Pending check for reproducibility or if it fits our roadmap bug labels Aug 22, 2019
@kesselb
Copy link
Contributor

kesselb commented Aug 22, 2019

Is there anything in your nextcloud.log? Looks like the cron is running 3 times. Would you mind to add the other information from the issue template like configuration (without credentials) and apps?

@MichaIng
Copy link
Member

I am wondering why oc_file_locks and especially oc_storages tables are THAT huge.
To compare with mine:

-rw-rw---- 1 mysql mysql  64K Dec  4  2018 oc_file_locks.ibd
-rw-rw---- 1 mysql mysql 220M Aug 24 01:48 oc_filecache.ibd
-rw-rw---- 1 mysql mysql  64K Dec  7  2018 oc_storages.ibd
# mysql -e 'select * from nextcloud.oc_file_locks'
# mysql -e 'select * from nextcloud.oc_storages'
+-------------------------+------------+-----------+--------------+
| id                      | numeric_id | available | last_checked |
+-------------------------+------------+-----------+--------------+
| home::Micha             |          3 |         1 |         NULL |
| local::/mnt/sda/ncdata/ |          4 |         1 |         NULL |
| home::<User2>           |          5 |         1 |         NULL |
+-------------------------+------------+-----------+--------------+
  • oc_file_locks is empty and unused, which makes sense since I use Redis for file locking. @kaoschris I see you have a Redis process up as well, so it seems Nextcloud is not properly configured to use it. Database file locking causes a huge additional load, so I could imagine that resolving this could already fix your issue.
  • But what I am completely confused about is the >50M oc_storages table. This should only contain:
    • 1 table for local storage
    • 1 table for each user
    • 1 table for each external storage

So yeah, the database seems to be stuck/looping, adding entries multiple times.

Besides filling the infos from the default issue template (check postgres logs as well), shutting down webserver and waiting for or killing the running postgres processes, then checking some lines of the oc_storages table could give a hint, e.g.:

select * from nextcloud.oc_storages limit 20;
select * from nextcloud.oc_storages limit 20 offset 100;
  • To get the first 20 entries and the ones 101 - 120, to probably derive a pattern.

And try to replicate with a fresh instance with Redis-based file locking properly enabled first (which we could derive from your config.php settings as well).

@enoch85
Copy link
Member

enoch85 commented Aug 24, 2019

@MichaIng wrote:

so it seems Nextcloud is not properly configured to use it

It is done by the book. Never tested with 1 million files though.

This is the config in the VM:

  'memcache.local' => '\\OC\\Memcache\\Redis',
  'filelocking.enabled' => true,
  'memcache.distributed' => '\\OC\\Memcache\\Redis',
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'redis' => 
  array (
    'host' => '/var/run/redis/redis-server.sock',
    'port' => 0,
    'timeout' => 0.5,
    'dbindex' => 0,
    'password' => 'RANDOMPASSWORD',
  ),

And this is the install script: https://github.com/nextcloud/vm/blob/master/static/redis-server-ubuntu.sh

@MichaIng
Copy link
Member

@enoch85
Ah didn't see that this is an official Nextcloud VM install. Couldn't find anything wrong in install script/config.

If I remember right, memcache.locking fails silently (not causing any access issues) and then is reverted to database locking automatically, compared to memcache.local, which will cause a 500, if failing.

However logs (issue template info) required to investigate. Additionally, since Redis seems to be involved:

journalctl -u redis-server
  • Optionally with | tail -30 if output is too much.

@kaoschris
Copy link
Author

I am on it to bring you the logs, i need some assistance from our customer to bring you that Informations, thank you in advance for your help.. I will update this asap.

@MichaIng
Copy link
Member

MichaIng commented Aug 26, 2019

@kaoschris
If you have no direct access to the affected system (only via customer) then at best take the following logs all together (anonymized of course):

  • Nextcloud
  • Webserver
  • PHP (php-fpm I think?)
  • Redis
  • MariaDB/database
  • dmesg to check if some disk I/O or other system/kernel level errors are related

This should be the whole access order, thus all parts that can play a role.

@enoch85
Copy link
Member

enoch85 commented Aug 26, 2019

@kaoschris I just added APCu to the VM. If you want to use it, please create a file called wtaherver.sh and add this content:

. <(curl -sL https://raw.githubusercontent.com/nextcloud/vm/master/lib.sh)

# Enable igbinary for PHP 
# https://github.com/igbinary/igbinary
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z igbinary
    then
        msg_box "igbinary PHP module installation failed"
        exit
    else
        print_text_in_color "$IGreen" "igbinary PHP module installation OK!"
    fi
{
echo "# igbinary for PHP"
echo "extension=igbinary.so"
echo "session.serialize_handler=igbinary"
echo "igbinary.compact_strings=On"
} >> $PHP_INI
restart_webserver
fi

# APCu (local cache)
if is_this_installed "php$PHPVER"-dev
then
    if ! yes no | pecl install -Z apcu
    then
        msg_box "APCu PHP module installation failed"
        exit
    else 
        print_text_in_color "$IGreen" "APCu PHP module installation OK!"
    fi
{
echo "# APCu settings for Nextcloud"
echo "extension=apcu.so"
echo "apc.enabled=1"
echo "apc.shm_segments=1"
echo "apc.shm_size=32M"
echo "apc.entries_hint=4096"
echo "apc.ttl=0"
echo "apc.gc_ttl=3600"
echo "apc.mmap_file_mask=NULL"
echo "apc.slam_defense=1"
echo "apc.enable_cli=1"
echo "apc.use_request_time=1"
echo "apc.serializer=igbinary"
echo "apc.coredump_unmap=0"
echo "apc.preload_path"
} >> $PHP_INI
restart_webserver
fi

Run the file with sudo bash whatever.sh

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\APCu', in /var/www/nextcloud/config/config.php

@MichaIng
Copy link
Member

MichaIng commented Aug 26, 2019

@enoch85

Then change 'memcache.local' => '\\OC\\Memcache\\Redis', to 'memcache.local' => '\\OC\\Memcache\\Redis', in /var/www/nextcloud/config/config.php

I guess you mean:

to 'memcache.local' => '\\OC\\Memcache\\APCu',

😉

@enoch85
Copy link
Member

enoch85 commented Aug 26, 2019

Yes ofc :D

@kaoschris
Copy link
Author

Hello i got some logs and uploaded them into https://fileshare.kelobit.de/s/WLfJWbq2B246aNa
under the folder "logs"
@enoch85 what does this User Cache do? And could it help in that case? And furthermore does it come with your regular update script aswell? Thanks in advance..

@kesselb
Copy link
Contributor

kesselb commented Aug 27, 2019

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

@MichaIng
Copy link
Member

@kesselb
Indeed, wasn't there a method to print the log a bid formatted with sensitive information replaced by **** via occ command?

However I just had a quick view:

"Message":"Failed to connect to the database: An exception occurred in driver: SQLSTATE[08006] [7] FATAL:  the database system is shutting down
"message":"Could not find resource core\/vendor\/marked\/marked.min.js to load"
"message":"unlink(\/mnt\/ncdata\/appdata_*****\/css\/core\/*****-css-variables.css): No such file or directory at \/var\/www\/nextcloud\/lib\/private\/Files\/Storage\/Local.php#227"
  • Those kind of messages occur repeatedly in nextcloud.log.
  • Clearly a database issue.
  • There are also other entry that access to external storages (SMD and such) failed, but dated some days earlier, so I probably not related.
  • Redis server reports changes, so it seems to be used, although the changes are suspicious few, probably memcache.local only, but not file locking? 🤔

Some days earlier (old nextcloud.log.1):

"args":["An exception occurred while executing 'INSERT INTO \"oc_file_locks\" (\"key\", \"lock\", \"ttl\") VALUES(?, ?, ?) ON CONFLICT DO NOTHING' with params [\"files\\\/7ea46b601bdb7f5120e1218085d108f8\", -1, 1564917292]:\n\nSQLSTATE[53100]: Disk full: 7 ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device.",{"errorInfo":["53100",7,"ERROR:  could not access status of transaction 0\nDETAIL:  Could not write to file \"pg_subtrans\/0052\" at offset 147456: No space left on device."]
  • Database locking is used, which should not be the case if Redis is up. Redis logs do not include this early date.
  • More importantly, it fails because of missing drive space. I guess this can easily lead to follow up issues.

And in database, current entries:

2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db ERROR:  duplicate key value violates unique constraint "oc_credentials_pkey"
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db DETAIL:  Key ("user", identifier)=(B557DE07-CB7A-4F36-BF72-C44429C6F3D6, password::logincredentials/credentials) already exists.
2019-08-26 15:54:54.558 CEST [22776] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_credentials" ("user", "identifier", "credentials") VALUES($1, $2, $3)
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db ERROR:  deadlock detected
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db DETAIL:  Process 20573 waits for ShareLock on transaction 12499777; blocked by process 33429.
	Process 33429 waits for ShareLock on transaction 12499711; blocked by process 20573.
	Process 20573: INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING
	Process 33429: DELETE FROM "oc_file_locks" WHERE "ttl" < $1
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db HINT:  See server log for query details.
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db CONTEXT:  while inserting index tuple (0,2) in relation "oc_file_locks"
2019-08-26 20:06:37.394 CEST [20573] ncadmin@nextcloud_db STATEMENT:  INSERT INTO "oc_file_locks" ("key", "lock", "ttl") VALUES($1, $2, $3) ON CONFLICT DO NOTHING
  • Also shows database locking is used.
  • Moreover duplicate database entries are detected. These errors appear on different database tables, including sensitive ones like login credentials as can be seen above.

Okay so there is clearly much messed up already:

  • First of all the server should be completely stopped at first (webserver, Redis, PostgreSQL) to further investigate and repair.
  • Assure that all drives which hold database and data have enough free space. Although related log entries are older, probably not an issue anymore.
  • If possible a database dump/backup should be created, if possible. In case of corruption this is sometimes not possible.
  • I don't know much about postgres but there should be some way to check & repair tables, AFTER trying the backup/dump.
  • Redis server state should be verified, although, as current logs show no errors but that it is running and used, this seems to be no issue on Redis side.
  • Enabling APCu @kaoschris is no must, does not necessarily affect this particular issue, but it's faster for local caching and allows to further check, if Redis is used for file locking as well. E.g. you switch local cache to APCu, do some file transfer/changes and check if redis-server.log contains some current entries like X changes in Y seconds. Saving... with X bring not 0, so being used.
  • On postgres, while webserver/Nextcloud is down, I would also clear the oc_file_locks, and after doing some file changes, check back if this table stays empty.

If the above is done, local tests work as expected, I would run some occ maintenance:repair to cleanup database and force some rescans, before opening the instance for general use. Also occ files:scan --all makes sense, since most likely not all files have been successfully added to database.

@enoch85
Copy link
Member

enoch85 commented Aug 27, 2019

run some occ maintenance:repair

@kaoschris In the VM that's occ_command maintenance:repair

@MichaIng
Copy link
Member

@enoch85
Should this issue probably moved to the VM repo? I don't think it is a general VM bug, more particular edge case/support request that might provide ideas for general VM improvements/failsafe setup steps like APCu use to split loads.

@kaoschris
Copy link
Author

@kaoschris please remove sensitive information from the logs (e.g. account names, paths) ...

Yes didnt fly over the infos, true that email adresses where in the logs didnt thought it would. thanks for the pointing to it...

The Customer user Windows Server 2016 with File Service SMB and nextcloud connects with external storage app to that shares, the users connect with thier LDAP / Active Directory Credentials to Nextcloud and thier Rights are in use like they should. Means CEOs did have access to alle folders and employees dont. Maybe there is an Conflickt handling massive Data und SMB/CFIS Shares and filecache? The Customer hast about 1.2 Million Files and 70k Folders, its an Media Company wich builds media (web,print, etc.)

@enoch85
Copy link
Member

enoch85 commented Aug 27, 2019

Customer user Windows Server 2016 with File Service SMB and nextcloud connects with external storage app to that shares, the users connect with thier LDAP / Active Directory Credentials to Nextcloud and thier Rights are in use like they should

I can't find the issue now, but I've seen this before - customers having issues with the combination of LDAP, SMB and PostgreSQL in Nextcloud. Especially when connecting LDAP when using SMB mounted externally. I had one guy like 1 year ago where PostgreSQL created tons of files in the data directory for PostgreSQL.

I will try to find the issue and link it here.

@enoch85
Copy link
Member

enoch85 commented Aug 27, 2019

Maybe not related after all, but here it is anyway: #11743

@kesselb
Copy link
Contributor

kesselb commented Aug 27, 2019

I suspect that files / folders are shared by the fileserver and configured as user mount are added to oc_filecache for every user.

Example: Fileshare with 100.000 files. User A and User B can access the share with their personal credentials. Technically these are two different storages so 200.000 are added to oc_filecache.

public function getId() {
// FIXME: double slash to keep compatible with the old storage ids,
// failure to do so will lead to creation of a new storage id and
// loss of shares from the storage
return 'smb::' . $this->server->getAuth()->getUsername() . '@' . $this->server->getHost() . '//' . $this->share->getName() . '/' . $this->root;
}

I'm not 100% sure about that but the code above creates the storage id and contains the username so the same fileshare with another credentials leads to two storages in nextcloud.

This could explain why your oc_filecache table contains a huge amount of data given that number of users and the size of the fileshare.

Nevertheless for such a big instance redis locking will decrease your database load.

"message":"Could not find resource core/vendor/marked/marked.min.js to load"

This is fixed by some app or nextcloud update.

High cpu load / long running selects

#16432 is part of 16.0.4 and helps to reduce the sql load.

@enoch85
Copy link
Member

enoch85 commented Aug 27, 2019

@kaoschris

#16432 is part of 16.0.4 and helps to reduce the sql load.

16.0.4 was just released, VM wise. But you could use the built in script to update it. sudo bash /var/scripts/update.sh

@MichaIng
Copy link
Member

MichaIng commented Aug 27, 2019

@kesselb

I suspect that files / folders are shared by the fileserver and configured as user mount are added to oc_filecache for every user.

That totally makes sense. How should Nextcloud know that it is the same share if added by multiple users separately.

This actually means:

  • The issue is most likely not that bad after all, since huge oc_filecache is expected and not due to some database "errors".
  • Especially if the Nextcloud instance otherwise still works well, besides the database overload symptoms reported and from logs, things (cache and file locking) can be tuned on the fly. During some closing time/calm night hours, a postgres database check and clearing of oc_file_locks when Redis took over, still makes sense.
  • This means that such a setup, a large number of users adding the same network share, is not optimal. In terms of database load, size and e.g. recent files scan jobs, it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

@enoch85
Copy link
Member

enoch85 commented Aug 28, 2019

@MichaIng wrote

it would be better if one admin (or group admin) would add those network shares only once, and re-shares them to the other users or a group, Nextcloud internally. Am I right?

That's how I usually recommend doing things.

Makes sense to me at least.

@kaoschris
Copy link
Author

Okay ill then update to 16.0.4 and will do some maintenance on the table itself to set some Storage Free and then we rearange the acces rights to Nextcloud level instead of Windows Level.. I'll inform you guys if this helped out...

thanks for helping out ill catch up on that

@ghost
Copy link

ghost commented Sep 27, 2019

This issue has been automatically marked as stale because it has not had recent activity and seems to be missing some essential information. It will be closed if no further activity occurs. Thank you for your contributions.

@ghost ghost added the stale Ticket or PR with no recent activity label Sep 27, 2019
@ghost ghost closed this as completed Oct 11, 2019
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap bug needs info stale Ticket or PR with no recent activity
Projects
None yet
Development

No branches or pull requests

4 participants