-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log_visit.referer_url and log_action.name as TEXT instead of VARCHAR cause a lot of tmp table on disk #14535
Comments
@mattab that's interesting. Maybe would also help with |
Thanks for the report @airblag
Reckon that it should be fine for more than 99.9% of cases to have only 65535 character long URLs. |
@mattab |
We will schedule this for Matomo 4. Thanks for this @airblag |
@mattab I'm thinking we could actually already apply this change for new installs and only include the update in Matomo 4. This will save many thousands of migrations later. |
I think we should not make a varchar(65535) but something a bit smaller. The whole row in mysql can be maximum 65535 bytes long, so we need to let some place for the other fields in the row. We should make something like varchar(65535 - size(all other columns)) at maximum, and maybe let some place to extend the table with some extra rows for future schema changes. |
Yeah I reckon be better to set to something like 50 or 60K just to be safe. |
Let's change the type now for new installs, and already add an update for Matomo 4.0 which will be executed once they update to Matomo 4. |
We converted both TEXT fields into varchar in our production setup. But it still doesn't seem to reduce anyhow the amount of tables on disk created. I just checked all the tables of my piwik DB and grepped for text|blob and there are a lot more ! My dirty way of finding out : Maybe it would be good to check in the requests updating/inserting log data which text/blobs are used, and first replace only thoses fields with varchar. I'll try to have a look, but since I never looked into the code of matomo, it might be easier for someone who is familiar with it :) |
Let us know what you find. Wouldn't think it creates a temporary table during tracking but never know. |
So, it took my day of work, but I found out which query is producing the temporary table on disk !
A few notes/questions about this request :
I grepped through the code and it seems to be produced in @tsteur do you have an idea if 2. or 3. can be done without breaking something ? Some notes about debuggingFor getting it, I tried first to activate debugging and sql_profiling in matomo, but I hit the bug of logging to file which didn't worked. I applied manualy the patch for the https://patch-diff.githubusercontent.com/raw/matomo-org/matomo/pull/14296.patch on plugins/Monolog/config/config.php. But I couldn't find the correct requests in the log file. Maybe because I use the QueuedTracking plugin ? Then I just logged every query of mysql for a few seconds (just enough to let QueuedTracking write in the DB):
|
There is an index on
There used to be cases where the same action was added twice due to some race condition issues or something. See #6436 . The PR was done in #7112 . Looking at this, it still seems needed. Can you confirm @diosmosis ? Simply because there is no unique key on the DB and duplicates can happen. Wouldn't really know how to avoid it but maybe it be faster to change the Actually... seeing now this is already implemented. So the only reason the
The group by is there for the |
@tsteur We could just select the highest ID in PHP instead of in MySQL, would that work? |
@diosmosis that should work too. Considering there shouldn't be too many entries |
Does that mean we could remove the |
I don't think we can just remove it w/o something that deals w/ duplicates (that come from old bugs and from any new bugs), but if it causes problems we shouldn't have it. |
I meant if we change SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' ) OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' ) OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' ) OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' ) GROUP BY type, hash, name; to something like SELECT idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' ) OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' ) OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' ) OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' ) then we could do some logic like $idActions = array();
foreach ($rows as $row) {
$idAction = $row['idaction'];
$name = $row['name'];
if (!isset($idActions[$name])) {
$idActions[$name] = $row;
} else if ($idAction < $idActions[$name]['idaction']) {
$idActions[$name] = $row;
}
}
return array_values($idActions); Not tested or so... I simply meant if we don't fetch like crazy huge amount of values there, which we never do maybe I think, this could be faster and avoid tmp tables? |
Yes that's what I meant too. |
Ok, so the crc32 is there to make the index work better, but the probability of collision is to high, so we check both hash and name in the where clause ? What I was meaning in 3. was only to remove SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' ) OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' ) OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' ) OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' ) GROUP BY type, hash; since we filtered out possible collisions with the WHERE clause, I guess only grouping by type,hash should work too, or I am still missing something ? |
If you |
Considering the chances are small that we are fetching a few values that result in the same hash we could apply the quick fix to simply remove the |
it be great if you could patch your Matomo with the referenced PR. Feel free to wait though until there was a review. |
I just patched with the PR, and the tmp_tables dropped from 30 per seconds to around 0. Looks like it works for me :) EDIT: |
When I look at this page, there are still a plenty of cases where temporary tables might be created : https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html But there are only 3 cases where they have to be created on disk :
So converting TEXT to VARCHAR might avoid also creating temporary tables on disk in queries done by future features/plugins, even if it's solved for the current release. In this case it sounds like it was the "GROUP BY name" |
Be great to migrate these columns to VARCHAR:
and there is also:
|
See #14859 |
* Change column type for referer_url from text to varchar refs #14535 * Update ReferrerUrl.php
This seems done actually. |
FYI we decided to truncate URLs after 4096 characters. Hopefully it won't break people's analytics. If you're concerned about this change, let us know! we added |
Note: Referrer urls will be truncated to 1500 characters |
Hi,
My matomo installation is "big" (log_link_visit_action is ~40GB big for 55 days retention, log_visit is ~4GB), and we see over 97% of temporary tables created on disk in our mysql monitoring.
After trying to tune mysql with increasing tmp_table_size without any sucess I noticed that the log_visit table has the referer_url field set as TEXT.
it blocks the use of the memory engine of mysql for tmp tables writing all of them to disk :
https://dev.mysql.com/doc/refman/5.7/en/blob.html
"[...]
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.4, “Internal Temporary Table Use in MySQL”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
[...]"
I'm not familiar with the code of matomo to look for the exact query producing the tmp_tables...
Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?
I tried to move the tmpdir of mysql to ramdisk, but since I try to optimize the DB once a week, and optimizing innodb needs a lot of place on /tmp it's always failing to optimize log_link_action_visit except I give more 40GB for my ramdisk...
The text was updated successfully, but these errors were encountered: