Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Remove duplicates in the user_ips table and add an index #4370

Merged
merged 23 commits into from
Jan 11, 2019

Conversation

hawkowl
Copy link
Contributor

@hawkowl hawkowl commented Jan 9, 2019

No description provided.

@codecov-io
Copy link

codecov-io commented Jan 9, 2019

Codecov Report

Merging #4370 into develop will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4370      +/-   ##
===========================================
- Coverage    73.68%   73.64%   -0.05%     
===========================================
  Files          300      300              
  Lines        29815    29815              
  Branches      4895     4897       +2     
===========================================
- Hits         21970    21958      -12     
- Misses        6407     6419      +12     
  Partials      1438     1438
Impacted Files Coverage Δ
synapse/util/stringutils.py 61.7% <0%> (-6.39%) ⬇️
synapse/api/auth.py 87.96% <0%> (-2.78%) ⬇️
synapse/handlers/search.py 80.24% <0%> (ø) ⬆️
synapse/handlers/federation.py 61.72% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55c3e85...497945f. Read the comment docs.

@hawkowl hawkowl requested a review from a team January 9, 2019 14:00
@erikjohnston
Copy link
Member

A couple of notes:

  1. Given how large this table is I'd probably want to do this using the batching mechanism (which correctly balances the batches so that it doesn't overload the DB)
  2. We probably want to have the index be (user_id, access_token, ip), i.e. remove user_agent and add user_id. We don't really care about keeping unique UA's, we just use them to have an idea about what type the device is, and it really shouldn't change much in practice anyway. We add user_id mainly so that we can reuse the existing index during the de-duplication process (and then we can drop the existing index after adding the new one)
  3. We need to be careful that new inserts don't add duplicates while we're de-duplicating.

The way I'd probably do this is by iterating over the table using last_seen, and then deleting rows that have duplicate entries. Something like:

  1. Find a span of last_seen times that have roughly the right number of rows:
SELECT last_seen FROM user_ips
WHERE last_seen > ?
ORDER BY last_seen 
LIMIT 1
OFFSET ?
  1. Fetch all rows in that span which have duplicate entries (I've checked this query is fast enough):
SELECT user_id, access_token, ip, MAX(last_seen)
FROM (
    SELECT user_id, access_token, ip
    FROM user_ips
    WHERE ? <= last_seen AND last_seen < ?
    ORDER BY last_seen
) c
INNER JOIN user_ips USING (user_id, access_token, ip)
GROUP BY user_id, access_token, ip;
HAVING count(*) > 1
  1. Delete all rows matching the returned (user_id, access_token, ip) and reinserting one row with the returned last_seen.

  2. Repeat.

After we've successfully walked the entire table we probably need to:

  1. Stop inserting into user_ips
  2. Rerun the deduplicate for the time span since we started deduplication, i.e. for the period we might have inserted into
  3. Create the unique index.
  4. Allow insertions again.

Otherwise we risk inserting duplicates while trying to build the unique index :(

@hawkowl
Copy link
Contributor Author

hawkowl commented Jan 11, 2019

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Would be good to add a final BG update to remove the old index

VALUES (?, ?, ?, ?, ?, ?)
""",
(user_id, access_token, ip, device_id, user_agent, last_seen)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May as well use the _simple_{insert,delete}_txn wrappers here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeahhhh but everything else in this function is sql :P

synapse/storage/client_ips.py Show resolved Hide resolved
uid, access_token, ip = key
if uid == user_id:
user_id, access_token, ip = key
if user_id == user_id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

yeah need some coffee

ORDER BY last_seen
) c
INNER JOIN user_ips USING (user_id, access_token, ip)
GROUP BY user_id, access_token, ip;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spurious railing ;

(user_id, access_token, ip, device_id, user_agent, last_seen)
)

self._background_update_progress_txn(txn, {"last_seen": last_seen})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing update name param

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants