-
Notifications
You must be signed in to change notification settings - Fork 10.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Csv importer: work with more problematic data #7456
Conversation
@@ -254,7 +298,7 @@ Importer.CSV = class ImporterCSV extends Importer.Base { | |||
const creator = this.getUserFromUsername(msg.username); | |||
if (creator) { | |||
const msgObj = { | |||
_id: `csv-${ csvChannel.id }-${ msg.ts }`, | |||
_id: `csv-${ csvChannel.id }-${ roomCounter }-${ msg.ts }`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only issue with this is that people who try to reimport data that has been imported before this change will see their data again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the code doing our exports take a timestamp range as input with output always being ordered by room-specific counters. The importer code doesn't change that order per room, so the extra counter should be stable for imports done with it.
The only way I see this code duplicating data on re-import is these two cases:
- The data being exported in a different order or file splitting, instead of the same files being used.
- The data being a different slice of the original chats, but including a percentage of already-imported ones.
Both of these seem to me like edge cases and not very likely to happen.
Before this change, the error crashed the import entirely, reported it in the log, but didn't mark the import as failed and didn't stop the querying on the client side. This made it annoying to find (backtracking through the logs) and required the restart of the RC instance to reset the import status.
This crash is something this pull request doesn't fix. It only fixes the different causes I encountered for crashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@graywolf336 Is my explanation satisfactory, or should I rewrite the code to store the last timestamp and add a suffix on dup timestamps, so old imports won't ever have the chance to be duplicated? Or even store an array of seen timestamps...otherwise the code is dependent on the csv line order to be time-based.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it is good, however you might want to include those caveats in the pull request's description so that it's easily visible to anyone who views this pull request from the change log. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@graywolf336 I tried to change the code so the IDs won't have an extra counter unless needed, so my changes won't break old re-import retries and noticed something odd.
It does work, but after the import the msgs counter in the admin rooms list double up. No duplicates in the room view that I noticed...
The funny thing, this happens with the old import code and IDs as well - duplicate messages counts, but unique when viewed.
Does this make sense? If so, I'll update the pull request with the safer code.
Otherwise two messages that received the same timestamp in the same room will break the import.
Interface clean up, with users&channels being allowed to be empty for CSV import.
Added view cleanup for skipped files and a block for total message count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the improvements.
This way, it will re-import old imports just fine. For new imports, the order of timestamps in the CSV files will have to be the same, always. So, either using the same dumps or enforcing order on the CSV dump is required.
[FIX] Csv importer: work with more problematic data
This is a continuation of #6768, fixing import when users.csv, channels.csv or both are missing.
It also fixes issues hit by doing a massive import from our internal chat implementation into RC.