Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telegram iOS DB parser enhancements and bug fixes (#1998, #2000, #2003 and #2005) #1999

Merged
merged 35 commits into from
Jan 25, 2024

Conversation

wladimirleite
Copy link
Member

@wladimirleite wladimirleite commented Nov 24, 2023

Closes #1998, #2000, #2003 and #2005.

@lfcnassif
Copy link
Member

Thank you @wladimirleite! I'll collect samples to help testing this a bit more.

@wladimirleite
Copy link
Member Author

Processing 12 Telegram iOS databases I collected here, I got:

Master This PR
Chats 841 1,354
Messages 23,170 81,987

@lfcnassif
Copy link
Member

Hi @wladimirleite! Here are the testing results on 19 Telegram iOS databases I collected here so far:

Master This PR
Chats 1,633 2,762
Messages 6,938 61,999

After commit I pushed, we are recognizing more Telegram databases. In the other hand, 13 new exceptions are being thrown because of no such table t0 (ios user account extract query) and no such table t2 (ios contacts extract query). We should handle new tables from new versions properly...

@wladimirleite
Copy link
Member Author

After commit I pushed, we are recognizing more Telegram databases. In the other hand, 13 new exceptions are being thrown because of no such table t0 (ios user account extract query) and no such table t2 (ios contacts extract query). We should handle new tables from new versions properly...

Great!
I can take a look at this, unless you already started. If not, please send me some sample databases that caused these exceptions.

@lfcnassif
Copy link
Member

I can take a look at this, unless you already started. If not, please send me some sample databases that caused these exceptions.

I didn't, this week I'm in a meeting all days... I'll send you by Teams, thank you very much @wladimirleite!

@lfcnassif
Copy link
Member

please send me some sample databases that caused these exceptions.

Just sent.

@wladimirleite
Copy link
Member Author

@lfcnassif, taking a closer look in the samples you sent me and others I have here, I think that the databases with different set tables, like (t0, t2 and t3 only) and (t15, t16, t17, t19, t20 and t21), should not be identified as "application/x-telegram-db-ios".

Although they are named "db_sqlite" and used by Telegram, they contain different types of information, not related to contacts or messages chats. And they are not a newer version of the Telegram database we support.

Below is an example of a folder structure present in a very recent iOS extraction.
Although all files are named "db_sqlite", only the first one is actually the kind of database that contains contacts and chats.
image

So I am not sure if the changes you made for #2000 are the best approach.
The main issue with the previous detector code (before your changes) was expecting a "ft41" table, which is not present in all "valid" databases.

Tables actually used are t0, t2, t6, t7 and t9.
My suggestion is check only for these tables, as all the samples I have, more recent or older ones, including the ones you sent me, have these tables.
I would also remove detection by name, as it is detecting false positives.
Do you see any drawback of making these changes?
Anyway, I will commit the changes I made, after running more tests, so you can test with your set of databases when you have some time.
We can revert them later, if necessary.

@wladimirleite
Copy link
Member Author

@lfcnassif, I found another (not directly related) issue that some databases I have here are not being parsed because of an exception in contacts parsing.
Looking at the code and the raw bytes, I found that the decoding process uses the function findOfset(key, tString), which may find incorrect offsets in rare situations.

For example, when extracting the "title" property of a contact, it calls that function to find a "t".
Then it will search for 3 bytes: the length (1), 't' (116) and the type string (which is 4). The problem is that [1, 116, 4] may be another thing, not this "title" tag. So when it tries to read a string after these 3 bytes, the length of the string (next 4 bytes) may be something else (like a huge or negative number).

This is rare, but it happens a couple of times in the DB samples I have.
There is no simple bug fix, as this is more a conceptual problem.
I will try to evaluate how complex it would be to rewrite it.

@lfcnassif
Copy link
Member

@lfcnassif, taking a closer look in the samples you sent me and others I have here, I think that the databases with different set tables, like (t0, t2 and t3 only) and (t15, t16, t17, t19, t20 and t21), should not be identified as "application/x-telegram-db-ios".

Sorry @wladimirleite, I didn't test my change for false positive detections, my fault.

Tables actually used are t0, t2, t6, t7 and t9.
My suggestion is check only for these tables, as all the samples I have, more recent or older ones, including the ones you sent me, have these tables.
I would also remove detection by name, as it is detecting false positives.
Do you see any drawback of making these changes?

I totally agree with both recommendations, testing just for those tables and removing detection by name, I didn't know other different databases with db_sqlite name exist.

For example, when extracting the "title" property of a contact, it calls that function to find a "t".
Then it will search for 3 bytes: the length (1), 't' (116) and the type string (which is 4). The problem is that [1, 116, 4] may be another thing, not this "title" tag. So when it tries to read a string after these 3 bytes, the length of the string (next 4 bytes) may be something else (like a huge or negative number).

I see... If it is is a heuristic to read serialized data (I didn't check the code) maybe using a proper deserialize function may help...

@wladimirleite
Copy link
Member Author

For example, when extracting the "title" property of a contact, it calls that function to find a "t".
Then it will search for 3 bytes: the length (1), 't' (116) and the type string (which is 4). The problem is that [1, 116, 4] may be another thing, not this "title" tag. So when it tries to read a string after these 3 bytes, the length of the string (next 4 bytes) may be something else (like a huge or negative number).

I see... If it is is a heuristic to read serialized data (I didn't check the code) maybe using a proper deserialize function may help...

As this is not related to the original issues, and this is clearly a bug (although rare), I am creating another issue.

@wladimirleite wladimirleite changed the title Review Telegram chat messages decoding, to handle more recent DBs (#1998) Telegram iOS DB parser enhancements and bug fixes (#1998, #2000 and #2003) Nov 29, 2023
@wladimirleite wladimirleite marked this pull request as draft November 29, 2023 14:18
@wladimirleite
Copy link
Member Author

There are other things that could be improved in the Telegram internal parser, but this PR already include too many changes.
I am stopping here, so this can be reviewed.

@wladimirleite wladimirleite marked this pull request as ready for review December 22, 2023 23:01
@lfcnassif
Copy link
Member

lfcnassif commented Dec 22, 2023

Thank you very much @wladimirleite!

@hauck-jvsh, since you are the original author of this parser, could you review this PR when you have available time?

@hauck-jvsh
Copy link
Member

hauck-jvsh commented Jan 3, 2024

I can review, if time isn't a problem, this when I return from my vacation.

@hauck-jvsh
Copy link
Member

I will start reviewing this tomorrow, is there any chances of getting some IOS databases? @lfcnassif, @wladimirleite

@wladimirleite
Copy link
Member Author

I will start reviewing this tomorrow, is there any chances of getting some IOS databases? @lfcnassif, @wladimirleite

I am sending you (through Teams) the samples I have (the whole folder, which include also a few Android databases). It also includes iOS samples @lfcnassif sent me recently.

@lfcnassif
Copy link
Member

I will start reviewing this tomorrow, is there any chances of getting some IOS databases? @lfcnassif, @wladimirleite

Thank you @hauck-jvsh! I think all my samples are included in @wladimirleite's data set.

@hauck-jvsh
Copy link
Member

hauck-jvsh commented Jan 18, 2024

Excellent work, you have rewrite the PostBoxCoding from scratch, now you are decoding the entire object instead of the previous approach of find specific key inside the bytes. I think that now it is much more general, preventing decoding data with the wrong type or even data from other places.

@hauck-jvsh
Copy link
Member

@wladimirleite, I'm running some tests here and it is taking a lot of time to process. I think that this is same problem you mentioned before, that messages have thousands of number in the messageTo metadata. Did you forget to push the changes?

@wladimirleite
Copy link
Member Author

@wladimirleite, I'm running some tests here and it is taking a lot of time to process. I think that this is same problem you mentioned before, that messages have thousands of number in the messageTo metadata. Did you forget to push the changes?

Yes, it takes way too long for some DBs. Discussing with @lfcnassif, I created a new issue (#2011), as it is not directly related to the issues addressed by this PR, and ideally any solution (probably changing the current behavior) would have to be applied to other chat parsers.
To test this PR, I commented out the loop that sets "Communication:To" property for each message (for groups chats).

@hauck-jvsh
Copy link
Member

After digging a little I manage to extract some thumbs from the database.

@hauck-jvsh
Copy link
Member

Just one thing, previously the default behavior was to use the UFED resultfor IOS and the internal parser for Android, but now the default behavior will change, but it looks good to me, I think that we can merge.
@wladimirleite can you check the extraction of thumbs please?

@wladimirleite
Copy link
Member Author

@wladimirleite can you check the extraction of thumbs please?

Sure! I will take a look as soon as possible.

@hauck-jvsh hauck-jvsh self-requested a review January 25, 2024 14:24
@wladimirleite
Copy link
Member Author

@wladimirleite can you check the extraction of thumbs please?

@hauck-jvsh, the thumbnail extraction worked great!
That is a really useful enhancement! Thank you!!!

I ran a few more tests, checking with an iPhone that I am working on, and everything looks fine.

I just pushed two minor commits (code formatting and preserving line breaks in messages). The line breaks make some messages more readable (I made a similar change in WhatsApp parser recently).

@hauck-jvsh
Copy link
Member

I think that this is ready for merging. I had to reinstall my eclipse and I didn't apply the project formatting style.

@lfcnassif
Copy link
Member

Sorry guys, I had to force push to fix a wrong conflict resolution pushed before, seems fine now.

Thank you very much @wladimirleite and @hauck-jvsh for this great work!

@lfcnassif lfcnassif merged commit 183016b into master Jan 25, 2024
2 checks passed
@lfcnassif lfcnassif deleted the #1998_FixTelegramMsgDecode branch January 25, 2024 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants