Extract timestamp metadata from external parsers output #467

lfcnassif · 2021-04-13T00:50:28Z

ExternalParsers.xml could be updated with regex rules to extract metadata from output of:

RecyclerBinParser
RecycleInfo2Parser
PrefetchParser
EvtxLogParser
EvtLogParser
SuperFetchParser

patrickdalla · 2023-06-14T09:43:19Z

Which would be the best design option to implement this?

A task with a configurable option to choose which media types to extract timestamps from?
A utility class to be called from ExternalParser (or any other parser that wants to extract timestamps based on a content regex search)?
Any other option?

patrickdalla · 2023-06-14T10:58:31Z

The task's config could map each MIME type to its corresponding eventType name.

lfcnassif · 2023-06-14T13:29:17Z

Actually, this is not related to Tasks. I don't think we should create a Task to parse a Parser's output; this should be done inside the parser.

ExternalParsers.xml already supports a syntax where you can specify a regex to search for a pattern and extract it as a metadata entry. Take a look at the FFmpeg and ExifTool examples commented at the top of that file. Maybe it is enough to handle the output of some external parsers. Maybe it is not enough to handle others, and the syntax should be extended.

If needed, before extending it, I would take a look at Tika's current ExternalParser code (from which ours was copied long ago). I think they have improved it, and maybe their enhancements could help us.

patrickdalla · 2023-06-14T13:52:47Z

Nice. It seems straightforward, provided no transformation is required.

patrickdalla · 2023-06-14T18:59:15Z

I'm not sure whether IPED already recognizes metadata values as dates in formats other than the common ISO 8601. If not, date parsing will need to be implemented, as at least sccainfo uses a different format.
```
sccainfo 20221027

Windows Prefetch File (PF) information:
Format version : 30
Prefetch hash : 0x6579e144
Executable filename : SVCHOST.EXE
Run count : 4
Last run time: 1 : Jun 13, 2023 14:00:32.055367400 UTC
Last run time: 2 : Jun 11, 2023 15:24:48.658356900 UTC
Last run time: 3 : Jun 10, 2023 15:23:19.200739100 UTC
Last run time: 4 : Jun 08, 2023 15:20:42.012481300 UTC
```

lfcnassif · 2023-06-14T19:26:11Z

A few formats are supported/parsed by the DateUtil class. Adding more formats there is an option, and it can benefit all parsers, not just the external ones.
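As an illustration of what parsing the sccainfo format shown earlier could look like, here is a minimal sketch using `java.time`. The class and method names are hypothetical and are not taken from IPED's DateUtil; the pattern is inferred only from the sample output in this thread (English month abbreviation, two-digit day, nine fractional digits, literal `UTC` suffix).

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not IPED's actual DateUtil.
class PrefetchDateSketch {

    // Matches timestamps such as "Jun 13, 2023 14:00:32.055367400 UTC".
    // Locale.ENGLISH is fixed so month names do not depend on the system locale.
    private static final DateTimeFormatter SCCAINFO_FORMAT =
            DateTimeFormatter.ofPattern("MMM dd, yyyy HH:mm:ss.SSSSSSSSS 'UTC'", Locale.ENGLISH);

    // Parses the text as a UTC timestamp; throws DateTimeParseException
    // if the text does not follow the assumed format.
    static Instant parseSccainfoDate(String text) {
        LocalDateTime local = LocalDateTime.parse(text.trim(), SCCAINFO_FORMAT);
        return local.toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(parseSccainfoDate("Jun 13, 2023 14:00:32.055367400 UTC"));
    }
}
```

Whether a format like this belongs in a shared utility or stays specific to one parser is the design question discussed above; the sketch only shows that the format itself is straightforward to handle.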

used as metadata key and value. If inexistent, uses group 1 and 2 to keep backward compatibility. - Created regex for Recyclebin and Prefetch timestamps extractions. - Created parsing logic on DateUtil class to parse Prefetch dates.
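The note above is truncated, so exactly which groups it refers to is an assumption here. The sketch below assumes named regex groups supply the metadata key and value, with a fallback to groups 1 and 2 when the pattern defines no such names. The group names `key` and `value` and the class name are hypothetical, not IPED's.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not IPED's ExternalParser code.
class MetadataRegexSketch {

    // Assumed group names; the names actually used by IPED may differ.
    static final String KEY_GROUP = "key";
    static final String VALUE_GROUP = "value";

    // Collects key/value pairs from every match of the pattern in the tool output.
    static Map<String, String> extract(Pattern pattern, String output) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher m = pattern.matcher(output);
        while (m.find()) {
            String key;
            String value;
            try {
                key = m.group(KEY_GROUP);
                value = m.group(VALUE_GROUP);
            } catch (IllegalArgumentException noNamedGroups) {
                // The pattern defines no such named groups: fall back to
                // groups 1 and 2 (the pattern must then have at least two groups).
                key = m.group(1);
                value = m.group(2);
            }
            if (key != null && value != null) {
                metadata.put(key.trim(), value.trim());
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        String output = "Run count : 4\n"
                + "Last run time: 1 : Jun 13, 2023 14:00:32.055367400 UTC\n";
        Pattern named = Pattern.compile(
                "(?m)^(?<key>Last run time: \\d+)\\s*:\\s*(?<value>.+)$");
        Pattern positional = Pattern.compile("(?m)^(Run count)\\s*:\\s*(\\d+)$");
        System.out.println(extract(named, output));
        System.out.println(extract(positional, output));
    }
}
```

Both patterns in `main` are written against the sccainfo sample from this thread and are only examples, not the rules that were added to ExternalParsers.xml.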

patrickdalla · 2023-06-21T19:45:39Z

@lfcnassif,

I've noticed that, with the current behaviour of ExternalParsers.xml, when a metadata extraction regex pattern exists the output is not exported and indexed as content; only the metadata is extracted.
I thought about changing it to export the output as the item content even if there is a metadata tag, but this could break backward compatibility with users' config files.
To keep compatibility, we could add an attribute to the metadata tag (onlyParseMetadata='false') and keep the current behavior if that attribute is not specified.

What do you think?

lfcnassif · 2023-06-21T20:06:33Z

I certainly agree with extracting the external tools' output as parsed content even if there are regex patterns to extract metadata. But I think a new onlyParseMetadata='false' option is not needed. Even if users have a custom xml with metadata extraction rules, they wouldn't stop getting their metadata with the new behavior; they would just get additional information as content. Actually, I thought this was the current behavior; I didn't know about this Tika behavior detail.

patrickdalla · 2023-06-22T12:18:15Z

"I didn't know about this Tika behavior detail" - Well, it seems reasonable to have this option to parse only metadata. ExifToolParser, for example, should extract additional metadata, but not change the image content. We should not even change the textual content of the image, as it may be added by the OCR parser.

So, if we want to keep the full output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?
Anyway, for some parsers the output can be added directly as the textual content (Recyclebin, for example); for others, it should be added as subitems so as not to change textual content added by other parsers. So a config parameter is necessary to control this behavior for each specific parser.

patrickdalla · 2023-06-22T12:47:33Z

"So, if we want to keep the full output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?"

Maybe, instead of an attribute, we should define a tag indicating whether the output content should be ignored, added to the item's textual content (concatenating or replacing it), or added (stdout or stderr) as a subitem.
And the default behavior, if this tag is omitted, should be the current one, i.e., add the output as the textual content if no metadata tag is configured and ignore it if any metadata is defined.

patrickdalla · 2023-06-22T13:11:11Z

"Actually I thought this was the current behavior, I didn't know about this Tika behavior detail."
This does not seem to be the original ExternalParser Tika behavior, as can be inferred from https://github.com/apache/tika/blob/24446d99e13f157a1de5727f25fb20e5b9788381/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59.

This is a behavior of the IPED implementation, maybe intentional, precisely so as not to change the textual content of images or videos that should only be complemented with metadata from external tools.

lfcnassif · 2023-06-22T14:12:05Z

> Well, it seems reasonable to have this option to parse only metadata. ExifToolParser, for example, should extract additional metadata, but not change the image content. We should not even change the textual content of the image, as it may be added by the OCR parser.

OK, we can have this option, no problem for me. As for overriding another parser's output, that cannot happen today. If we use the MultipleParser, it concatenates the output of all parsers and merges all metadata extracted by them. If MultipleParser is not used, just one parser can run over a file. This already happens today: you can't have OCR content plus just ExifTool metadata without using the MultipleParser; you get one or the other.

> So, if we want to keep the full output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?
> Anyway, for some parsers the output can be added directly as the textual content (Recyclebin, for example); for others, it should be added as subitems so as not to change textual content added by other parsers. So a config parameter is necessary to control this behavior for each specific parser.

This subitem creation from external tool output is an old idea. I think it can be useful if the external tool expands some kind of container (edited: not sure if it is possible to model in the xml the consumption of subitems produced by arbitrary external tools), or if its output is so huge (e.g. evtx) that it would be better to break it into several subitems. So I think importing the tool output as a subitem can be a good new option, keeping the old behavior for current parsers, and another option to split the output into several subitems based on size or some pattern could also be useful.

lfcnassif · 2023-06-22T14:32:20Z

> Maybe, instead of an attribute, we should define a tag indicating whether the output content should be ignored, added to the item's textual content (concatenating or replacing it), or added (stdout or stderr) as a subitem.

After I added a simple MultipleParser to IPED (https://github.com/sepinf-inc/IPED/blob/master/iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/MultipleParser.java), which concatenates the output of the configured parsers and merges all Metadata, Tika released a more configurable MultipleParser abstract class with some implementations (https://github.com/apache/tika/tree/main/tika-core/src/main/java/org/apache/tika/parser/multiple). I think content extraction as a subitem was not added there, but I think it is useful in some scenarios where concatenating the output of several (external) parsers does not make sense (how would you concatenate a txt output with a PDF one?). Not sure if this should be added to the ExternalParser, to the MultipleParser, or to both...

> And the default behavior, if this tag is omitted, should be the current one, i.e., add the output as the textual content if no metadata tag is configured and ignore it if any metadata is defined.

OK.

lfcnassif · 2023-06-22T14:34:26Z

> This does not seem to be the original ExternalParser Tika behavior, as can be inferred from https://github.com/apache/tika/blob/24446d99e13f157a1de5727f25fb20e5b9788381/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59.

Maybe it is not the current Tika behavior, but it may have been the past one. I copied the ExternalParser code several years ago, and the Tika code has improved since.

patrickdalla · 2023-06-22T17:28:09Z

Well, for the scope of this specific issue, it was enough to define a tag named appendTextContent.

If it is present together with the metadata tag, stdout will be appended as the item's textual content.
If only the metadata tag is present, without appendTextContent, stdout will be used only for metadata extraction.
If neither is present, stdout will be appended as the item's textual content.

This keeps backward compatibility.
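The three rules above can be sketched as a small decision function. This is only an illustration: the class, enum, and method names are hypothetical and do not come from the IPED code base. The fourth combination (appendTextContent present without a metadata tag) is not covered by the rules; the sketch assumes stdout is then used as text content.

```java
// Hypothetical sketch of the rules above, not IPED's implementation.
class StdoutHandlingSketch {

    enum StdoutUse { TEXT_CONTENT_ONLY, METADATA_ONLY, METADATA_AND_TEXT_CONTENT }

    // hasMetadataTag: the parser config declares a metadata tag.
    // hasAppendTextContentTag: the parser config declares appendTextContent.
    static StdoutUse decide(boolean hasMetadataTag, boolean hasAppendTextContentTag) {
        if (!hasMetadataTag) {
            // No metadata rules: stdout becomes the item's textual content,
            // which is the pre-existing behavior. The case where only
            // appendTextContent is present is an assumption of this sketch.
            return StdoutUse.TEXT_CONTENT_ONLY;
        }
        // Metadata rules exist: stdout is also kept as text only when requested.
        return hasAppendTextContentTag
                ? StdoutUse.METADATA_AND_TEXT_CONTENT
                : StdoutUse.METADATA_ONLY;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(true, false));
        System.out.println(decide(false, false));
    }
}
```

Existing config files never contain the new tag, so they always fall into the second or third rule, which is why the old behavior is preserved.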

patrickdalla · 2023-06-22T17:39:45Z

As for splitting the output content, we can move that to a new issue. What do you think, @lfcnassif? Or is there already such an issue?

One idea is to accept output from STDOUT in TAR format, or another streamable format. Each file found would be "expanded".

lfcnassif · 2023-06-22T17:49:33Z

> Well, for the scope of this specific issue, it was enough to define a tag named appendTextContent.
>
> If it is present together with the metadata tag, stdout will be appended as the item's textual content.
>
> If only the metadata tag is present, without appendTextContent, stdout will be used only for metadata extraction.
>
> If neither is present, stdout will be appended as the item's textual content.
>
> This keeps backward compatibility.

Fine with me. I would name the tag "(use|import|extract)AsTextContent", because there is no previous text content to append the new one to, as I described before.

> As for splitting the output content, we can move that to a new issue. What do you think, @lfcnassif? Or is there already such an issue?

Agreed, there is no such issue.

patrickdalla · 2023-06-22T18:17:38Z

"I would name the tag "(use|import|extract)AsTextContent""

Tag renamed.
The config files created for the JUnit tests use the METADATA tag, but leave it empty, so the code was also adapted to remain backward compatible with those config files.

The PR has been created.

lfcnassif · 2023-06-22T18:55:08Z

> The config files created for the JUnit tests use the METADATA tag, but leave it empty, so the code was also adapted to remain backward compatible with those config files.

Great, thank you!

lfcnassif added the enhancement label Apr 13, 2021

lfcnassif mentioned this issue Sep 1, 2022

General ideas for improving the tool #1303

Closed

lfcnassif added this to To do in 4.2 via automation Feb 23, 2023

lfcnassif moved this from To do to In progress in 4.2 Jun 22, 2023

lfcnassif linked a pull request Jun 26, 2023 that will close this issue

Regex timestamp extract #1727

Open

patrickdalla mentioned this issue Jan 26, 2024

Add support for named groups in metadata extraction on ExternalParser #2062

Open

lfcnassif removed this from In progress in 4.2 Jun 18, 2024
