Extract timestamp metadata from external parsers output #467

Open
lfcnassif opened this issue Apr 13, 2021 · 19 comments · May be fixed by #1727

Comments

@lfcnassif
Member

lfcnassif commented Apr 13, 2021

ExternalParsers.xml could be updated with regex rules to extract metadata from output of:

  1. RecyclerBinParser
  2. RecycleInfo2Parser
  3. PrefetchParser
  4. EvtxLogParser
  5. EvtLogParser
  6. SuperFetchParser
@lfcnassif lfcnassif added this to To do in 4.2 via automation Feb 23, 2023
@patrickdalla
Collaborator

Which would be the best design option to implement this:

  • A task with a configurable option to choose which media types to extract timestamps from?
  • A utility class to be called from ExternalParser (or from any other parser that wants to extract timestamps based on a regex search over the content)?
  • Any other option?

@patrickdalla
Collaborator

The config of such a task could map MIME types to the corresponding eventType name.

@lfcnassif
Member Author

lfcnassif commented Jun 14, 2023

Actually, this is not related to Tasks. I don't think we should create a Task to parse a Parser's output; this should be done inside the parser.

ExternalParsers.xml already supports a syntax where you can specify a regex to search for a pattern and extract it as a metadata key; take a look at the FFmpeg and ExifTool examples commented at the top of that file. That may be enough to handle the output of some external parsers. For others it may not be enough, and the syntax would have to be extended.

If needed, before extending it, I would take a look at the current code of Tika's ExternalParser (from which ours was copied long ago). I think they have improved it, and their enhancements might help us.

@patrickdalla
Collaborator

patrickdalla commented Jun 14, 2023

Nice. It seems straightforward, should no transformation be required.

@patrickdalla
Collaborator

I'm not sure whether IPED already recognizes metadata values as dates in formats other than the common ISO 8601. If not, date parsing will have to be implemented, since at least sccainfo uses a different format:

```
sccainfo 20221027

Windows Prefetch File (PF) information:
Format version : 30
Prefetch hash : 0x6579e144
Executable filename : SVCHOST.EXE
Run count : 4
Last run time: 1 : Jun 13, 2023 14:00:32.055367400 UTC
Last run time: 2 : Jun 11, 2023 15:24:48.658356900 UTC
Last run time: 3 : Jun 10, 2023 15:23:19.200739100 UTC
Last run time: 4 : Jun 08, 2023 15:20:42.012481300 UTC
```
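As a rough illustration of the regex-based extraction being discussed, the sketch below pulls the raw "Last run time" values out of sccainfo output like the sample above. The class name, method name, and the pattern itself are assumptions made for this example; this is not the rule that ended up in ExternalParsers.xml, and the real output may use different whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of IPED.
final class PrefetchRegexSketch {

    // Assumed pattern: group 1 is the run index, group 2 the raw timestamp text.
    // \s* is used around the colons because the exact spacing is not known.
    private static final Pattern LAST_RUN =
            Pattern.compile("^Last run time:\\s*(\\d+)\\s*:\\s*(.+)$", Pattern.MULTILINE);

    /** Returns the raw timestamp strings in the order they appear in the tool output. */
    static List<String> extractLastRunTimes(String toolOutput) {
        List<String> values = new ArrayList<>();
        Matcher m = LAST_RUN.matcher(toolOutput);
        while (m.find()) {
            values.add(m.group(2).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String output = String.join("\n",
                "Executable filename : SVCHOST.EXE",
                "Run count : 4",
                "Last run time: 1 : Jun 13, 2023 14:00:32.055367400 UTC",
                "Last run time: 2 : Jun 11, 2023 15:24:48.658356900 UTC");
        extractLastRunTimes(output).forEach(System.out::println);
    }
}
```

The extracted values are still plain strings at this point; turning them into dates is a separate step.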

@lfcnassif
Member Author

The DateUtil class already supports and parses a few formats. Adding more formats there is an option, and it would benefit all parsers, not just the external ones.
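For reference, a timestamp in the sccainfo style quoted earlier in this thread ("Jun 13, 2023 14:00:32.055367400 UTC") can be parsed with the standard java.time API. This is only a sketch under the assumption that the fraction always has nine digits and the zone is always the literal "UTC"; the class and method names are hypothetical and this is not the code in IPED's DateUtil.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not IPED's DateUtil.
final class PrefetchDateSketch {

    // English month abbreviation, nine fractional digits, literal "UTC" suffix (all assumed).
    private static final DateTimeFormatter SCCAINFO_FORMAT =
            DateTimeFormatter.ofPattern("MMM dd, uuuu HH:mm:ss.SSSSSSSSS 'UTC'", Locale.ENGLISH);

    /** Parses the text as a UTC date-time; throws DateTimeParseException on mismatch. */
    static Instant parse(String text) {
        return LocalDateTime.parse(text.trim(), SCCAINFO_FORMAT).toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(parse("Jun 13, 2023 14:00:32.055367400 UTC"));
    }
}
```

Fixing the locale matters here: with the default locale, month names such as "Jun" may fail to parse on non-English systems.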

patrickdalla added a commit that referenced this issue Jun 16, 2023
used as metadata key and value. If inexistent, uses group 1 and 2 to
keep backward compatibility.
- Created regex for Recyclebin and Prefetch timestamps extractions.
- Created parsing logic on DateUtil class to parse Prefetch dates.
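The commit message above is truncated, so the exact mechanism is not visible here. One way to read "uses group 1 and 2 to keep backward compatibility" is a lookup that prefers named capture groups and falls back to positional ones. The sketch below illustrates that idea only; the group names `key` and `value` and the class name are hypothetical and may not match the commit.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the code from the referenced commit.
final class GroupFallbackSketch {

    /** Returns the (key, value) pair captured from the line, or empty if the pattern does not match. */
    static Optional<Map.Entry<String, String>> keyAndValue(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        String key;
        String value;
        try {
            // Assumed group names; Matcher.group(String) throws if the pattern lacks them.
            key = m.group("key");
            value = m.group("value");
        } catch (IllegalArgumentException noNamedGroups) {
            if (m.groupCount() < 2) {
                return Optional.empty();
            }
            // Older-style patterns: positional groups 1 and 2.
            key = m.group(1);
            value = m.group(2);
        }
        return Optional.of(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    public static void main(String[] args) {
        Pattern named = Pattern.compile("(?<key>\\w+)=(?<value>\\w+)");
        Pattern positional = Pattern.compile("(\\w+)=(\\w+)");
        System.out.println(keyAndValue(named, "a=b"));
        System.out.println(keyAndValue(positional, "c=d"));
    }
}
```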
@patrickdalla
Collaborator

@lfcnassif ,

I've noticed that, when a metadata extraction regex pattern is configured, the current behavior of ExternalParsers.xml is to extract only the metadata, without exporting and indexing the tool output as content.
I thought about changing this so that the output is exported as the item content even if there is a metadata tag. But this could break backward compatibility with users' config files.
To keep compatibility, we could add an attribute to the metadata tag (onlyParseMetadata='false') and leave the current behavior as it is when no such attribute is specified.

What do you think?

@lfcnassif
Member Author

lfcnassif commented Jun 21, 2023

I definitely agree with extracting the external tool's output as parsed content even when there are regex patterns to extract metadata. But I don't think a new onlyParseMetadata='false' option is needed. Even if users have a custom XML with metadata extraction rules, they wouldn't stop getting their metadata with the new behavior; they would just get additional information as content. Actually, I thought this was the current behavior; I didn't know about this Tika behavior detail.

@patrickdalla
Collaborator

patrickdalla commented Jun 22, 2023

"I didn't know about this Tika behavior detail" - Well, it seems reasonable to have this option to parse only metadata. ExifToolParser, for example, should extract additional metadata but not change the image content. We should not even change the textual content of the image, since it may have been added by the OCR parser.

So, if we want to keep the complete output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?
Anyway, for some parsers the output can be added directly as the textual content (Recyclebin, for example); for others, it should be added as subitems so as not to change textual content added by other parsers. So a config parameter is necessary to control this behavior for each specific parser.

@patrickdalla
Collaborator

patrickdalla commented Jun 22, 2023

"So, if we want to keep the complete output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?"

  • Maybe, instead of an attribute, we should define a tag indicating whether the output content should be ignored, added to the item's textual content (concatenated with it or replacing it), or added as a subitem (stdout or stderr).
  • And the default behavior, if this tag is omitted, should be the current one, i.e., add the output as the textual content if no metadata tag is configured, and ignore it if any metadata is defined.

@patrickdalla
Collaborator

patrickdalla commented Jun 22, 2023

"Actually I thought this was the current behavior, I didn't know about this Tika behavior detail."
This does not seem to be the original ExternalParser Tika behavior, as can be inferred from https://github.com/apache/tika/blob/24446d99e13f157a1de5727f25fb20e5b9788381/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59.

This is a behavior of the IPED implementation, maybe intentional, precisely so as not to change the textual content of images or videos, which should only be complemented with metadata from external tools.

@lfcnassif
Member Author

lfcnassif commented Jun 22, 2023

> Well, it seems reasonable to have this option to parse only metadata. ExifToolParser, for example, should extract additional metadata but not change the image content. We should not even change the textual content of the image, since it may have been added by the OCR parser.

OK, we can have this option, no problem for me. As for overriding another parser's output, that cannot happen today. If we use the MultipleParser, it concatenates the output of all parsers and merges all metadata extracted by them. If the MultipleParser is not used, only one parser can run over a file. This is already the case today: you can't have OCR content plus just ExifTool metadata without using the MultipleParser; you get one or the other.
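The "concatenate the text, merge the metadata" behavior described here can be pictured with the following sketch. It is not IPED's MultipleParser: the `ParserOutput` holder, the newline separator, and the multi-valued metadata map are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of combining several parsers' results; not IPED code.
final class MultipleParserSketch {

    /** Hypothetical holder for what a single parser produced. */
    static final class ParserOutput {
        final String text;
        final Map<String, List<String>> metadata;

        ParserOutput(String text, Map<String, List<String>> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Concatenates non-empty texts in order and merges metadata values key by key. */
    static ParserOutput combine(List<ParserOutput> outputs) {
        StringBuilder text = new StringBuilder();
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (ParserOutput out : outputs) {
            if (!out.text.isEmpty()) {
                if (text.length() > 0) {
                    text.append('\n');
                }
                text.append(out.text);
            }
            out.metadata.forEach((key, values) ->
                    merged.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values));
        }
        return new ParserOutput(text.toString(), merged);
    }

    public static void main(String[] args) {
        ParserOutput ocr = new ParserOutput("text found by OCR", Map.of());
        // A metadata-only parser contributes no text, so the OCR text is left untouched.
        ParserOutput exif = new ParserOutput("", Map.of("Model", List.of("ExampleCam")));
        ParserOutput all = combine(List.of(ocr, exif));
        System.out.println(all.text + " " + all.metadata);
    }
}
```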

> So, if we want to keep the complete output of the external tool command (stdout and/or stderr), it should be added as subitems. Should I implement this?
> Anyway, for some parsers the output can be added directly as the textual content (Recyclebin, for example); for others, it should be added as subitems so as not to change textual content added by other parsers. So a config parameter is necessary to control this behavior for each specific parser.

Creating subitems from external tool output is an old idea. I think it can be useful if the external tool expands some kind of container (edited: not sure whether it is possible to model, in the XML, the consumption of subitems produced by arbitrary external tools), or if its output is so huge (e.g. evtx) that it would be better to break it into several subitems. So I think importing the tool output as a subitem can be a good new option, keeping the old behavior for the current parsers. Another option, to break the output into many subitems based on size or on some pattern, could also be useful.
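To make the "break the output into several subitems based on size" idea concrete, here is a minimal sketch that splits tool output into chunks on line boundaries. Nothing like this is specified in the thread: the class name, the character-based limit, and the choice to keep over-long lines whole are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of size-based splitting; not an IPED feature.
final class OutputSplitSketch {

    /**
     * Splits output into chunks of at most maxChars characters, cutting only at line
     * boundaries. A single line longer than maxChars becomes its own oversized chunk.
     */
    static List<String> split(String output, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        if (output.isEmpty()) {
            return chunks;
        }
        StringBuilder current = new StringBuilder();
        for (String line : output.split("\\R")) {
            // Flush the current chunk if adding this line (plus its newline) would exceed the limit.
            if (current.length() > 0 && current.length() + line.length() + 1 > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        split("event 1\nevent 2\nevent 3", 16).forEach(chunk -> System.out.println("[" + chunk + "]"));
    }
}
```

Each chunk could then become the content of one subitem; splitting on a record pattern instead of a size limit would follow the same shape.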

@lfcnassif
Member Author

>   • Maybe, instead of an attribute, we should define a tag indicating whether the output content should be ignored, added to the item's textual content (concatenated with it or replacing it), or added as a subitem (stdout or stderr).

After I added a simple MultipleParser to IPED (https://github.com/sepinf-inc/IPED/blob/master/iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/MultipleParser.java), which concatenates the output of the configured parsers and merges all their Metadata, Tika released a more configurable MultipleParser abstract class with some implementations (https://github.com/apache/tika/tree/main/tika-core/src/main/java/org/apache/tika/parser/multiple). I think content extraction as a subitem was not added there, but I think it is useful in some scenarios where concatenating the output of several (external) parsers does not make sense (how would you concatenate a txt output with a PDF one?). I'm not sure whether this should be added to the ExternalParser, to the MultipleParser, or to both...

>   • And the default behavior, if this tag is omitted, should be the current one, i.e., add the output as the textual content if no metadata tag is configured, and ignore it if any metadata is defined.

OK.

@lfcnassif
Member Author

> This does not seem to be the original ExternalParser Tika behavior, as can be inferred from https://github.com/apache/tika/blob/24446d99e13f157a1de5727f25fb20e5b9788381/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59.

Maybe it is not the current Tika behavior, but it may be the past one. I copied the ExternalParser code several years ago, and Tika's code has improved since.

@patrickdalla
Collaborator

Well, for the scope of this specific issue, it was enough to define a tag named appendTextContent.

  1. If it is present together with the metadata tag, stdout will be appended as the item's textual content.
  2. If only the metadata tag is present, without appendTextContent, stdout will be used only for metadata extraction.
  3. If neither is present, stdout will be appended as the item's textual content.

This keeps backward compatibility.
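The three rules above reduce to a small boolean decision, sketched below. The class and parameter names are made up for illustration and this is not the code from the PR. The fourth combination (text-content tag present, no metadata tag) is not listed in the comment; treating it as "use stdout as text" is an assumption.

```java
// Hypothetical decision helper mirroring the three rules; not the PR's implementation.
final class OutputModeSketch {

    /**
     * @param hasMetadataTag    whether metadata extraction patterns are configured
     * @param hasTextContentTag whether the text-content tag is present
     * @return true if stdout should become the item's textual content
     */
    static boolean useStdoutAsText(boolean hasMetadataTag, boolean hasTextContentTag) {
        // Rule 1: tag + metadata    -> text (and metadata extraction still happens).
        // Rule 2: metadata only     -> metadata extraction only, no text.
        // Rule 3: neither           -> text, as before.
        // Assumed: tag, no metadata -> text.
        return hasTextContentTag || !hasMetadataTag;
    }

    public static void main(String[] args) {
        System.out.println(useStdoutAsText(true, true));
        System.out.println(useStdoutAsText(true, false));
        System.out.println(useStdoutAsText(false, false));
    }
}
```

Rules 2 and 3 are exactly the behavior described earlier in the thread for existing config files, which is why omitting the new tag changes nothing.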

@patrickdalla
Collaborator

As for splitting the output content, we can move that to a new issue. What do you think, @lfcnassif? Or is there already such an issue?

One idea is to accept output from stdout in TAR format, or in another streamable format. Each file found would be "expanded".

@lfcnassif
Member Author

> Well, for the scope of this specific issue, it was enough to define a tag named appendTextContent.
>
>   1. If it is present together with the metadata tag, stdout will be appended as the item's textual content.
>   2. If only the metadata tag is present, without appendTextContent, stdout will be used only for metadata extraction.
>   3. If neither is present, stdout will be appended as the item's textual content.
>
> This keeps backward compatibility.

Fine by me. I would name the tag "(use|import|extract)AsTextContent", because there is no previous text content to append a new one to, as I described before.

> As for splitting the output content, we can move that to a new issue. What do you think, @lfcnassif? Or is there already such an issue?

Agreed, there is no such issue.

@patrickdalla
Collaborator

patrickdalla commented Jun 22, 2023

"I would name the tag '(use|import|extract)AsTextContent'"

  • Tag renamed.
  • The config files created by the JUnit tests use the METADATA tag, but leave it empty. So the code was also adapted to stay backward compatible with the JUnit tests' config files.

The PR is created.

@lfcnassif
Member Author

>   • The config files created by the JUnit tests use the METADATA tag, but leave it empty. So the code was also adapted to stay backward compatible with the JUnit tests' config files.

Great, thank you!

@lfcnassif lfcnassif moved this from To do to In progress in 4.2 Jun 22, 2023
@lfcnassif lfcnassif linked a pull request Jun 26, 2023 that will close this issue
@lfcnassif lfcnassif removed this from In progress in 4.2 Jun 18, 2024

2 participants