Improve and rename persistentID to trackID #784

Closed
lfcnassif opened this issue Oct 16, 2021 · 31 comments · Fixed by #934 or #937
@lfcnassif
Member

lfcnassif commented Oct 16, 2021

This is an ID that doesn't change between runs. It is used when resuming processing, to skip already processed items. It is built by hashing several concatenated values: path, idInDataSource (e.g. sleuthID, ad1ID, ufdrID), subitemId, parentContainerPersistentID.
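
A minimal sketch of how such an ID could be computed and used to skip items on resume. This is not IPED's actual code: the function names, the SHA-256 hash, and the separator character are all assumptions made for illustration.

```python
import hashlib

def persistent_id(path, id_in_data_source=None, subitem_id=None,
                  parent_container_id=None):
    """Hash the concatenated identifying fields into one stable hex ID."""
    fields = [path, id_in_data_source, subitem_id, parent_container_id]
    # Join with a control character (assumed not to occur in the fields) so
    # that ("ab", "c") and ("a", "bc") hash differently. None becomes "".
    joined = "\x1f".join("" if f is None else str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# IDs recorded by an earlier, interrupted run (hypothetical in-memory store).
processed = set()

def should_process(item_id):
    """Return False for items that were already handled."""
    if item_id in processed:
        return False
    processed.add(item_id)
    return True
```

Because the inputs are the same on every run, the same item always maps to the same ID, which is what makes the skip check possible.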

The current name isn't intuitive. Although it is not unique across different cases (it is not a UUID), changing it to globalID seems more user friendly. Any other suggestions?

@hauck-jvsh
Member

Just let me know if you change this name, as it is used in the SARD project.

@lfcnassif lfcnassif changed the title Rename persistentID to globalID Improve and rename persistentID to globalID Dec 2, 2021
@lfcnassif
Member Author

I just saw that we are not using the datasource UUID in the computation. If we include it, this ID would really work like a UUID and would be unique across cases, which would be great. Renaming it to globalID would then make total sense.

@lfcnassif lfcnassif self-assigned this Dec 2, 2021
@lfcnassif
Member Author

lfcnassif commented Dec 2, 2021

parentContainerPersistentID

Just an explanation of why this is used. We can have 2 different files with the same path (one allocated and the other deleted):
/root/a/b.zip/c/d.txt
/root/a/b.zip/c/d.txt

The two d.txt files have no idInDatasource (they are subitems of a zip, not items listed in the datasource), and their subitemId could be equal if they were extracted from 2 different b.zip files (one allocated and the other deleted). So the b.zip globalID (which uses the b.zip idInDatasource) must be used in the computation of the d.txt globalID.
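
The collision described above can be reproduced with a small sketch. The hashing scheme, the function name, and the numeric IDs below are hypothetical, chosen only to show the effect of including the container's ID.

```python
import hashlib

def item_id(path, id_in_data_source=None, subitem_id=None, container_id=None):
    """Hash the concatenated identifying fields (illustrative scheme)."""
    fields = [path, id_in_data_source, subitem_id, container_id]
    joined = "\x1f".join("" if f is None else str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two b.zip files at the same path, told apart by their idInDataSource.
zip_allocated = item_id("/root/a/b.zip", id_in_data_source=101)
zip_deleted = item_id("/root/a/b.zip", id_in_data_source=202)

d_path = "/root/a/b.zip/c/d.txt"

# Without the container ID, both d.txt subitems hash to the same value.
without_container = [item_id(d_path, subitem_id=0) for _ in range(2)]

# With the container ID, each d.txt inherits the uniqueness of its b.zip.
with_container = [
    item_id(d_path, subitem_id=0, container_id=z)
    for z in (zip_allocated, zip_deleted)
]
```

Here `without_container` holds two identical IDs, while `with_container` holds two distinct ones.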

@lfcnassif
Member Author

lfcnassif commented Jan 20, 2022

I just saw that we are not using the datasource UUID in the computation. If we include it, this ID would really work like a UUID and would be unique across cases, which would be great.

Thinking more about this, do we need a "UUID" for items in IPED? Would it be useful in multicases? This change would make #918 very difficult: I doubt users will know or remember that they need to specify the evidence UUID when re-processing cases to import old bookmarks later. Maybe the change to include the evidence UUID in the globalID computation could be made only in ElasticSearchTask. What do you think, @hauck-jvsh?

@hauck-jvsh
Member

I think it could be used only in the elastic ID. I think you could also keep the persistentID in elastic, just not as the _id field, which must be unique.

@lfcnassif
Member Author

lfcnassif commented Jan 20, 2022

I think you could also keep the persistentID in elastic, just not as the _id field, which must be unique.

This was done to make the --continue option work when resuming processing to ElasticSearch, instead of having to delete the remote index and start processing from the beginning again. I think including the evidence UUID in the persistentID/globalID computation should be enough to avoid _id conflicts between elastic cases.
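
A rough sketch of the idea, assuming the evidence UUID is mixed in only when deriving the Elastic `_id`. The function name, the hash, and the sample values are hypothetical and do not reflect how ElasticSearchTask is implemented.

```python
import hashlib
import uuid

def elastic_doc_id(evidence_uuid, persistent_id):
    """Derive a document _id from the evidence UUID plus the per-item ID."""
    joined = f"{evidence_uuid}\x1f{persistent_id}"
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two evidences from different cases, each with its own UUID.
evidence_a = str(uuid.uuid4())
evidence_b = str(uuid.uuid4())

# The same per-item ID may repeat across cases (made-up value).
repeated_item_id = "5d41402abc4b2a76b9719d911017c592"

id_in_case_a = elastic_doc_id(evidence_a, repeated_item_id)
id_in_case_b = elastic_doc_id(evidence_b, repeated_item_id)
```

The derived `_id` stays deterministic for a given evidence, so resuming with --continue can still recognize documents indexed earlier, while the same item ID in another evidence maps to a different `_id`.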

@lfcnassif
Member Author

lfcnassif commented Jan 20, 2022

@hauck-jvsh which field are you using to store bookmarks in elastic, _id?

edit: I mean, to correlate bookmarks to items?

@hauck-jvsh
Member

hauck-jvsh commented Jan 20, 2022

Currently I'm using the _id just to find the item, and then I set a new metadata field with the bookmark on the item.

@lfcnassif
Member Author

@hauck-jvsh, I changed the attribute names persistentId->globalId, parentPersistentId->parentGlobalId, parentContainerPersistentId->containerGlobalId, and also ElasticSearchTask contentPersistentId -> contentGlobalId to follow the new naming convention.

@hauck-jvsh
Member

After that commit, an error is occurring when processing cases; see the attached log file.
IPED-2022-01-21-10-25-00.log

@hauck-jvsh hauck-jvsh reopened this Jan 21, 2022
@lfcnassif
Member Author

Thanks @hauck-jvsh, I'll take a look.

Actually I'm still not convinced about the new globalID attribute name, since it could repeat across cases without including the evidenceUUID in the computation. As you said, we can create a real UUID for items for possible future use in a new attribute (maybe using the globalID name), I like this idea.

But about renaming persistentID, I thought of more options: fixedID, constantID, constID. What do you think? @tc-wleite, do you have any suggestions?

@wladimirleite
Member

After that commit an error is occurring when processing cases, see the log file attached. IPED-2022-01-21-10-25-00.log

Processing an E01 image worked, but when I tried to process a folder, I got a similar exception here.

@wladimirleite
Member

But about renaming persistentID, I thought of more options: fixedID, constantID, constID. What do you think? @tc-wleite, do you have any suggestions?

I was following the discussions around this issue, but I am not sure what would be the best option.

@hauck-jvsh
Member

hauck-jvsh commented Jan 25, 2022

There is also the multivalued parentIds property (different from parentId) with all item parents, used to allow fast filtering on file tree

I also use it to allow filtering using the file tree in the web interface.

@lfcnassif
Member Author

You have tested an implementation with just parentId, right? Did it have a noticeable performance impact?

@hauck-jvsh
Member

You have tested an implementation with just parentId, right? Did it have a noticeable performance impact?

I couldn't make the searches, because I have to filter for items that have the ids of the selected items in their parentIds.

@lfcnassif
Member Author

lfcnassif commented Jan 25, 2022

I couldn't make the searches, because I have to filter for items that have the ids of the selected items in their parentIds.

I see, this would need some recursive search. Elastic possibly doesn't support that, but we could try to implement it inside IPED...
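
To illustrate the trade-off, here is a small sketch contrasting the two approaches on in-memory data. The item layout and function names are assumptions for illustration; a real implementation would run against the index rather than a Python list.

```python
from collections import defaultdict

# Hypothetical items: parentId is the direct parent, parentIds lists all ancestors.
items = [
    {"id": 1, "parentId": None, "parentIds": []},
    {"id": 2, "parentId": 1, "parentIds": [1]},
    {"id": 3, "parentId": 2, "parentIds": [1, 2]},
    {"id": 4, "parentId": None, "parentIds": []},
]

def descendants_by_parent_ids(items, selected):
    """Single filter: keep items that list the selected id among their ancestors."""
    return {it["id"] for it in items if selected in it["parentIds"]}

def descendants_by_parent_id(items, selected):
    """Iterative tree walk that uses only the direct parentId of each item."""
    children = defaultdict(list)
    for it in items:
        children[it["parentId"]].append(it["id"])
    found, frontier = set(), [selected]
    while frontier:
        current = frontier.pop()
        for child in children[current]:
            found.add(child)
            frontier.append(child)
    return found
```

With the multivalued parentIds property the subtree filter is a single term match, whereas with only parentId the search has to expand one level at a time.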

lfcnassif added a commit that referenced this issue Jan 26, 2022
- this property is needed when resuming processing to get a previous
parent id referenced by subitems whose parents were not committed; then,
when reprocessing parents, their id can be updated to the previous
value, so parent-child relationships will be preserved.
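
A hedged sketch of the mechanism this commit message describes. The dictionary layout and the field names (`parentTrackId`, `trackId`) are hypothetical and are not taken from the commit itself.

```python
def restore_parent_ids(committed_children, reprocessed_parents):
    """Give reprocessed parents the id that committed subitems still reference."""
    # Map each parent's stable ID to the numeric id it had in the previous run.
    previous = {c["parentTrackId"]: c["parentId"] for c in committed_children}
    for parent in reprocessed_parents:
        if parent["trackId"] in previous:
            # Overwrite the freshly assigned id with the one children point to.
            parent["id"] = previous[parent["trackId"]]
    return reprocessed_parents

# A child committed in the interrupted run still points at parent id 7.
children = [{"id": 8, "parentId": 7, "parentTrackId": "abc"}]

# On resume the parent was reprocessed and got a new id (55).
parents = [{"id": 55, "trackId": "abc"}, {"id": 56, "trackId": "xyz"}]

restored = restore_parent_ids(children, parents)
```

After the call, the first parent carries id 7 again, so the committed child's parentId remains valid, and the second parent keeps its new id because no committed subitem refers to it.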
lfcnassif added a commit that referenced this issue Jan 26, 2022
- fix references to parentGlobalID in subitems of embedded disks
@lfcnassif lfcnassif changed the title Improve and rename persistentID to globalID Improve and rename persistentID to trackID Jan 29, 2022