Restructuring the model #203
I agree.
Implemented at #204. To be discussed. Open questions and my preference:
I'm in favor of this change. It simplifies things. Personally I need to get used to the term mediaID, but if you think of sequences as frames that could just as well have been a video, it's easy to understand.
I also find this intuitive in terms of the class layout and relationships.
Out of curiosity - are images ever manipulated, e.g. cropping out a section and creating a new image? If so, then I think it might be useful to add a
@timrobertson100 not necessarily to create a new physical medium. But subsections (bounding boxes) of images are quite common, used e.g. by AI to indicate where in the image it noticed an animal. That info can currently not be captured in Camtrap DP v1 (needs more thought). Options are to represent those as sub-images (with a
I agree.
Mainly for reasons of keeping things intuitive, and to avoid mixing concepts, I'd favor a type (or similarly named) field. By mixing concepts, I mean that capture is related to what happened in the field to "trigger" the media existing, fileMediaType is about the encoding of the binary stream, and sequence is really just a grouping of items largely for data management purposes (i.e. allowing you to refer to a grouping of items in an annotation). Those seem like separate concerns to me which warrant their own field. Aside: this model implies media would only ever exist in a single sequence unless you duplicate media records with e.g. the same filename (meaning observations are based on an image in a particular sequence and not on the image itself). I don't know enough to comment if that is appropriate.
As far as the second remark is concerned: that looks right to me. An image only exists in one single sequence; however, the same image can be the source for two different observations. For me it is still a little bit confusing that, if I get it right, in the new data model the media.csv contains some records referring to single images and other records referring to sequences that contain images listed in that same media.csv table. It looks to me like two different levels of information are contained within the same table - not being a data scientist, this is the first time I encounter this kind of mixed-levels table in a data model :-)
It is a common modeling pattern to include multiple subtypes of an entity within a single table and to distinguish them with a type field, to avoid having to create additional tables or hierarchical structures. Here that pattern seems well justified. Another part of that pattern is to name the type field based on the table it is in and the concept it represents, so that it can stand alone without context in a data dictionary (a glossary of terms). Based on these practices, I would recommend the term be adopted and that it be called "mediaType".
This may be a bit overly cautious, but I'd opt for acquisitionType instead of mediaType to avoid confusion/overlap with the common use of mediaType as a reference to the MIME Media Types.
@ben-norton I think this probably arises from the media table serving multiple roles for the sake of simplification. I agree that the mediaType should be limited to media types - digital results. I think that still needs to be there. To me the acquisitionType is a statement about the event (something not explicitly modeled by the Camtrap DP structure) that generated the result. In a model that expresses this activity explicitly, I would indeed include something to specify that. In the GBIF publishing model we're doing in parallel, that would be an eventType. |
I'm not sure mixing (sub)types is that common. To me it is the biggest icky factor in an otherwise elegant proposal (cf. comments by @jimcasaer @timrobertson100). I'd therefore like to suggest an approach that deviates less from the current situation. For clarity, I'm also naming the proposals:
Suggested change 2 (a less drastic update to the current situation)
Image-based observations
Sequence-based observations
@peterdesmet I understand what you are trying to do, and even why. It only makes me cringe from a database modeling perspective, where in SQL databases one tries to achieve the highest reasonable Normal Form (https://en.wikipedia.org/wiki/Database_normalization#Normal_forms) to protect against redesign problems with changes that might come in the future. In Suggested Change 2 you are treating sequences as properties (albeit properties of two distinct entities), not as identifiers of an entity to use in the role of a key. The reason you can "get away with that" is that sequences have no non-identifying properties. So the thing that worries me (the "cringe factor") is that you are painting yourself into a corner. If you ever do add non-identifying properties to sequences in the future, you will have to repeat that information in media.csv or observations.csv or both, or add a sequence.csv with relationships to media and observations, and thereby change the structure in a way that will break existing implementations. Suggested change 1 doesn't overcome future-proofing sequences either, by the way; it treats them as one of the types of media with no properties of their own. For demonstration only, a model that would future-proof sequences (and be in 5th normal form - 5NF) would be something like the following:
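The 5NF model itself was not captured in this thread; purely as an illustration of the idea (table and column names are my own assumptions, not the author's), such a layout would give sequences their own table with room for non-identifying properties, linked to media via a many-to-many join table:

```python
# Hypothetical sketch of a 5NF-style layout for camera trap data.
# Table and column names are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE deployments (deploymentID TEXT PRIMARY KEY);
CREATE TABLE media (
    mediaID TEXT PRIMARY KEY,
    deploymentID TEXT REFERENCES deployments(deploymentID),
    filePath TEXT,
    timestamp TEXT
);
CREATE TABLE sequences (
    sequenceID TEXT PRIMARY KEY,
    sequenceInterval INTEGER   -- room for future non-identifying properties
);
CREATE TABLE sequence_media (  -- many-to-many join table
    sequenceID TEXT REFERENCES sequences(sequenceID),
    mediaID TEXT REFERENCES media(mediaID),
    PRIMARY KEY (sequenceID, mediaID)
);
""")
con.execute("INSERT INTO deployments VALUES ('dep1')")
con.executemany("INSERT INTO media VALUES (?, 'dep1', ?, ?)",
                [("m1", "img1.jpg", "2020-01-01T10:00:00"),
                 ("m2", "img2.jpg", "2020-01-01T10:00:05")])
con.execute("INSERT INTO sequences VALUES ('seq1', 120)")
con.executemany("INSERT INTO sequence_media VALUES ('seq1', ?)",
                [("m1",), ("m2",)])

# Find all images belonging to a sequence via the join table:
rows = con.execute(
    "SELECT m.filePath FROM media m "
    "JOIN sequence_media sm ON m.mediaID = sm.mediaID "
    "WHERE sm.sequenceID = 'seq1' ORDER BY m.timestamp").fetchall()
```

Adding a property to sequences in the future would then only touch the sequences table, without repeating data in media or observations.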
Commenting here as a relative outsider to the project. Overall I think this goes in the right direction: deployments create media, media lead to observations. In my opinion sequences are an artificial add-on without any real benefits, but I never used them myself and also don't really know how sequences are meant to be used in this standard, so I may be missing important points.

Conceptual concerns
Practical concerns
I see three possible cases (with their data relationships):
- A: easiest option. No sequences needed at all. If for some compatibility reason it is necessary to always have a sequence table, each media item can be considered a separate sequence and the data structure would be identical to B (it would be redundant and a bit silly though).
- B: can be created automatically from image-based annotation in A using
- C: is this even necessary (can media.csv be missing)? Maybe relevant for old data sets? The only real difference is:

Would it be possible to set a flag in the project metadata as to which case it is (and thus, which key to use)? Scope for automation?
Videos

The points above are for images only. Video support in this scheme may lead to additional complications:
Suggestion

I suggest having a look at the database structure of digiKam for inspiration. I find it very clear, logical and extensible, but different from the current Camtrap DP scheme. If you have digiKam installed you can open its database in R with:
In short, it contains 5 items:
This is the content of each of these items as used by digiKam (not all of which would be needed for camera trapping data):

- AlbumRoots
- Albums
- Images
- Tags
- ImageTags

This scheme can be expanded nicely, e.g. with a separate table for sequences (which assigns sequences to the file ids in the "Images" table - can maybe be created automatically as mentioned above). This would allow easy gathering of image tags (species IDs etc) and image information (timestamps etc) for sequences.

Future proofing for deep learning

It would also allow easy linking to AI / deep learning methods, e.g. with a separate table containing bounding box coordinates for object detection. This would work both for model training and model deployment, and can maybe be based on the COCO camera traps format. Then there can be another table containing the labels and confidence values for these bounding boxes. For model training this second table only needs one label; for predictions it can either contain the top label and probability only, or the top k labels, or all labels with their probabilities. Also, all the deep learning methods for image classification / object detection that I'm aware of use images, not sequences. Sequences can actually be harmful in this respect, especially for image classification (when the animal walked out of the frame during the sequence, but the entire sequence is labelled as a species).

* EDIT: the COCO camera trap format allows both image and sequence-specific bounding boxes, which may not be precise at image-level (see link above). I find the statement that 'sequences are the "atom of interest" in most ecological applications' questionable though. Video annotation at the file level should be no different than image annotation. I don't know how to annotate at the frame level.
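The core of the digiKam layout (images linked to tags via a join table) can be sketched with an in-memory SQLite database. This is a simplified illustration with invented column names, not digiKam's actual schema:

```python
# Simplified, hypothetical sketch of a digiKam-like layout:
# tags (species IDs, etc.) link to images through an ImageTags join table,
# so all annotations for an image are gathered with a single join.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Albums (id INTEGER PRIMARY KEY, relativePath TEXT);
CREATE TABLE Images (id INTEGER PRIMARY KEY, album INTEGER, name TEXT);
CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ImageTags (imageid INTEGER, tagid INTEGER);
""")
con.execute("INSERT INTO Albums VALUES (1, '/deployment1')")
con.execute("INSERT INTO Images VALUES (1, 1, 'IMG_0001.JPG')")
con.executemany("INSERT INTO Tags VALUES (?, ?)",
                [(1, "Sus scrofa"), (2, "sequence:seq1")])
con.executemany("INSERT INTO ImageTags VALUES (1, ?)", [(1,), (2,)])

# All tags for one image, via the join table:
tags = [t for (t,) in con.execute(
    "SELECT Tags.name FROM Tags "
    "JOIN ImageTags ON Tags.id = ImageTags.tagid "
    "WHERE ImageTags.imageid = 1 ORDER BY Tags.id")]
```

A sequences table (or bounding box table) would slot in the same way: another table keyed on the image id in Images.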
Thanks @tucotuco and @jniedballa! I had some time to digest this information and discussed it with @damianooldoni. We think the following suggestion would be a model that answers the issues. It will not solve - but can represent - the fact that some systems make observations at the level of "sequences/groups of images" (which restricts creating smaller events at the analysis stage).

Suggested change 3: 4th table, between observations and media
Example:
@jniedballa
@peterdesmet as a relative outsider I like the look of this new "suggested change 3" better than previous ones. It seems correct to me that sequences are not considered media files. Your bounding box example is clear; I can see that the format also allows for an observation which is based on a bounding box that moves/changes shape over the duration of a sequence (this is one "tricky case" we discuss sometimes). But then would the
Yes indeed. It doesn’t necessarily need to be there.
Alternative name for
@danstowell could we consider that 4th table a "region of interest" (Section 7.11 of https://ac.tdwg.org/termlist/)?
Could a region of interest also be larger than a single image file?
FWIW I'm OK with
Hi all and sorry for this late feedback! Great discussion! I have spent some time in recent days thinking about the last proposal and had a meeting with @peterdesmet this morning. Here is the outcome; below you will find two new proposals that (hopefully) still add something to our discussion:

Suggested change 4: 4 tables (similar to Suggested change 3 with some modifications)
Suggested change 5: 3 tables (similar to the original model with some modifications; developed interactively during the meeting with Peter)

Sequence-based example
File-based example
@peterdesmet Please edit this comment if you find that I have missed sth (or if sth is not clear enough)! Best,
Thanks @kbubnicki, great summary of our discussion. I just want to add that in suggestion 4 the number of records in
I’m all in favour of suggestion 5. Feedback welcome, especially from those that commented already @tucotuco @danstowell @jniedballa … |
I'm not so excited by the idea of moving the bboxes into the

A workaround would be to repeat multiple rows in

I can't comment on the file-size implications. You write "Think about two-stage observation process" (detect, then identify) but to me that doesn't motivate the change.

A separate and minor comment: I suggest that the arrays-of-bboxes format might be a bit troublesome for data consumers - it's starting to look like structured data inside a CSV cell.
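To illustrate the "structured data inside a CSV cell" concern: a hypothetical array-of-bboxes value (the serialization format and field names below are invented, not from the spec) forces consumers to run a second parser inside the CSV parser:

```python
# Hypothetical: an array of bounding boxes stored as a JSON string in one
# CSV cell. Field names and format are invented for illustration.
import csv
import io
import json

raw = ('observationID,bboxes\n'
       'obs1,"[[0.1, 0.2, 0.5, 0.4], [0.6, 0.1, 0.3, 0.3]]"\n')

rows = list(csv.DictReader(io.StringIO(raw)))

# The consumer needs a second parser (JSON) on top of the CSV parser:
bboxes = json.loads(rows[0]["bboxes"])
```

Every consumer must know and agree on that inner format, which is exactly the kind of coupling plain one-value-per-cell CSV avoids.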
I don't have a lot of time to comment in detail (i.e., offer alternative solutions) right now.
Just had a chat with @peterdesmet about my most recent comments. If it will be a rule that data sets must be either of observations from media or observations from mediagroups, but never both, then my second concern doesn't really apply. Similarly, if data sets are never mixed, then the mediaID could act as a mediaGroupID for the sake of practicality (not having to mint another identifier). I cringe in terms of semantics (it was rejected that mediagroups were just a type of media), but that shouldn't matter until/unless these data start to be linked semantically. |
I think the stipulation that a dataset is either sequence-based (observation - mediaGroup) or image-based (observation - media) is a fair stipulation that solves a number of problems. Since most datasets don't utilize multiple observation techniques (e.g., expert identification and computer vision model), adoption shouldn't be overly problematic for most providers. Several projects arrived at this same conclusion (after months of debate). To my knowledge, field testing this solution hasn't resulted in any significant problems. |
@danstowell That's why we have this field in Camtrap DP: https://tdwg.github.io/camtrap-dp/data/#observations.countnew We use this field when annotating our camera trap records to track information about the "real" group size of animals staying for a while in front of a camera trap (or just passing it by). This applies to image-level annotation and prevents over-counting when aggregating data for analysis.
Hi all, I picked up this dormant issue with John Wieczorek (@tucotuco) in an effort to reach a recommendation. We mainly discussed the pros and cons of two of the main proposals suggested above:
I also compared how one would query data using either model, at https://github.com/peterdesmet/camtrap-dp-query-test (repository likely to be deleted at some point).

Recommendation

Our conclusion is that the mediaGroupID approach (Suggested change 5):
And thus a reasonable simplification of the model. It is an improvement over the current model (where information is needlessly repeated) and plays well with the unified common model. It allows expressing bounding boxes (at the level of observations). If I read the comments above, this proposal is something that @kbubnicki @ben-norton @jniedballa and now @tucotuco could get on board with. I will create a pull request with the suggested changes. Thank you all for your patience and for participating in this discussion!

@danstowell you liked the possibilities of the 4th table approach - maybe especially as a model for Audubon Core - but for Camtrap DP we believe it would needlessly complicate things as an exchange format. Hope you understand.

Rename to eventID

One change we suggest is to rename

Image-based (if we reuse identifiers):
Sequence-based:
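The example tables for the rename were not captured here; as a sketch of the idea (hypothetical rows, my own column values), an eventID lets consumers aggregate observations the same way regardless of whether they were annotated per image or per sequence:

```python
# Hypothetical sketch: observations carry an eventID, so grouping works
# identically for image-based and sequence-based annotation.
import pandas as pd

observations = pd.DataFrame({
    "observationID": ["obs1", "obs2", "obs3"],
    "eventID": ["ev1", "ev1", "ev2"],   # obs1 + obs2 belong to one event
    "scientificName": ["Sus scrofa", "Sus scrofa", None],
})

# Aggregate per event without knowing the annotation granularity:
per_event = observations.groupby("eventID")["observationID"].count()
```

The same groupby would work unchanged on a sequence-based dataset where each event has exactly one observation row.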
Quick update: we are still working on restructuring the model. The current approach is to abandon trying to capture image vs event-based annotation in a single

The main advantage is clarity: easier for the user to understand and easier for us to document. Additionally, it allows exporting both approaches in a single package, e.g. AI image-level observations that underpin event-level consensus observations. We are currently testing this approach and hammering out the details.
The suggested change (splitting the observation table) has been implemented in #289. All who participated here are welcome to review the changes. |
Fixed in Camtrap DP 0.6 #297. |
Congrats. That's a very challenging task. |
In a discussion with @tucotuco on how to better align Camtrap DP with a common model for biodiversity data, a proposal came up on how to better structure sequences in Camtrap DP.
Preamble
For the purpose of this discussion, I want to clarify what we mean by a sequence here:
- A sequence is defined by a `sequence interval`: "Maximum number of seconds between timestamps of successive media files to be considered part of a single sequence". As a result, a sequence can contain multiple triggers/bursts.
- The `sequence interval` is not a camera setting, but one set by the programme used to manage the images afterwards.
- Sequence-based observations thus depend on the `sequence interval` settings that were chosen. With image-based observations you can choose yourself how to group images together in logical events based on their timestamp.

This proposal is not about whether image-based observations are better than sequence-based observations. The current situation is that both approaches exist (and likely will for a while) and Camtrap DP wants to support both.
The examples show how the data would look for 3 images, using image-based vs sequence-based observations. In the first 2 images a wild boar (Sus scrofa) can be seen.
Current situation
- Sequences are expressed as a property (`sequenceID`), in both media and observations.
- Observations have a `sequenceID` and `mediaID`, which are both foreign keys to the media table. Image-based observations need to populate both, sequence-based observations only `sequenceID`. As a result, joins between observations and media are conditional: you kind of need to know what key to use to make a join that will yield results. That is not great.
- Observations also include `deploymentID` and `timestamp`, so that they can be easily joined with deployments - without having to go over media - to get useful biological data (location, time, species).

Image-based observations
Sequence-based observations
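The conditional join described under the current situation can be sketched with pandas (rows and values below are hypothetical, just to show the shape of the problem): which key reaches the media table depends on the kind of observation, so a consumer must branch before joining.

```python
# Hypothetical rows mimicking the current model, where sequenceID and
# mediaID are both foreign keys into the media table.
import pandas as pd

media = pd.DataFrame({
    "mediaID": ["m1", "m2", "m3"],
    "sequenceID": ["seq1", "seq1", "seq1"],
    "filePath": ["img1.jpg", "img2.jpg", "img3.jpg"],
})
observations = pd.DataFrame({
    "observationID": ["obs1", "obs2"],
    "sequenceID": ["seq1", "seq1"],
    "mediaID": ["m1", None],   # obs2 is sequence-based: no mediaID
})

def media_for(obs):
    """The consumer must inspect the record to know which key joins to media."""
    if obs["mediaID"] is not None:                          # image-based
        return media[media["mediaID"] == obs["mediaID"]]
    return media[media["sequenceID"] == obs["sequenceID"]]  # sequence-based

image_based = media_for(observations.iloc[0])      # one image
sequence_based = media_for(observations.iloc[1])   # whole sequence
```

The branch in `media_for` is the "conditional join": it cannot be expressed as one plain join over the whole dataset.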
Suggested change 1
In media.csv

- Sequences are included as rows in the media table. Images get a `parentMediaID` to associate them with sequences. That allows joins to find the images that belong to a sequence.
- `filePath` and `fileMediaType` become optional fields. They are typically not populated for sequence rows.

In observations.csv

- Observations reference a single media row via `mediaID`. That media row can be a single image (image-based observations) or a sequence. This is a huge benefit, as it no longer requires conditional joins.

Most importantly, we think this model better represents the actual situation with camera traps: deployments → generate media → generate observations
Image-based observations
Sequence-based observations
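Under suggested change 1 the join becomes unconditional; a sketch with hypothetical rows (my own values, for illustration only):

```python
# Hypothetical rows mimicking suggested change 1: sequences are media rows,
# images point to them via parentMediaID, and observations always use mediaID.
import pandas as pd

media = pd.DataFrame({
    "mediaID": ["seq1", "m1", "m2"],
    "parentMediaID": [None, "seq1", "seq1"],
    "filePath": [None, "img1.jpg", "img2.jpg"],  # optional for sequence rows
})
observations = pd.DataFrame({
    "observationID": ["obs1", "obs2"],
    "mediaID": ["m1", "seq1"],  # image-based and sequence-based use one key
})

# One unconditional join serves both kinds of observations:
joined = observations.merge(media, on="mediaID")

# The images belonging to a sequence are found via parentMediaID:
seq_images = media[media["parentMediaID"] == "seq1"]
```

Compared to the current model, no branching is needed: every observation resolves through the same `mediaID` key.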
Suggested change 2 (a less drastic update to the current situation)
This was suggested in #203 (comment). Comments above that are about suggested change 1 only.