Add slide text and segments to Tobira harvest API #5757
.filter(mpe -> {
    final var flavor = mpe.getFlavor();
    final var isCatalog = mpe.getElementType() == MediaPackageElement.Type.Catalog;
    final var isXml = mpe.getMimeType().eq(MimeType.mimeType("text", "xml"));
    final var isText = flavor.getSubtype().equals("text");
    return isCatalog && isXml && isText;
})
You probably want to search only for elements with the flavor `mpeg-7/text`.
Oh right, that is much simpler and more accurate. Thank you!
The slide texts are to be added to Tobira's search index, and to make that possible they need to be harvested. This adds the generated OCR results to the `Item` class used in the Tobira module.
This adds a function that collects the generated slide segments and adds a corresponding timestamp to each. Tobira needs this to supply a frame list to the Paella slide plugin.
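To illustrate the idea of attaching a timestamp to each slide segment, here is a minimal sketch. It is not the code from this PR: the `SlideFrames` class, the `Frame` type, and the `pairFrames` helper are hypothetical names, and it simply assumes that the preview image URIs and the segment start times are already available as two lists in the same order.

```java
import java.util.ArrayList;
import java.util.List;

public class SlideFrames {
  /** Hypothetical value type: one slide preview plus the time at which it starts. */
  public static final class Frame {
    public final String uri;
    public final long startMs;

    public Frame(String uri, long startMs) {
      this.uri = uri;
      this.startMs = startMs;
    }
  }

  /**
   * Pairs each preview URI with the start time at the same index.
   * Assumption: both lists are ordered by segment. Surplus entries in the
   * longer list are ignored rather than guessed at.
   */
  public static List<Frame> pairFrames(List<String> previewUris, List<Long> startTimesMs) {
    final int n = Math.min(previewUris.size(), startTimesMs.size());
    final List<Frame> frames = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      frames.add(new Frame(previewUris.get(i), startTimesMs.get(i)));
    }
    return frames;
  }
}
```

A consumer such as a slide plugin could then look up the frame whose `startMs` is the largest value not exceeding the current playback position.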
Tested it locally and it works! I only have one note, but that's not too important. I think this can be merged.
final var slideText = Arrays.stream(mp.getElements())
    .filter(mpe -> mpe.getFlavor().eq("mpeg-7/text"))
    .map(element -> element.getURI())
    .findFirst();
I wonder... what if there are multiple of these elements? You just take the first here. Can anyone say whether in the real world, there might be multiple such elements?
But this isn't a blocker. I also just trust Matthias with the `mpeg-7/text` filter. I don't have enough experience to say whether that filters for less or more than what we want. 🤷
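If multiple `mpeg-7/text` elements do turn out to occur in practice, one option would be to collect all matching URIs instead of taking only the first. The following is a rough sketch of that idea, not a proposal for the actual patch: `Element` is a hypothetical stand-in for the media package element type (so the example runs without Opencast on the classpath), and `slideTextUris` is a made-up helper name.

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

public class SlideTextLookup {
  /** Hypothetical stand-in for a media package element; not the Opencast class. */
  public static final class Element {
    public final String flavor;
    public final URI uri;

    public Element(String flavor, URI uri) {
      this.flavor = flavor;
      this.uri = uri;
    }
  }

  /** Returns the URIs of all elements with flavor "mpeg-7/text", in input order. */
  public static List<URI> slideTextUris(List<Element> elements) {
    return elements.stream()
        .filter(e -> "mpeg-7/text".equals(e.flavor))
        .map(e -> e.uri)
        .collect(Collectors.toList());
  }
}
```

Whether the harvest API should then expose all of them or pick one by some rule is a separate decision. The sketch only shows that collecting them is not much more code than `findFirst()`.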
This seems to work for me. I have not checked the Tobira/UI side of things, but the server-side stuff shows up just fine.
…ella slide previews (#1163) This adds the OCR'd slide texts as well as a list of timestamped frames to the harvesting sync code and stores them in the DB. In order to show the slide previews, `paella-slide-plugins` was added and configured to use the timestamped frames. Needs opencast/opencast#5757 to work. Once that is merged, released and used on our test Opencast, the changes can be tested with fresh uploads. We'll still need some mechanism to apply segmentation and OCR (and speech-to-text as well) to existing videos. (Can be reviewed commit by commit, though note that the migration from the second commit was extended in the third.)
See commits. With this PR, the generated slide text is exposed so we can use that in Tobira's search index. I'm not super sure if the filter I built for this could let through false positives or negatives, though in testing this didn't seem to be the case. Please let me know if you think this might be an issue, and/or have any suggestions how this could be solidified.
Furthermore, the slide segments are passed along with their respective start times so that they can be used for Paella's slide plugin on the Tobira side.
Related Tobira issues: elan-ev/tobira#368, elan-ev/tobira#1065
Your pull request should…