crowdsourcing Metadata #2455

synctext · 2016-07-10T19:55:27Z

Allow any user to improve the metadata. Examples of existing approaches:

Time slicing is too heavy for Tribler, out of scope. Just metadata.

lfdversluis · 2016-07-10T20:30:31Z

A wiki-like approach? What would be editable within and outside Tribler?

synctext · 2018-01-19T12:53:58Z

Refocus this issue on torrent channels based on magnet links with rich metadata. Thesis goal is to deploy creation and search of rich metadata. A more ambitious goal would be to integrate voting to obtain trustworthy metadata.

Each channel can only be modified by the channel owner. Users vote on the quality of channels and quality emerges from the Tribler user collective. It is easy to copy an entire channel, re-use quality content from a channel, and re-mix channels. This leads indirectly to crowdsourcing. We specifically avoid the problems of edit wars, collaborative editing, and undoing changes. This leads to a realistic master project. Only with an active community in place can we move to the next stage and experiment with more sophisticated crowdsourcing models.

The torrent channel is expanded with a new type: rich metadata channels. Channel owners indicate the content type of each magnet link. We still keep it simple: 1 magnet link has 1 rich metadata description. First step is to create a simple editor inside Tribler. It supports several content types, combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc.:

Next step is then creating a new rich metadata search community.

This master thesis puts the foundations in place for rich metadata. In the future we want to integrate collaborative tools. For instance, we assume scientific papers are available in Tribler and we can semi-automatically create survey papers. Still out of scope:

synctext · 2018-02-12T09:37:34Z

Demonstrates the content type idea (radio button or drop down):

Combining enriching of music, podcast, movies, vloggers, scientific articles, etc
A lot of metadata frameworks exist; you could select this very old one. Keep the data model very simple, never more than 4 fields per content type. We see how users like it, then expand based on real-world feedback. Please don't try to get it right the first time!

svanschooten · 2018-02-19T09:06:05Z

@devos50 Do you have time to walk through the stack this afternoon? I have a doctor's appointment in an hour, so I'll be at the lab after lunch. I've seen most of the code, and looked into Qt last week, but it is still a bit fuzzy.

synctext · 2018-02-19T10:53:11Z

He is on vacation this week. But others should be able to help.

svanschooten · 2018-02-19T15:20:10Z

First step: adding a field containing a single metadata entry (content type in this case, based on the YouTube VideoCategories API), and starting to structure the metadata information.
Started implementing a Metadata community, still a bit fuzzy on what is needed there.

svanschooten · 2018-02-20T16:26:34Z

Got my community running: peers can exchange messages (directed and broadcast). Next step is to define the behavior of the community and the message types that can be sent.

@synctext should I look at a more branching metadata structure such as here (inspired by YouTube, The Pirate Bay and Dublin Core)?
The other option would be to flatten the structure and create a more generic metadata structure that does not follow the classic and modern frameworks. What would you advise?

Next step would be to structure the database and messages sent.

devos50 · 2018-02-25T15:11:20Z

@svanschooten I would advise to keep it as simple as possible for now and not go wild with many different metadata types/complicated structures yet.

Also, nice to see that you have a basic community up and running!

svanschooten · 2018-02-26T13:44:23Z

@devos50 welcome back! I agree, which is why I re-researched the desired structure. When I have something solid I'll implement a data structure that I can store in the database (and also start implementing the distribution mechanics).

Due to a family crisis I have not been able to come to the lab Friday and today, but I have done some reading and thinking on the categorization issue: most content management systems use a tree based structure to define archetypes, subtypes and properties, though I have come across some interesting work. Twitter has published a content categorization method that looks interesting, though it is not directly applicable to our case.

Based on these articles and papers I have opted to design a more 'flat' category structure, which I have documented on my repository.

synctext · 2018-02-27T17:35:35Z

#1150 is about to start soon. First finish this quick MAX 4 week prototype, then think about how to build on top of scalable channels, when they're hopefully ready!

Write rich metadata on Trustchain? (e.g. barter records, voting for channels, trading honesty, and metadata enrichment). Then we have 4 contexts of reputation to merge somewhat. Next step is to remove all non-blockchain data sync mechanisms in Tribler... Remove all storage in Dispersy #2778, all communities, and replace it with IPv8-based Trustchain storage.

Keep it simple-and-get-it-running-first-you-stupid model: only channel owner can do metadata enrichment :-)

svanschooten · 2018-02-27T20:17:24Z

Fixed my packing issue, added category based payload types and added them to the community communication.
First working UI is done, now to couple the UI to the community: This includes basic parsing of datatypes based on regex (simplest method for now).
Major overhaul to the ContentType and Category models to make them easily adaptable.

The fields are dynamically added with the accessory parsing method, field name and label.
These are based on the fields defined in the Category models.
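
A minimal sketch of what such regex-based, category-driven field parsing could look like. The `Field` class, the `CATEGORIES` mapping and all field names here are hypothetical and are not taken from the actual prototype code:

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Field:
    """One editable field: internal name, UI label and parsing method (hypothetical model)."""
    name: str
    label: str
    parse: Callable[[str], object]


def parse_text(text):
    # Free text: only trim surrounding whitespace.
    return text.strip()


def parse_year(text):
    # Accept exactly four digits, optionally surrounded by whitespace.
    match = re.fullmatch(r"\s*(\d{4})\s*", text)
    if match is None:
        raise ValueError("not a four-digit year: %r" % text)
    return int(match.group(1))


# Hypothetical category model: each category lists its own fields.
CATEGORIES = {
    "music": [
        Field("title", "Title", parse_text),
        Field("artist", "Artist", parse_text),
        Field("year", "Year", parse_year),
    ],
}


def parse_form(category, raw):
    """Parse raw UI strings into typed values using the category's field list."""
    return {field.name: field.parse(raw[field.name]) for field in CATEGORIES[category]}
```

With this layout a UI could build its input widgets by iterating over `CATEGORIES[category]` and reading each field's `label`, so adding a category only means adding a list entry.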

Only problem is that the community now can't discover the other peers, so the test script for the community does not receive anything... Test script checks each second whether the peer list is non-empty, but it stays empty.
edit: stupid me used a loop in the reactor thread, derp...

svanschooten · 2018-03-02T14:09:10Z

Implemented a lot more generics today, which makes constructing metadata Categories much easier without a test for all types.
Created an endpoint in the REST interface to let the UI talk to the MetadataCommunity.
Added MetadataCommunity to the config and LaunchManyCore.
Community mini-test working with Twisted.

Next (@synctext ??): writing metadata to persistence layer, more UI screens (only on torrent add for now) or better metadata models?

svanschooten · 2018-03-04T10:35:58Z

Looking at most metadata models, they approach it from an unstructured data angle. They usually have a (semi-)fixed tree structure for fields, but no simple and straightforward approach to storing it in a relational database:
- This paper uses a generic field implementation with a mapping algorithm.
- I also found this paper using oldschool RDF.
- A patent that I cannot understand....
- RFC-ish description of how Dublin Core was designed.
- These guys developed an XML storage mechanism.
- This pretty decent explanation of how you should see and organise metadata.

I do not want to introduce more dependencies, but I think a NoSQL storage method would be easiest?
Or maybe something generic like:

MetadataTable:
- (id) ID
- (string) infohash
- (string) title
- (string) category
- (string) content type

FieldsTable:
- (id) ID
- (id) metadata ID
- (string) name
- (string) value
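
A minimal sketch of this generic two-table layout, assuming SQLite through Python's standard `sqlite3` module. The table names, column names and helper functions are hypothetical and only mirror the outline above:

```python
import sqlite3

# Hypothetical schema mirroring the MetadataTable / FieldsTable outline.
SCHEMA = """
CREATE TABLE metadata (
    id INTEGER PRIMARY KEY,
    infohash TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    category TEXT NOT NULL,
    content_type TEXT NOT NULL
);
CREATE TABLE fields (
    id INTEGER PRIMARY KEY,
    metadata_id INTEGER NOT NULL REFERENCES metadata(id),
    name TEXT NOT NULL,
    value TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open a database (in-memory by default) and create both tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_metadata(conn, infohash, title, category, content_type, fields):
    """Insert one metadata row plus one fields row per name/value pair."""
    cur = conn.execute(
        "INSERT INTO metadata (infohash, title, category, content_type) "
        "VALUES (?, ?, ?, ?)",
        (infohash, title, category, content_type),
    )
    metadata_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO fields (metadata_id, name, value) VALUES (?, ?, ?)",
        [(metadata_id, name, value) for name, value in fields.items()],
    )
    conn.commit()
    return metadata_id


def get_fields(conn, infohash):
    """Return all generic fields of one torrent as a dict."""
    rows = conn.execute(
        "SELECT f.name, f.value FROM fields f "
        "JOIN metadata m ON f.metadata_id = m.id WHERE m.infohash = ?",
        (infohash,),
    ).fetchall()
    return dict(rows)
```

Every value is stored as a string, so typed fields (years, track numbers) would have to be converted on the way in and out.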

synctext · 2018-03-05T10:19:09Z

MetadataTable:

(id) ID

(string) infohash

Why do you want to make the infohash more unique with an ID? :-)

Consider adopting distributing scientific works as your test community for your entire thesis {or something else additionally; http://bt.etree.org}. Or create a tool and test how many hours it takes to put stuff like 400k scientific journals in your rich metadata. a.k.a. Giga-Scraper idea. Next step: finish prototype and create a .pdf seeding channel.

Just make a music table, movie, clip, series, vlog, ebook, adult entertainment, other ~~images~~ table etc. Keep it simple for your 4-week prototype. Try to remove the content and subtype construct: just 1 category level. Simple like ID3, for instance: no fancy XML, NoSQL, or RDF. Just a fixed structure please, probably per content type. In 1996 Eric Kemp created ID3, the de facto framework for audio metadata. Strings are either space- or zero-padded. Unset string entries are filled using an empty string. ID3v1 is 128 bytes long. The table with fields is copied from Wikipedia:

| Field | Length | Description |
| --- | --- | --- |
| header | 3 | "TAG" |
| title | 30 | 30 characters of the title |
| artist | 30 | 30 characters of the artist name |
| album | 30 | 30 characters of the album name |
| year | 4 | A four-digit year |
| comment | 28 or 30 | The comment. |
| zero-byte | 1 | If a track number is stored, this byte contains a binary 0. |
| track | 1 | The number of the track on the album, or 0. Invalid, if previous byte is not a binary 0. |
| genre | 1 | Index in a list of genres, or 255 |

ID3v1 pre-defines a set of genres denoted by numerical codes. Keeps it trivial...
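
To show how little code such a fixed structure needs, here is a minimal sketch of a parser for the 128-byte layout in the table above, using Python's standard `struct` module. The function names and the choice of Latin-1 decoding are assumptions made for illustration:

```python
import struct

# Fixed ID3v1 layout: header, title, artist, album, year, comment, genre.
# The 30-byte comment slot may hold a 28-byte comment, a zero byte and a track.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def _text(raw):
    # Strings are space- or zero-padded: cut at the first NUL, then strip spaces.
    return raw.split(b"\x00", 1)[0].decode("latin-1").strip()


def parse_id3v1(tag):
    """Parse a 128-byte ID3v1 tag into a dict."""
    if len(tag) != ID3V1.size:
        raise ValueError("ID3v1 tag must be exactly 128 bytes")
    header, title, artist, album, year, comment, genre = ID3V1.unpack(tag)
    if header != b"TAG":
        raise ValueError("missing TAG header")
    track = None
    # A track number is only valid if the byte before it is a binary 0.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,
    }
```

In an MP3 file the tag would be the last 128 bytes, so a caller would seek to 128 bytes before the end of the file and pass that slice to `parse_id3v1`.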

Future: #3484 After this 4-week prototype is completed, explore more advanced architecture. We are prototyping using our Trustchain idea as the only storage paradigm in Tribler. It would contain: bandwidth barter transactions, voting for channels, trading of bandwidth coins #3326. Additionally, possibly rich metadata of channels; this thesis. Warning: this idea for yet another Tribler overhaul would take years to complete and get stable!

svanschooten · 2018-03-05T16:47:59Z

The underlying data model has been simplified and abstracted more, to provide generic reading and setting handles. It is completely flat now.
Also a basic database and repository implementation is done, both with an in-memory and persistent layer.

TODO:

- insert metadata into the database from the community.
- extend the code with docs.
- a UI implementation for showing the metadata has to be implemented.
- tests

svanschooten · 2018-06-08T14:07:06Z

Refined thesis subject: Searching in enriched metadata using deduplicated tag clouds.

- Language mixing is a major problem.
- Use tokenizing to reduce words to their stem form, then link them in the tag cloud.
- Use k-means clustering to find subclouds to create more rigid linking; overlap in clouds could indicate same/similar entries (partly removes duplication, word polymorphism and locality issues).
- Cloud distribution during search by receiving k-linked clusters from near neighbours.
- Fluid metadata structure, community defined data.
- Voting on better tags will create weighted clouds, making deduplication easier.
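
A rough sketch of the stemming and vote-weighting ideas above, using only the Python standard library. The suffix list is a toy English-only stand-in for a real stemmer, the k-means step is not shown, and a simple Jaccard overlap is swapped in to compare two clouds. All names are hypothetical:

```python
from collections import defaultdict

# Toy suffix rules (assumption): a real system would use a proper,
# language-aware stemmer instead.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Lowercase a tag and strip one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_cloud(votes):
    """Merge (tag, weight) votes into a weighted cloud keyed by stem."""
    cloud = defaultdict(int)
    for tag, weight in votes:
        cloud[stem(tag)] += weight
    return dict(cloud)


def overlap(cloud_a, cloud_b):
    """Jaccard similarity of the stems in two clouds (0.0 if both are empty)."""
    a, b = set(cloud_a), set(cloud_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A high overlap between the clouds of two entries would then be a hint, not proof, that they describe the same or similar content.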

devos50 · 2018-06-08T14:32:38Z

It might be helpful for you to sync with @xoriole, your ideas seem to overlap somewhat.

xoriole · 2018-06-08T15:08:42Z

@devos50 It was quite a long discussion. Lots of ideas floating. We'll see how the design materializes.

ichorid · 2021-09-28T14:03:22Z

related to #6217

synctext · 2022-05-22T06:14:52Z

Related: tagging and RSS historical analysis plus failings

synctext added the type: enhancement label Jul 10, 2016

synctext added this to the Backlog milestone Jul 10, 2016

synctext self-assigned this Jul 10, 2016

qstokkink added the long-term label Nov 10, 2017

synctext assigned svanschooten and unassigned synctext Jan 19, 2018

synctext mentioned this issue May 7, 2018

Redesign of the Search/Channels feature #3615

Closed

ichorid modified the milestones: Backlog, Next-next release Jun 12, 2020

ichorid unassigned svanschooten Jul 17, 2020

ichorid added Epic and removed long-term labels Jul 17, 2020

xoriole mentioned this issue Aug 27, 2020

Checklist for user growth #5545

Closed

drew2a added the was in next-next label Nov 4, 2020

drew2a modified the milestones: Next-next release, Backlog Nov 4, 2020

drew2a added component: channels and removed was in next-next labels Jan 15, 2021

ichorid mentioned this issue Nov 3, 2021

Vadim's testament #6481

Closed

synctext mentioned this issue Oct 27, 2022

Extended GUI to allow editing of metadata #7099

Closed

devos50 mentioned this issue Oct 27, 2022

Added edit metadata GUI elements #7112

Merged

drew2a closed this as completed in #7112 Nov 1, 2022
