DavidWBrooks/Copyvios
Copyvios

Client-side replacement for Wikipedia's Copyvio detector, limited to comparison against specific URLs

Build with Visual Studio 2017 or later. The executable runs on Windows and requires .NET Framework 4.5 or later. Start it, enter (a) a Wikipedia article name and (b) another URL, then click the button (or press Enter). You can also put the two parameters on the command line for easier scripting.

This is a clean-room implementation that uses no advanced language analysis or efficient text-matching logic (i.e., no Myers algorithm), but in this limited scenario it seems to come pretty close to what the wmflabs tool does. It only takes a second or two; except for very large articles, most of the time is spent downloading the two documents. I wrote this for myself, and the language processing for the URL is designed only for a certain class of Wikisource articles. Other document types will render poorly; that's probably the area most in need of improvement.

Theory of operation (along with a few practical tweaks):

* Load two documents and reduce them to plain text
* Reduce to a list of hashes of (alphanumeric) words
* Combine word hashes into grams
* Look for matching grams between the sources
* Use the extents of the matches to color the plain text
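The steps above can be sketched in Python (the tool itself is a Windows/.NET executable; the gram size of 3 and all names below are illustrative assumptions, not the tool's actual code):

```python
import re

def words(text):
    # Reduce plain text to a list of lowercase alphanumeric words
    return re.findall(r"[a-z0-9]+", text.lower())

def matching_extents(a_text, b_text, n=3):
    # Hash the words, combine consecutive hashes into n-grams, and find
    # grams shared by both sources. Python's hash() is consistent within
    # one run, which is all a same-run comparison needs.
    a = [hash(w) for w in words(a_text)]
    b = [hash(w) for w in words(b_text)]
    b_grams = {tuple(b[i:i + n]) for i in range(len(b) - n + 1)}
    # Result: (start, end) word-index extents in A whose gram also occurs
    # in B -- the raw input for the final coloring step
    return [(i, i + n) for i in range(len(a) - n + 1)
            if tuple(a[i:i + n]) in b_grams]

# e.g. matching_extents("the quick brown fox jumps",
#                       "a quick brown fox runs")  →  [(1, 4)]
```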

Technical notes: I tried a dozen n-gram-consolidating algorithms and have combined two of them. The goal is to maximize matches while minimizing false positives (there's no point in highlighting every "and" in the two sources). The algorithm may undergo more refinement.
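As one example of what consolidation means here: overlapping or adjacent match extents can be merged into longer runs, so a stretch of matching text is highlighted once rather than as many tiny fragments. This is only a sketch of that general idea, not the tool's actual combined algorithm:

```python
def merge_extents(extents):
    # Merge overlapping or adjacent (start, end) word-index ranges
    # into maximal runs, keeping the result sorted
    merged = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous run: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# e.g. merge_extents([(1, 4), (2, 5), (7, 10)])  →  [(1, 5), (7, 10)]
```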

I originally tried to make it do the Right Thing and run asynchronously and interruptibly, but ran into a strange thread-conflict exception. Leftover traces of that code remain in this synchronous version. The raw processing turned out not to take as long as I anticipated.
