Primary Data Pipeline

Paul Houle edited this page Jan 10, 2014 · 2 revisions

This proposal is based on the preliminary research done on monthly sums.

Initial Filtering

I learned that the vast bulk of entries in the "en" namespace are things that occur fewer than 10 times a month: we were seeing something like 100 million+ URIs rather than the roughly 10 million URIs that really exist in the "en" Wikipedia. Some of these might be Special: pages, Talk: pages and the like.

Thus, an obvious way to reduce the size of the data set (i.e. storage costs) and to speed up/cheapen the processing is to filter out the junk URIs that will never be resolvable to concepts.

So I am re-running the monthly averages, this time looking at all the records, not just the "en" records.

The plan is to go back through the hourly files and keep only records that were accepted for the month that the hourly files came from. This should save a lot of storage space, although we'll give some back if we have to put timestamps into the files.
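The two passes described above could be sketched roughly as follows. This is only an illustration under assumptions: the space-separated line layout (project, page title, view count, bytes) is the one used by the public Wikipedia page count dumps and is not specified in this proposal, the threshold of 10 comes from the "fewer than 10 times a month" observation, and the names `parse_line`, `monthly_accepted` and `filter_hourly` are hypothetical.

```python
from collections import Counter

# Assumed cut-off, based on the "fewer than 10 times a month" observation.
MIN_MONTHLY_VIEWS = 10


def parse_line(line):
    """Split one hourly record into ((project, title), count); None if malformed."""
    parts = line.split()
    if len(parts) < 3 or not parts[2].isdigit():
        return None
    return (parts[0], parts[1]), int(parts[2])


def monthly_accepted(hourly_lines, min_views=MIN_MONTHLY_VIEWS):
    """Sum counts over a month of hourly lines; keep keys at or above the threshold."""
    totals = Counter()
    for line in hourly_lines:
        parsed = parse_line(line)
        if parsed is not None:
            key, count = parsed
            totals[key] += count
    return {key for key, total in totals.items() if total >= min_views}


def filter_hourly(hourly_lines, accepted):
    """Yield only the hourly lines whose (project, title) was accepted for the month."""
    for line in hourly_lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[0] in accepted:
            yield line
```

In a real run the accepted set for a month would probably be too large to hold comfortably as a plain Python set, which is part of the motivation for a more compact membership structure.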

I think it makes sense to use a Bloom filter for the above task because it doesn't hurt anything if we have a small percentage of false positives. Right now I think the wanted records are about 10% of the records, and if we had a 1% false positive rate on the remaining 90%, the kept data would grow to about 11% of the original, which doesn't hurt much.
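To make the trade-off concrete, here is a minimal pure-Python Bloom filter sketch together with the size estimate. It is not the implementation planned for the pipeline: the class name, the SHA-256 based double hashing and the helper `expected_kept_fraction` are assumptions chosen for illustration, and the sizing uses the standard formulas for bit count and number of hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def expected_kept_fraction(wanted_fraction, false_positive_rate):
    """Fraction of records kept: all wanted ones plus false positives among the rest."""
    return wanted_fraction + (1 - wanted_fraction) * false_positive_rate
```

With 10% wanted records and a 1% false positive rate, `expected_kept_fraction(0.10, 0.01)` works out to 0.10 + 0.90 × 0.01 = 0.109, i.e. just under 11% of the original records.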

I still don't know whether it is best to gzip or bzip2 the new hourlies. On a cost basis, it's really a matter of the cost of long-term storage vs. the cost of processing.

Resolution to topics

An obvious thing to do next is to resolve the URIs to concepts. I considered two possibilities here: one is to resolve to DBpedia concepts and the other is to resolve to :BaseKB concepts. I rejected the latter, because I'd like to record the usage of "List of X", "Category" and other kinds of pages that don't exist in :BaseKB. The main step here will be following the redirect list to the target concept and rejecting anything that doesn't resolve against DBpedia.
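The redirect-following step might look something like the sketch below. It assumes the redirect list has already been loaded into a dictionary mapping a source title to its target and that the set of valid DBpedia concepts is available as a set; both structures, the function name `resolve` and the `max_hops` limit are hypothetical.

```python
def resolve(uri, redirects, known_concepts, max_hops=10):
    """Follow redirects starting at uri; return the final concept, or None if unresolved."""
    seen = set()
    current = uri
    while current in redirects:
        if current in seen or len(seen) >= max_hops:
            return None  # redirect loop or suspiciously long chain
        seen.add(current)
        current = redirects[current]
    # Reject anything that does not end at a known concept.
    return current if current in known_concepts else None
```

Counts for every URI that resolves to the same concept can then be summed under that concept, and URIs that return `None` are dropped.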

(What about Talk: pages?)

New monthlies

One thing I noticed with the old monthlies is that some topics have artificially high importance because they were featured as Google Doodles for a day or featured on the Wikipedia main page. These spikes could be chopped off by, say, rejecting the 8% highest values (say 2 days' worth) and taking the mean of the rest.
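A one-sided trimmed mean of this kind can be sketched as below. The function name and the choice to round the number of dropped values down are assumptions; the 8% default is the figure suggested above.

```python
import math


def top_trimmed_mean(values, trim_fraction=0.08):
    """Mean of the values after discarding the highest trim_fraction of them."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= trim_fraction < 1:
        raise ValueError("trim_fraction must be in [0, 1)")
    drop = math.floor(len(values) * trim_fraction)  # number of top values to discard
    kept = sorted(values)[: len(values) - drop]
    return sum(kept) / len(kept)
```

For a 30-day month of daily counts, 8% rounds down to 2 dropped days, so a topic with 28 ordinary days and a 2-day spike gets the mean of the ordinary days only.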

SubjectiveI 3D

It would be reasonable to sum up the monthlies to produce a '3D' version of the product. One issue is that the Wikipedias are more popular now than they used to be, so that it might make sense to normalize them to 2013 traffic levels.

(I've come to the conclusion that diurnal changes really do reflect change in "how much people are thinking about things", that is, if you are asleep and not visiting Wikipedia, you aren't thinking about anything. On the other hand, the ratio of "people reading Wikipedia" to "people thinking about things" has increased secularly over time. I'm not so sure about the meaning of seasonal variation, so normalizing at the month level seems reasonable to me.)
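One possible reading of month-level normalization is to rescale each month so that its total traffic matches a reference level and only then sum across months. The sketch below assumes the monthlies are held as nested dictionaries and that `reference_total` is some measure of typical 2013 monthly traffic; the data layout and both function names are hypothetical, and other normalizations would be equally defensible.

```python
def normalize_monthlies(monthlies, reference_total):
    """Scale each month's counts so the month's total equals reference_total.

    monthlies maps month -> {topic: views}.
    """
    normalized = {}
    for month, counts in monthlies.items():
        total = sum(counts.values())
        scale = reference_total / total if total else 0.0
        normalized[month] = {topic: views * scale for topic, views in counts.items()}
    return normalized


def sum_monthlies(monthlies):
    """Add up per-topic values across all months."""
    totals = {}
    for counts in monthlies.values():
        for topic, views in counts.items():
            totals[topic] = totals.get(topic, 0.0) + views
    return totals
```

Under this scheme a topic's share of traffic within each month is what carries over into the sum, so early low-traffic months are not drowned out by later ones.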