multivac-wikipedia

Reusable code, libraries, and scripts for processing Wikipedia dumps (page content, page views, etc.) with Apache Spark (SQL, ML, and GraphX).

build_pageviews

This repo does the following:

  • Downloads hourly pageviews for all Wikipedia projects (daily)
  • Cleans the data and creates a DataFrame
  • Saves the DataFrame as incrementally and dynamically partitioned Parquet files
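The three steps above can be sketched in plain Python, without Spark, to show the logic involved. This is only an illustration and not the code in this repo: the helper names (`hourly_urls`, `parse_line`, `partition_dir`) and the `year=/month=/day=` partition layout are assumptions. The URL pattern and the four space-separated fields (project code, page title, request count, response size) follow the public Wikimedia hourly pageviews dumps.

```python
from datetime import date
from typing import List, NamedTuple, Optional

# Public location of the hourly pageviews dumps.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


class PageView(NamedTuple):
    project: str       # e.g. "en", "en.m", "de"
    title: str         # page title
    requests: int      # number of requests in that hour
    bytes_served: int  # total response size


def hourly_urls(day: date) -> List[str]:
    """Return the 24 hourly dump URLs for one day (hypothetical helper)."""
    prefix = f"{BASE_URL}/{day:%Y}/{day:%Y-%m}"
    return [
        f"{prefix}/pageviews-{day:%Y%m%d}-{hour:02d}0000.gz"
        for hour in range(24)
    ]


def parse_line(line: str) -> Optional[PageView]:
    """Parse one dump line; return None for malformed lines (the cleaning step)."""
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    project, title, requests, size = parts
    try:
        return PageView(project, title, int(requests), int(size))
    except ValueError:
        return None


def partition_dir(root: str, day: date) -> str:
    """Build a Hive-style partition path (assumed layout, not the repo's)."""
    return f"{root}/year={day:%Y}/month={day:%m}/day={day:%d}"
```

With a layout like this, each daily run only writes into its own partition directory, which is one way to add new days incrementally without rewriting existing Parquet files.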

Read more about this repo

Showcase

Wikipedia PageViews in December 2017

Number of rows: 4,529,669,792 (4.5 billion)

Sum of requests: 15,278,050,138 (15.3 billion)

+---------+-------------+
|project  |sum(requests)|
+---------+-------------+
|en.m     |3784911811   |
|en       |3632828923   |
|ja.m     |578906226    |
|ru       |532707570    |
|es.m     |507966307    |
|de.m     |464186949    |
|de       |463264619    |
|ja       |379715338    |
|ru.m     |369216509    |
|fr.m     |361069999    |
|it.m     |328056166    |
|fr       |318697185    |
|es       |314862963    |
|zh       |206919597    |
|pt.m     |172852499    |
|it       |161235234    |
|zh.m     |149878515    |
|ar.m     |127827169    |
|pl       |125353954    |
|pl.m     |113954004    |
|pt       |108576418    |
|commons.m|105668930    |
|id.m     |89284575     |
|fa.m     |88369910     |
|nl.m     |77441421     |
|nl       |67609149     |
|sv.m     |57038991     |
|en.zero  |52135201     |
|www.wd   |48210254     |
|ar       |42496761     |
+---------+-------------+
only showing top 30 rows
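
A table like the one above is a group-by on the project column, a sum over requests, and a descending sort. The sketch below shows the same aggregation in plain Python on a few made-up sample lines. It is not the Spark job used to produce the figures above, and `top_projects` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_projects(lines: Iterable[str], n: int = 30) -> List[Tuple[str, int]]:
    """Sum requests per project and return the n largest totals."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        # Expected fields: project, title, requests, response size.
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip malformed lines
        totals[parts[0]] += int(parts[2])
    return totals.most_common(n)


# Made-up sample lines, for illustration only.
sample = [
    "en Main_Page 10 0",
    "en.m Main_Page 25 0",
    "en Apache_Spark 5 0",
    "de Hauptseite 7 0",
    "broken line",
]

for project, requests in top_projects(sample, n=3):
    print(f"{project:<10}{requests}")
```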

Read more

Testing Environment

  • Spark 2.2 Local / IntelliJ
  • Spark 2.2 / Cloudera CDH 5.13 / YARN (cluster and client modes)

Code of Conduct

This project, like all github.com/multivacplatform projects, is governed by the Multivac Platform Open Source Code of Conduct. Additionally, see the Typelevel Code of Conduct for specific examples of harassing behavior that are not tolerated.

Copyright and License

Code and documentation copyright (c) 2017-2019 ISCPIF - CNRS. Code released under the MIT license.