This extension provides several helpful functionalities for OpenRefine users who want to edit (structured data of) media files (images, videos, PDFs...) on Wikimedia Commons. For more info, documentation and how-tos about OpenRefine for Wikimedia Commons, see https://commons.wikimedia.org/wiki/Commons:OpenRefine.
Features included in this extension:
- Start an OpenRefine project by loading file names from one or more Wikimedia Commons categories (including category depth)
- Add columns with Commons categories and/or M-ids of each file name
- File names will already be reconciled when starting the project
- A few dedicated GREL commands allow basic processing and extraction of Wikitext: extractFromTemplate and value.extractCategories
- (In this extension's 0.1.1 release and later) Basic support for file thumbnail previews of existing Wikimedia Commons files. Thumbnails are displayed for some (but not all) file types/extensions. There is currently thumbnail support for jpeg, gif, png, djvu, pdf, svg, webm and ogv files.
This extension works with OpenRefine 3.6.x and later. It is not compatible with OpenRefine 3.5.x or earlier. (OpenRefine supports editing Wikimedia Commons from version 3.6; this is not possible in earlier versions.)
This extension was first released in October 2022. It has been funded by a Wikimedia project grant.
Download the .zip file of the latest release of this extension. Unzip this file and place the unzipped folder in your OpenRefine extensions folder. Read more about installing extensions in OpenRefine's user manual.
Once this extension is installed correctly, you will see the additional option 'Wikimedia Commons' when starting a new project in OpenRefine.
After installing this extension, click the 'Wikimedia Commons' option to start a new project in OpenRefine. You will be prompted to add one or more Wikimedia Commons categories.
There's no need to type the Category: prefix.
You can specify category depth by typing or selecting a number in the input field after each category. Depth 0 means only files from the current category level; depth 1 will retrieve files from one sub-category level down, etc.
Next, in the project preview screen (Configure parsing options), you can choose to also include a column with each file's M-id (unique MediaInfo identifier) and/or Commons categories.
File names will already be reconciled when your project starts.
When you load larger categories (thousands of files) in a new project, OpenRefine will start slowly and will give you a memory warning. This is a known issue. Wait for a bit; the project will eventually start. The Commons Extension has been tested with a project of more than 450,000 files.
The Wikimedia Commons Extension also enables two dedicated GREL commands, which help to extract specific information from the Wikitext of Wikimedia Commons files. (GREL, General Refine Expression Language, is a dedicated scripting language used in OpenRefine for many flexible data operations. For a general reference on using GREL in OpenRefine, see https://docs.openrefine.org/manual/grelfunctions.)
Firstly, retrieve the Wikitext from a list of Commons files in your project. In the column menu of the reconciled file names' column, select Edit column > Add column from reconciled values... and select Wikitext in the resulting dialog window.
From this new column with Wikitext, you can now extract values and categories as described below. Start by selecting Edit column > Add column based on this column... in the column menu. In the next dialog window, you can use various specific GREL commands:
To extract the value of a template parameter, use the following syntax:
extractFromTemplate(value, "BHL", "source")[0]
where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.
To extract all categories from the Wikitext, use the following syntax:
value.extractCategories().join('#')
This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.
Run
mvn package
This creates a zip file in the target folder, which can then be installed in OpenRefine.
To avoid having to unzip the extension in the corresponding directory every time you want to test it, you can also use another setup: simply create a symbolic link from your extensions folder in OpenRefine to the local copy of this repository. With this setup, you do not need to run mvn package when making changes to the extension, but you will still need to compile it with mvn compile if you are making changes to Java files, and restart OpenRefine if you make changes to any files.
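The symbolic link setup described above could look roughly like the sketch below. The function name link_extension, the link name commons and both paths in the usage comment are assumptions made for illustration; substitute the actual location of your checkout and of your OpenRefine extensions folder.

```shell
# Hypothetical helper: link a local checkout of this repository into
# OpenRefine's extensions folder. Neither the function nor the link name
# "commons" comes from the repository itself.
link_extension() {
  repo_dir="$1"        # local copy of this repository
  extensions_dir="$2"  # OpenRefine's extensions folder
  mkdir -p "$extensions_dir"
  # -s: symbolic link, -f: replace an existing link, -n: do not follow it
  ln -sfn "$repo_dir" "$extensions_dir/commons"
}

# Example call with placeholder paths:
#   link_extension "$HOME/code/CommonsExtension" "/path/to/openrefine/webapp/extensions"
```

After linking, run mvn compile in the checkout whenever Java files change, then restart OpenRefine.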
- Make sure you are on the master branch and it is up to date (git pull)
- Open pom.xml and set the version to the desired version number, such as <version>0.1.0</version>
- Commit and push those changes to master
- Add a corresponding git tag, with git tag -a v0.1.0 -m "Version 0.1.0" (when working from GitHub Desktop, you can follow this process and manually add the v0.1.0 tag with the description Version 0.1.0)
- Push the tag to GitHub: git push --tags (in GitHub Desktop, just push again)
- Create a new release on GitHub at https://github.com/OpenRefine/CommonsExtension/releases/new, providing a release title (such as "Commons extension 0.1.0") and a description of the features in this release.
- Open pom.xml and set the version to the expected next version number, followed by -SNAPSHOT. For instance, if you just released 0.1.0, you could set <version>0.1.1-SNAPSHOT</version>
- Commit and push those changes.
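As a memory aid for the release steps above, the two small shell helpers below print the tagging commands for a given version and derive the following -SNAPSHOT version. Both function names are hypothetical and not part of this repository, and the version arithmetic assumes a plain numeric x.y.z version without leading zeros. Editing pom.xml and creating the GitHub release remain manual steps.

```shell
# Hypothetical helpers for the release checklist; not part of the repository.

# Print the git commands that tag and publish a release, e.g. for 0.1.0.
release_tag_commands() {
  version="$1"
  printf 'git tag -a v%s -m "Version %s"\n' "$version" "$version"
  printf 'git push --tags\n'
}

# Derive the next development version: bump the last numeric component
# and append -SNAPSHOT (assumes a numeric x.y.z version, no leading zeros).
next_snapshot_version() {
  released="$1"
  prefix="${released%.*}"   # everything before the last dot, e.g. 0.1
  patch="${released##*.}"   # last component, e.g. 0
  printf '%s.%s-SNAPSHOT\n' "$prefix" "$((patch + 1))"
}

release_tag_commands 0.1.0
next_snapshot_version 0.1.0   # prints 0.1.1-SNAPSHOT
```

The helpers only print text, so they can be run safely before deciding to execute the actual git commands.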