A web app for crowdsourcing transcription, written in PHP and JavaScript. Licensed under the MIT license.
Requirements:
- Twig (1.9.2 included)
- sfYaml (included)
- uploadify (2.1.4 included)
- MediaElement.js (2.10.0 included)
- ffmpeg (for audio transcription)
Installation:
- Clone this repository or unpack the files.
- Create a database and user in MySQL.
- Copy `config.sample.yaml` to `config.yaml` and edit it.
- If you want to transcribe audio, install ffmpeg and set the path to it in `config.yaml`.
- Create the directory `htdocs/media` and give your web server (Apache, nginx, etc.) write access to it.
- Set your web server to use `htdocs` as the site's document root.
- In your `php.ini`, set `upload_max_filesize` to something big enough (`128M`, etc.).
- In your `php.ini`, set `post_max_size` to something big enough (`128M`, etc.).
- In your `php.ini`, set `max_file_uploads` to something big enough (`200`, etc.).
- Go to `/install` in your browser and create an admin account.
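To sanity-check the three `php.ini` values before a big upload, something like the following sketch may help. It assumes the whole batch is sent in a single request (the bundled uploader may well send files one at a time, in which case only the per-file limit matters) and it ignores request overhead. The function names and the `exampleLimits` object are hypothetical and not part of the app.

```javascript
// Convert PHP shorthand byte notation ("128M", "2G", "512K", "1048576") to bytes.
// PHP treats K, M and G as powers of 1024.
function parsePhpSize(value) {
  const match = /^(\d+)\s*([KMG]?)$/i.exec(String(value).trim());
  if (!match) throw new Error(`Unrecognized size: ${value}`);
  const units = { "": 1, K: 1024, M: 1024 ** 2, G: 1024 ** 3 };
  return Number(match[1]) * units[match[2].toUpperCase()];
}

// Check a batch of file sizes (in bytes) against the three php.ini limits.
// Returns a list of human-readable problems; an empty list means the batch fits.
function checkUploadBatch(fileSizes, limits) {
  const maxFile = parsePhpSize(limits.upload_max_filesize);
  const maxPost = parsePhpSize(limits.post_max_size);
  const problems = [];
  if (fileSizes.length > limits.max_file_uploads) {
    problems.push("more files than max_file_uploads allows");
  }
  if (fileSizes.some((size) => size > maxFile)) {
    problems.push("at least one file exceeds upload_max_filesize");
  }
  const total = fileSizes.reduce((sum, size) => sum + size, 0);
  if (total > maxPost) {
    problems.push("combined size exceeds post_max_size");
  }
  return problems;
}

// Hypothetical values mirroring the suggestions above.
const exampleLimits = {
  upload_max_filesize: "128M",
  post_max_size: "128M",
  max_file_uploads: 200,
};
console.log(checkUploadBatch([5 * 1024 * 1024, 12 * 1024 * 1024], exampleLimits));
```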
After installation, you'll end up on the dashboard. You can create a new project from here, or later from the Admin page. There are two project types:
- System: These go under `/projects/[slug]`. Generally intended for installations dedicated to a single project (like the Mormon Texts Project).
- User: These go under `/users/[username]/projects/[slug]`. Mostly intended for small private projects. (But they don't have to be private or small.)
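The two URL patterns above can be sketched as a small helper. This is only an illustration of the path layout; `projectPath` and the shape of the `project` object are assumptions, not the app's actual routing code.

```javascript
// Build the URL path for a project from its type, slug, and (for user projects) owner.
function projectPath(project) {
  const slug = encodeURIComponent(project.slug);
  if (project.type === "system") {
    return `/projects/${slug}`;
  }
  if (project.type === "user") {
    if (!project.owner) throw new Error("User projects need an owner username");
    return `/users/${encodeURIComponent(project.owner)}/projects/${slug}`;
  }
  throw new Error(`Unknown project type: ${project.type}`);
}

console.log(projectPath({ type: "system", slug: "journal" }));
console.log(projectPath({ type: "user", slug: "letters", owner: "alice" }));
```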
And there are three system roles for users:
- User: proofers and reviewers
- Creator: can create user projects
- Admin: site admin, can create system projects and user projects
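Read as a permission table, the roles above map to project types roughly like this. The lowercase role keys and the `canCreateProject` helper are hypothetical names used for illustration only.

```javascript
// Which project types each system role may create, per the list above.
const CREATABLE_PROJECT_TYPES = {
  user: [],                  // proofers and reviewers only
  creator: ["user"],         // can create user projects
  admin: ["system", "user"], // can create both kinds
};

function canCreateProject(role, projectType) {
  const allowed = CREATABLE_PROJECT_TYPES[role];
  return Array.isArray(allowed) && allowed.includes(projectType);
}

console.log(canCreateProject("creator", "system")); // false
console.log(canCreateProject("admin", "system"));   // true
```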
Choose the project type you want, then click the button. You can now fill out the project title, visibility (public or private), and an optional description and language. Public projects are visible to all users on the Projects page. Private projects are invisible to all except for the people who are members of the project.
Click `Create Project`. You'll be taken to the project admin page, where you can edit the project in detail.
- Project Members: This is for adding users to private projects, adding reviewers and admins, and removing people from projects. Type in the username, choose the role, and click `Add`. To remove a user, click the X to the right of their name.
- Custom Item Fields: If you want to add extra fields for users to fill out while transcribing, this is the place. See the examples.
- Status: Can be `Pending`, `Active`, or `Completed`. Only active projects show up on the projects page.
- Workflow: For now, leave as `@proofer, @proofer, @reviewer`.
- Characters: Space-separated list of characters to be put in the character pad (on the proofing page).
- Download Template: The template for each item when downloading a final transcript. Variables can be included using double curly brackets. Available variables:
  - `{{ transcript }}` -- the transcript itself (if there are reviews, this collates each item's reviews; if there are only proofs, this collates each item's proofs)
  - `{{ item.title }}` -- the item title
  - `{{ item.id }}` -- the item ID #
  - `{{ item.type }}` -- the item type (page, audio, etc.)
  - `{{ item.status }}` -- the item status
  - `{{ item.href }}` -- the item href (URL for filename)
  - `{{ project.title }}` -- the project title
  - `{{ project.public }}` -- the project's visibility (public or private)
  - `{{ project.slug }}` -- the project slug
  - `{{ project.language }}` -- the project language
  - `{{ project.description }}` -- the project description
  - `{{ project.owner }}` -- the project owner (a username)
  - `{{ project.status }}` -- the project status
  - `{{ project.guidelines }}` -- the project guidelines
  - `{{ proofers }}` -- a comma-separated list of the users who proofed this item
  - `{{ reviewers }}` -- a comma-separated list of the users who reviewed this item
  - `{{ fields.field_slug }}` -- the fields filled out by the user (the field slug is generated by taking the field name, lowercasing it, and replacing spaces with underscores; for example, "Page Number" becomes "page_number")
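To make the slug rule and the double-curly variables concrete, here is a minimal sketch. The app renders templates with Twig; `renderTemplate` below is a deliberately tiny stand-in that only substitutes dotted variable names, and both function names are hypothetical.

```javascript
// Field slug rule from the list above: lowercase, spaces become underscores.
function fieldSlug(fieldName) {
  return fieldName.toLowerCase().replace(/ /g, "_");
}

// Tiny stand-in for Twig: replaces {{ dotted.path }} with the matching value
// from the context object. Unknown variables render as an empty string.
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (whole, path) => {
    const value = path.split(".").reduce(
      (node, key) =>
        node != null && Object.prototype.hasOwnProperty.call(node, key)
          ? node[key]
          : undefined,
      context
    );
    return value === undefined || value === null ? "" : String(value);
  });
}

const exampleFields = { [fieldSlug("Page Number")]: "12" };
const renderedItem = renderTemplate(
  "{{ item.title }} (p. {{ fields.page_number }})\n{{ transcript }}",
  { item: { title: "Page 12" }, fields: exampleFields, transcript: "The text." }
);
console.log(renderedItem);
```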
On the project admin page, click `SELECT FILES` and upload the files you'd like to add to the project.
- Page: JPEG/PNG/GIF images
- Audio: MP3 files
For images, one item will be added per file. For audio, each MP3 will be split into segments (the length of the segment is set in the config file), and one item will be added per segment.
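As a rough way to estimate how many items an MP3 will produce, the sketch below plans fixed-length segments. It assumes the trailing partial segment becomes its own item, which the text above does not state; `planAudioSegments` is a hypothetical helper, and the actual splitting is done by ffmpeg.

```javascript
// Plan fixed-length segments for an audio file of the given duration.
// Assumption: a shorter final segment still counts as one item.
function planAudioSegments(durationSeconds, segmentSeconds) {
  if (!(segmentSeconds > 0)) throw new Error("Segment length must be positive");
  const segments = [];
  for (let start = 0; start < durationSeconds; start += segmentSeconds) {
    segments.push({
      start,
      length: Math.min(segmentSeconds, durationSeconds - start),
    });
  }
  return segments;
}

// A 125-second file with 60-second segments yields three items (60s, 60s, 5s).
console.log(planAudioSegments(125, 60).length);
```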
If you're doing OCR correction, the way to get the OCRed text into the project is by importing the transcript. Click the `Import Transcript` button.
On the right there's a list of the items that text will be imported for. If you want to get rid of any (because of blank pages or whatnot), click the X. This will only remove the items from the import procedure and won't delete them.
Paste your pre-existing text into the Transcript box. Then edit the template so it matches the transcript. The template is a regex that captures each item's text in a named group called `text`. For example, take the following pre-existing transcript:
<page>The text for page 1.</page>
<page id="2">The text for page 2.</page>
<page id="3">The text for page 3.</page>
This template matches the above text:
<page.*?>(?P<text>.*?)<\/page>
There is a live preview so you can make sure things match up. The number of items generated in the preview needs to match the number of items in the right sidebar.
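If you want to test a template outside the app, the matching can be approximated as follows. This is a sketch under assumptions: JavaScript writes named groups as `(?<text>...)` rather than `(?P<text>...)`, so the helper converts the syntax first, and it lets `.` span line breaks, which may or may not match how the app applies the template. `splitTranscript` is a hypothetical name.

```javascript
// Split a transcript into per-item texts using an import template.
function splitTranscript(transcript, template) {
  // Convert PCRE-style named groups to JavaScript syntax.
  const source = template.replace(/\(\?P</g, "(?<");
  // "g" finds every match; "s" lets "." cross line breaks (an assumption).
  const pattern = new RegExp(source, "gs");
  return Array.from(transcript.matchAll(pattern), (match) => {
    if (!match.groups || match.groups.text === undefined) {
      throw new Error("Template must define a named group called text");
    }
    return match.groups.text;
  });
}

const exampleTranscript = [
  "<page>The text for page 1.</page>",
  '<page id="2">The text for page 2.</page>',
  '<page id="3">The text for page 3.</page>',
].join("\n");
const exampleTemplate = String.raw`<page.*?>(?P<text>.*?)<\/page>`;

const importedTexts = splitTranscript(exampleTranscript, exampleTemplate);
// The number of texts should equal the number of items in the sidebar.
console.log(importedTexts.length, importedTexts);
```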
Once everything's ready, click `Import Transcript`.
Add yourself as a proofer (in the Project Members area), then click `Save and Activate`. (This is the same as changing the project status to `Active` and clicking `Save Project`.) Now click on the `Dashboard` link.
You should see your project in the Proofing Queue area. Click `Get new item`. This assigns the next available item to you and sends you to the proofing page.
Fill out the transcript. For page images, you can click on the image to set a highlight bar that helps you keep track of where you are.
Buttons on the proofing page:
- `Project Guidelines`: show guidelines for the project (if any have been set up)
- `Characters`: toggle a character pad where you can easily add characters that are hard to type
- `Save draft`: save the current transcript as a draft and return to the dashboard
- `Finish`: finish the current transcript and return to the dashboard
- `Finish & Continue`: finish the current transcript and proof the next item
When the requisite number of proofs for an item has been finished, the item moves to the review queue. Reviewing is similar to proofing, except that a review will concatenate existing proofs and populate the transcript box with the difference between them.
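One way to picture the hand-off from proofing to review is to count the steps in the workflow string. The sketch below assumes each `@proofer` entry stands for one required proof and each `@reviewer` entry for one required review; that reading is an interpretation of the default workflow, not something the app documents, and both function names are hypothetical.

```javascript
// Split a workflow string such as "@proofer, @proofer, @reviewer" into steps.
function parseWorkflow(workflow) {
  return workflow
    .split(",")
    .map((step) => step.trim())
    .filter(Boolean);
}

// Decide which queue an item belongs in, given how much work is finished.
function nextQueue(workflow, finishedProofs, finishedReviews) {
  const steps = parseWorkflow(workflow);
  const proofsNeeded = steps.filter((step) => step === "@proofer").length;
  const reviewsNeeded = steps.filter((step) => step === "@reviewer").length;
  if (finishedProofs < proofsNeeded) return "proofing";
  if (finishedReviews < reviewsNeeded) return "review";
  return "done";
}

console.log(nextQueue("@proofer, @proofer, @reviewer", 1, 0)); // proofing
console.log(nextQueue("@proofer, @proofer, @reviewer", 2, 0)); // review
```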
After proofing and reviewing is done, return to the project admin page and click `Download Transcript` in the upper left to download the final transcript, using the download template as described above.
Each entry in `config.yaml`, explained in more detail:
- `db`: The database engine. Right now `MySQL` is the only option. To add a new database adapter, create a new directory in `modules/db` with the name of the adapter, then create a `DbYourAdapterName.class.php` file in that directory and change this `db` entry to point to it. The adapter will need to expose all the functions found in `DbMySQL.class.php`.
- `auth`: The authentication engine. Right now `Alibaba` is the only option. To add a new authentication adapter, create an `AuthYourAdapterName.class.php` file in `modules/auth`.
- `database`: Database settings. Generally, you'll want to leave `host` as `127.0.0.1` (localhost), but you'll need to set `database` to the name of the database you've created, `username` to the database user's username, and `password` to the database user's password.
- `title`: The page title for this installation of Unbindery. Shows up in the upper left of all pages.
- `app_url`: Ignore the `&SITEROOT` at the beginning; it's used so you don't have to define this again later on. Change `http://path/to/unbindery` to the URL you're hosting Unbindery at.
- `sys_path`: The filesystem path to Unbindery.
- `admin_email`: Email address for the site administrator.
- `language`: Default is `en`. If you change this, make sure you have a corresponding file in the `translations/` directory.
- `email_subject`: A string that is prepended to any notification emails. Can be blank.
- `theme_cached`: If you want the Twig templates cached, set this to `true`.
- `theme`: The system theme. Default is `core`.
- `external_login`: If you change the auth adapter to something else (like Google account), set this to `true`.
- `allow_signup`: Whether to show the signup link on the home page. Only applies if `external_login` is `false`.
- `download_template`: The default download template for new projects.
- `alibaba`: Configuration options for Alibaba. You shouldn't need to change anything here.
- `system_guidelines`: Guidelines to show up on every project page before project-specific guidelines. HTML.
- `private_key`: Used for web service authentication. Not currently used.
- `devkeys`: Used for web service authentication. Not currently used.
- `google_analytics`: If you want to hook your installation of Unbindery up to Google Analytics, put your UA code here.
- `scoring`: How many points users get for proofing and reviewing.
- `editors`: A list of editor types (`page` and `audio` are the defaults), with `css` or `js` arrays for CSS/JS files to be included (paths are relative to `htdocs/themes/[theme]/[css|js]/editors/[editor-type]`).
- `uploaders`: A list of item uploader module types. The corresponding PHP files are found in `modules/uploaders`. For example, the code for the `Page` uploader type is in `modules/uploaders/PageUploader.class.php`. Each uploader type has an `extensions` field which is an array of extensions this item type uploader handles. Additional options may also be set (as seen in the `Audio` uploader options).
- `notifications`: A list of notifications, with an array of targets for each. Targets can be `@user`, `@projectadmin`, or `@admin`. If you add new notifications, make sure you add them to the translations file as well.
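The `uploaders` entry is essentially a lookup from file extension to uploader type. The sketch below illustrates that lookup with a hypothetical in-memory copy of the config; the exact extension lists and the `uploaderFor` helper are assumptions, while the class file naming follows the `PageUploader.class.php` example above.

```javascript
// Hypothetical parsed form of the uploaders section of config.yaml.
const exampleUploaders = {
  Page: { extensions: ["jpg", "jpeg", "png", "gif"] },
  Audio: { extensions: ["mp3"] },
};

// Find the uploader type that handles a filename, based on its extension.
function uploaderFor(filename, uploaders) {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return null;
  const extension = filename.slice(dot + 1).toLowerCase();
  for (const [type, options] of Object.entries(uploaders)) {
    if (options.extensions.includes(extension)) {
      return { type, classFile: `modules/uploaders/${type}Uploader.class.php` };
    }
  }
  return null; // no uploader is configured for this extension
}

console.log(uploaderFor("scan-001.PNG", exampleUploaders));
console.log(uploaderFor("notes.txt", exampleUploaders));
```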
To add a new editor type (for example, one with XML tagging support), you need to do the following:
- Add the editor type to `config.yaml` (see above).
- Add the HTML for the editor at `templates/core/editors/[editor-name].html`
- Add the CSS for the editor at `htdocs/themes/core/css/editors/[editor-name]/[editor-name].css`
- Add the JavaScript for the editor at `htdocs/themes/core/js/editors/[editor-name]/[editor-name].js`
You can look at the existing editors to see how things are set up.
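For a quick checklist of the files a new editor needs, the paths above can be generated from the editor name. `editorFiles` is a hypothetical helper; it defaults to the `core` theme, and using the same theme name under `templates/` for other themes is an assumption.

```javascript
// List the files to create for a new editor type.
function editorFiles(editorName, theme = "core") {
  return {
    html: `templates/${theme}/editors/${editorName}.html`,
    css: `htdocs/themes/${theme}/css/editors/${editorName}/${editorName}.css`,
    js: `htdocs/themes/${theme}/js/editors/${editorName}/${editorName}.js`,
  };
}

// Example: an editor with XML tagging support, named "xml" here.
console.log(editorFiles("xml"));
```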
To add a new item uploader type (for example, one that takes a PDF and splits it into individual pages and converts each page to a JPEG), you need to do the following:
- Add the uploader type to `config.yaml` (see above).
- Add the PHP for the uploader at `modules/uploaders/[UploaderName]Uploader.class.php`
An uploader extends the `ItemTypeUploader` class and defines the `preprocess` function, which takes a list of filenames (`$filenames`) and processes them. The uploader needs to set the following properties:
- `$this->files`: the final array of files to be moved into the `media` directory. For example, the PDF splitter would create a new entry in `$this->files` for each page of the original PDF.
- `$this->itemData`: an array of `$item` entries (each itself an array of `title`, `project_id`, `transcript`, `type`, and `href`), used to add the items to the database. Needs to have the same number of items as `$this->files`.
See the audio uploader type for an example.
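Real uploaders are PHP classes, so the following JavaScript sketch only models the data an uploader is expected to produce, using the PDF splitter as the example. The output filename pattern, the item titles, and the `planPdfSplit` name are all hypothetical, and the actual PDF-to-JPEG conversion is not shown.

```javascript
// Model the files and itemData a PDF-splitting uploader would need to set.
function planPdfSplit(pdfFilename, pageCount, projectId) {
  const base = pdfFilename.replace(/\.pdf$/i, "");
  const files = [];
  const itemData = [];
  for (let page = 1; page <= pageCount; page++) {
    // One output file and one item per page of the original PDF.
    const file = `${base}-${String(page).padStart(3, "0")}.jpg`;
    files.push(file);
    itemData.push({
      title: `${base} page ${page}`,
      project_id: projectId,
      transcript: "",
      type: "page",
      href: file,
    });
  }
  return { files, itemData };
}

const splitPlan = planPdfSplit("book.pdf", 2, 7);
// itemData must have the same number of entries as files.
console.log(splitPlan.files, splitPlan.itemData.length === splitPlan.files.length);
```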
- Thanks to Ryan Martinsen for his fork of Raymond Hill's FineDiff.