GitHub - HughP/MLKA-Bash-data: Bash script for mashing mlka data

#MLKA-Bash-data The purpose is to run data through UnicodeCcount and TECKit enmass. The script then processes the data counts produced against various keyboard layouts and gives evaluations on what the keyboard layout efficentcy is for the text of a specific language.

This script is written in bash for processing and computing MLKA data. While some bash code does some functions, mostly bash functions as the glue pulling together several other programs written in other languages. Some python code referenced, and some perl dependencies are required.

There are three related repositories:

The MLKA project
- https://github.com/HughP/MLKA
MLKA-Test-Data which is a set of test data for testing and building this script.
- https://github.com/HughP/MLKA-Test-Data
Keyboard-File-Types which is simply a data array of various types of keyboard layout file types. This is used as a submodule for settings.
- https://github.com/HughP/Keyboard-File-Types
MLKA Keyboards which is a repository of keyboard layout files and their descriptions. This is used as a submodule for keyboard data.
- https://github.com/HughP/MLKA-Keyboards

This is written to work on OS X and linux. Tested on:

OS X 10.6.8 & 10.9.5
Ubuntu

##Requires these dependencies

UnicodeCCount - version 0.3

A sub-dependency here is Perl - We do not check for Perl. We do check for UnicodeCCount.
The script will not be successful and will output an error if allkeys.txt is not present in your Perl instance. This is a requirement of UnicodeCcount to operate. The error message will say: Your Perl installation is missing the UCA keys file. Please download http://www.unicode.org/Public/UCA/latest/allkeys.txt and put a copy into the '/usr/lib/x86_64-linux-gnu/perl/5.20/Unicode/Collate' folder.

TECKit - version 2.5.4
Typing by Michael Dickens

git clone https://github.com/michaeldickens/Typing.git

CSVfix version 1.6 More info

hg clone https://bitbucket.org/neilb/csvfix.
OS X users are encouraged to use homebrew via brew install csvfix.

WikiExtractor Script extracts and cleans text from Wikipedia database dump and stores output in a number of files of similar size in a given directory. This is a mirror of the script by Giuseppe Attardi - Which might actually be the original. http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

git clone https://github.com/bwbaugh/wikipedia-extractor.git.

Python (Preferably 2.7)

One option among many for OS X Users is brew install python.
Other Python dependencies:
pip
PyGal - for SVG production of graphs.
Pandas - A python module for data processing

Python scripts embeded in the .bash script:

WikipediaExtractor Cleaner by Matt Stave (with edits by Hugh Paterson III)
Stave + Paterson python script for counting digrams
Script for transposing/pivoting data in CSV files

JavaScript count by jkpat
Palaso-python module

##Roadmap

##Repo structure

.
├── Admin Stuff
│   ├── Deprecated-Scripts
│   │   ├── Research-Keyboards
│   │   └── cleancsv
│   ├── Keyboard-XML
│   │   ├── Apple CSS image
│   │   ├── Apple keyboard search
│   │   │   ├── Dark
│   │   │   │   ├── Apple Keyboard made via css3._files
│   │   │   │   └── css
│   │   │   └── Light
│   │   │       ├── Apple Keyboard made via css3._files
│   │   │       └── css
│   │   └── Ukrainian-Russian
│   ├── TMP_JD
│   ├── Test-Materials
│   │   ├── Igbo James text
│   │   ├── Igbo Keyboard Layout
│   │   └── Navajo
│   │       ├── Navajo James Text
│   │       └── Wikipedia extracted Text
│   └── Try this table
├── Data-Derived
├── Data-Source
│   ├── Data-James
│   ├── Data-Keyboard
│   ├── Data-Phonology-Orthography
│   ├── Data-Previous-Frequency-Stats
│   ├── Data-Wiki
│   └── TECkit-Files
├── Dependencies
│   ├── Data
│   │   └── MLKA-Test-Data
│   ├── Settings
│   │   └── Keyboard-File-Types
│   └── Software
│       └── wikipedia-extractor
└── Temp-Files
    ├── Input-Files-Lists
    └── Languages-Used

Admin Stuff

This folder is mostly for temporary stuff and working versions.

Organization of script files

Two .bash scripts are contained in root. awesome-script.bash and clean-up.bash. These script files call on other bash script and python files. These additional files are held in the Dependencies folder.

clean-up.bash

This script's purpose is to return the repo to a "clean" state so that the awesome-script.bash can run from start to finish. It was designed durring testing and devleopment. Therefore it's purpose is to return the repo to a state where testing can occur.

awesome-script.bash

This script's purpose is to implement the analysis. This script is why this repo exists.

Temp Files

The script awesome-script.bash creates lists which it stores as files. These lists are held in the Temp-Files folder.

Data-Source

The Data-Source folder houses where the source data is kept. Other copies of the data in various processed forms are created and housed in the Data-Derived folder.

##Corpus clean up process ###Wikipedia

download wikipedia data
Extractor script
Extractor cleaner
Paterson use of TECKit to clean residue left by Extractor cleaner
Typography character conversion by TECKit

###James

Remove SFM Markers i. Remove Verse ii. Remove Chapter iii. Remove Section headings iv. Create stated copy of text for reference.

##List of files The purpose of this section is to list the kinds of files and the quantity of files which are created and used during the data processing process. There are three kinds of files: those we start off with, temp-files which are created and then deleted by the script, and those which are generated along the way, but represent some type of analysis.

Some the files listed blow need to be intergrated into the outline below.


list of all characters supported by keyboard.
list of characters to be removed from text.
.map file to support the removal of characters.

.map file for each keyboard layout to transform the text to ASCII.
.tec file for each keyboard layout to transform the text to ASCII.

.map file

###Files we start off with

####Corpus Data

metadata file for corpus.
no touch copy of corpus.
working copy of corpus.

Types of Corpora

NT James
Wikipedia

####Stats Counts from other studies

Some languages have stats for character frequency. Some don't.

####Character Transforms

global Unicode to nfd mapping .map file.
global Typographical clean up. .map file to support the removal of typographical characters.
Corpus based clean up.
.map file for each keyboard layout to transform the text to ASCII.

####Keyboards

metadata file for keyboard.
text description for keyboard (how it works).
.kmn file for keyboard
.kmx file for keyboard
image of keyboard layout for layout.
Base image of keyboard for heatmap.
.keylayout file for keyboard.

###Temp Files

James-Corpus.txt - This file is used to create an output of the languages of James corpora. It is different than the file created by $JAMES_LIST_FILE which is called James-list.txt.

###Files Produced

####Corpus Data

Each working copy of each corpus has initial count: -d, -u, -c, -d -m,-m (6 files)
Each working copy of each corpus has second count: -d, -u, -c, -d -m,-m (6 files) -following the removal of SFM
Each working copy of each corpus has third count: -d, -u, -c, -d -m,-m (6 files) -following the removal of typographical characters.
Each working copy of each corpus has fourth count: -d, -u, -c, -d -m,-m (6 files) -following the conversion of Unicode text to ASCII equivalent for keyboard analysis.
list of characters to be removed from text.

####Character Transforms

.map file to support the removal of untypeable characters.
global unicode to nfd compiled mapping .tec file.
list of characters to be removed from text.
.tec file for each keyboard layout to transform the text to ASCII.
.tec file to implement the removal of untypeable characters.
.tec file to implement the conversions the typographical characters.

####Keyboards

image of keyboard for heatmap sample text.
image of keyboard for heatmap full text.
list of all characters supported by keyboard.

##Outline

Start
Metadata
Variables set
Directories
File Names
Other things
Dependencies
Software
Datafiles
Start with clean data processing folders
Clear files created
Clear compiled data
Look for data to move into data processing folders
Look for Wikipedia data
Look for James Data
Look for keyboard data

##Notes

###Notes about git.

Here is some thing about fetching from upstream and syncing your repo: https://help.github.com/articles/syncing-a-fork/

git remote -v
git remote add upstream https://   <<--put link here
git remote -v
git fetch upstream
git merge upstream/master

>>> Check and fix merged files <<<

git add --all
git push

###Notes for Hugh Yes you "can" write anything with as little code as possible. The question is do you want to actually work on code that is readable.

I've been programming my whole life and went to school for Computer Science. In my journey you come across people that write short cryptic code and they think, less is better. It is not. Six months from now when you don't have the code in your working memory you'll read some line that makes no sense because the programmer sacrificed readability for "impressive" or "quick" or "condensed" code.

This Code is for everybody not just programmers!

You want people to understand the code so they trust the code with their data. Yes, short code with little comments may work for you, now. But this project is on GitHub for everybody in the world. Do you want to have your code only "serviceable" by you right now? Or do you want others to use and adapt your code?

Here is some great advice: http://mywiki.wooledge.org/BashGuide/Practices

Your idea for this project is truly awesome! And giving it to the world is truly a blessing to all.

####Bash

For reference: http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
Arrays: http://mywiki.wooledge.org/BashGuide/Arrays ; http://tldp.org/LDP/abs/html/arrays.html
Nested Loops: http://www.tldp.org/LDP/abs/html/nestedloops.html

####Git

Cherry picking with git: https://www.kernel.org/pub/software/scm/git/docs/git-cherry-pick.html

####Notes by Hugh for where he got what.

@Jonathan to find this I was looking here: http://unix.stackexchange.com/questions/138634/shortest-way-to-extract-last-3-characters-of-base-minus-suffix-filename I am not sure how to implement this in this code base right now.

####From Martin

I have no such tool to hand, but using the palaso-python library, we could write one. The first thing is to work out exactly what you want. Do you simply want a list of every possible unicode character that a keyboard could produce or do you want a list of possible minimal strings that a keyboard could produce or a simple key-mapping (for which I already have a tool)?

Basically you will want something along the lines of:

import palaso.kmfl as kmfl
from palaso import kmn

kbd = kmfl(sys.argv[1])
allchars = set()
for i in range(kbd.numrules) :
        for s in map(kmn.item_to_char, kbd.flatten_context(i, side = 'r')) :
                allchars.add(s)
print allchars

beware this code is completely untested and therefore is highly unlikely not to have bugs in it. You'll want to write some code to prettify the output to what you want.

GB, Martin

Second Reply from Martin

So, I am talking about single "functional units". There may be multi-key processes to achieve production, and they may be encoded in multiple Unicode code points, but at some level they are a single production target in the text production process.

Does this help?

Indeed. It's all in the requirements. OK so the fragment changes slightly to:
import palaso.kmfl as kmfl
from palaso import kmn

kbd = kmfl(sys.argv[1])
allchars = set()
for i in range(kbd.numrules) :
       print map(kmn.item_to_char, kbd.flatten_context(i, side = 'r'))
OK. So you may want to take each of the outputs of the map and prettify it somewhat, but you get the idea? If you are still stuck, I can put together an ipython notebook for you on the topic :)

####From Marc

Hi Hugh,

I don’t have an immediate solution to your question – the Keyman source language is non-trivial to parse, although some for your requirements you may be able to get away with a lot less processing. We have Windows-based tools for analysis of a keyboard layout, but this may not be all that helpful to you.

I am not sure if you are up for writing your own script to parse the source files or not. If you are, then I would advise the following process:
  The file format can be ANSI, UTF-8, or UTF-16. Convert the file to your preferred format before parsing.
  Comments: For each line, strip any text following “c “ or “C “ – but only outside quotation marks
  Line concatenation: Then, if a line ends in a “\” (ignore whitespace), delete the backslash and concatenate with the next line
  Then there are only two types of lines to analyse, pseudo tokenized:
“store” “(“ store_name “)” value

context [“+” key] “>” value
  Ignore all other lines
  If the store_name token starts with “&” ignore the line.
  You will be interested only in the value and output tokens.  These are delimited by the close paren “)” token in the store lines and the greater than “>” token in the rule lines and finish at end of line in each case.
  Parse the value and output tokens:
Any “U+xxxx” is a Unicode character.

Any string of Unicode characters starts with a single quote (') or a double quote (") and finishes with the same quote.

Ignore any tokens between an open and a close paren.

Ignore any other tokens
I hope this helps and that I haven’t forgotten anything.

Cheers,

Marc

###Notes for all

Git PRO book For reference: https://git-scm.com/book/en/v2

Something about variables sourced from: http://linuxconfig.org/bash-scripting-tutorial
#!/bin/bash
#Define bash global variable
#This variable is global and can be used anywhere in this bash script
VAR="global variable"
function bash {
#Define bash local variable
#This variable is local to bash function only
local VAR="local variable"
echo $VAR
}
echo $VAR
bash
# Note the bash global variable did not change
# "local" is bash reserved word
echo $VAR

ash for and while loops

Name		Name	Last commit message	Last commit date
Latest commit History 225 Commits
Admin Stuff		Admin Stuff
Data-Source		Data-Source
Dependencies		Dependencies
Keyboard text folder		Keyboard text folder
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
awesome-script.bash		awesome-script.bash
clean-up.sh		clean-up.sh
example-ori-corpus-james-asg.txt		example-ori-corpus-james-asg.txt
example-ori-corpus-james-bkv.txt		example-ori-corpus-james-bkv.txt
example-ori-corpus-james-deu.txt		example-ori-corpus-james-deu.txt
example-ori-corpus-james-eng.txt		example-ori-corpus-james-eng.txt
example-ori-corpus-james-ibo.txt		example-ori-corpus-james-ibo.txt
example-ori-corpus-james-nav.txt		example-ori-corpus-james-nav.txt
ffwiki-latest-pages-articles.xml.bz2		ffwiki-latest-pages-articles.xml.bz2
hawiki-latest-pages-articles.xml.bz2		hawiki-latest-pages-articles.xml.bz2
igwiki-latest-pages-articles.xml.bz2		igwiki-latest-pages-articles.xml.bz2
nvwiki-latest-pages-articles.xml.bz2		nvwiki-latest-pages-articles.xml.bz2
test.sh		test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Admin Stuff

Organization of script files

clean-up.bash

awesome-script.bash

Temp Files

Data-Source

Types of Corpora

Second Reply from Martin

About

Releases

Packages

Contributors 2

Languages

License

HughP/MLKA-Bash-data

Folders and files

Latest commit

History

Repository files navigation

Admin Stuff

Organization of script files

clean-up.bash

awesome-script.bash

Temp Files

Data-Source

Types of Corpora

Second Reply from Martin

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages