Using Concepticon data from the CLDF dataset

CLDF is a package format for linguistic data: it bundles a set of tables as CSV files together with JSON metadata that describes, among other things, the relations between these tables.

First, familiarize yourself with the data model by looking through cldf/README.md and the entity-relationship diagram below.
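
The tables bundled in the dataset (and how they relate) are spelled out in cldf/Wordlist-metadata.json. As a quick sketch (assuming the jq JSON processor is installed), you can list the CSV files referenced by the metadata:

$ jq -r '.tables[].url' cldf/Wordlist-metadata.json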

It's all just text files!

The files in a CLDF dataset are just text files (even though the text may comply with higher-level formatting rules like CSV or JSON). Thus, basic exploration of the data can easily be done with the tools available in the Unix Shell.

You can list the files in the dataset

$ ls -1 cldf
concepticon.csv
conceptlists.csv
concepts.csv
CONTRIBUTORS.md
glosses.csv
languages.csv
media.csv
parameter_network.csv
README.md
relationtypes.csv
requirements.txt
retired.csv
sources.bib
tags.csv
Wordlist-metadata.json

count rows in CSV tables

$ wc -l cldf/languages.csv 
60 cldf/languages.csv

and inspect the headers and data

$ head -n 2 cldf/languages.csv 
ID,Name,Macroarea,Latitude,Longitude,Glottocode,ISO639P3code
afrikaans,Afrikaans,,-22.0,30.0,,

Most of these are tables, though!

While the standard Unix tools can get you a long way towards parsing tabular data formatted as CSV, tools like the ones provided by csvkit will make this a lot easier.

E.g. you could list all English glosses mapped to FIRE by running

$ csvgrep -c Parameter_ID -r "^221$" cldf/glosses.csv | csvcut -c Language_ID,Form | sort | uniq -i | grep english
english,*fire
english,fire
english,fire 
english,fire*
english,fire-
english,Fire
english,"FIRE, (FIREWOOD)"
english,fire (for cooking)/heat
english,fire (n.)
english,the fire

Notes:

  • csvgrep allows you to "grep" in a CSV-aware way, i.e. considering only specified columns in a line.
  • csvcut allows you to cut out relevant columns of a CSV file, thus creating output that is more easily manipulated using standard shell tools.
  • The gloss FIRE, (FIREWOOD) contains a comma, hence naively trying to split columns by splitting text lines at commas would have failed (see the short demonstration below).
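
The last point is easy to demonstrate with a small constructed CSV snippet (illustrative data only): splitting lines at raw commas breaks the quoted field apart, while csvcut parses it correctly.

$ printf 'Language_ID,Form\nenglish,"FIRE, (FIREWOOD)"\n' | cut -d, -f2
Form
"FIRE
$ printf 'Language_ID,Form\nenglish,"FIRE, (FIREWOOD)"\n' | csvcut -c Form
Form
"FIRE, (FIREWOOD)"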

Related tables, that is.

If we wanted to list the concepts of a conceptlist together with their glosses and the mapped Concepticon glosses, we'd have to join data from two tables: glosses.csv and concepticon.csv (exploiting the fact that the concept identifiers contain the conceptlist identifier):

$ csvjoin cldf/glosses.csv cldf/concepticon.csv -c Parameter_ID,ID | csvcut -c Form,Concept_ID,Name | csvgrep -c Concept_ID -m"Swadesh-1955-100"
Form,Concept_ID,Name
all,Swadesh-1955-100-1,ALL
bone,Swadesh-1955-100-10,BONE
yellow,Swadesh-1955-100-100,YELLOW
*breast,Swadesh-1955-100-11,BREAST
burn,Swadesh-1955-100-12,BURN
*claw,Swadesh-1955-100-13,CLAW
cloud,Swadesh-1955-100-14,CLOUD
...

Now, if we wanted to do this "properly", i.e. use only the conceptlist identifier Swadesh-1955-100 as input, we'd have to join data from concepts.csv, too, which gets a bit cumbersome on the shell. But CLDF contains all the information necessary to load a dataset into a relational database - and the pycldf package can turn any CLDF dataset into a SQLite database, which can be queried more comfortably using SQL.

$ cldf createdb cldf/Wordlist-metadata.json concepticon.sqlite
INFO    <cldf:v1.0:Wordlist at cldf> loaded in concepticon.sqlite
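
Before writing queries against the database, you can check which tables the import created using sqlite3's built-in .tables command (the queries below refer to these table names, e.g. ParameterTable or `concepts.csv`):

$ sqlite3 concepticon.sqlite ".tables"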

Now we can list the concepts of a conceptlist much like the web app does (e.g. at https://concepticon.clld.org/contributions/Swadesh-1955-100), running a query like

SELECT c.cldf_id, ft.cldf_form, pt.cldf_name FROM
  ContributionTable as ct,
  `concepts.csv` as c,
  FormTable as ft,
  ParameterTable as pt
WHERE
  ct.cldf_id = c.cldf_contributionReference AND
  ft.Concept_ID = c.cldf_id AND
  pt.cldf_id = ft.cldf_parameterReference AND
  ct.cldf_id = 'Swadesh-1955-100'
ORDER BY
  c.`Index`

via

$ sqlite3 concepticon.sqlite -header < query.sql

to get

cldf_id             cldf_form  cldf_name
Swadesh-1955-100-1  all        ALL
Swadesh-1955-100-2  ashes      ASH
Swadesh-1955-100-3  bark       BARK
Swadesh-1955-100-4  belly      BELLY
Swadesh-1955-100-5  big        BIG
...                 ...        ...

SQLite can also help with querying the graph of relations between conceptsets; these relations are modeled as edges, i.e. as rows in the ParameterNetwork table. E.g. all conceptsets that are narrower than BROTHER can be found using the following recursive common table expression:

WITH RECURSIVE
  narrower(n) AS (
    SELECT cldf_id FROM ParameterTable WHERE cldf_name = 'BROTHER'
    UNION
    SELECT cldf_sourceParameterReference FROM ParameterNetwork, narrower
     WHERE ParameterNetwork.cldf_targetParameterReference=narrower.n AND ParameterNetwork.relation = 'broader'
  )
SELECT cldf_id, cldf_name FROM ParameterTable
 WHERE cldf_id IN narrower;

cldf_id  cldf_name
1262     BROTHER
1759     OLDER BROTHER
1760     YOUNGER BROTHER
2414     OLDER BROTHER (OF MAN)
2415     OLDER BROTHER (OF WOMAN)
2416     YOUNGER BROTHER (OF MAN)
2417     YOUNGER BROTHER (OF WOMAN)
559      BROTHER (OF MAN)
560      BROTHER (OF WOMAN)
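
Following the same edges in the opposite direction collects the conceptsets that are broader than a given one: simply swap the source and target columns in the recursive step. A sketch, using the same column names as above and one of the conceptsets from the output (result not shown here):

WITH RECURSIVE
  broader(n) AS (
    SELECT cldf_id FROM ParameterTable WHERE cldf_name = 'OLDER BROTHER (OF MAN)'
    UNION
    SELECT cldf_targetParameterReference FROM ParameterNetwork, broader
     WHERE ParameterNetwork.cldf_sourceParameterReference=broader.n AND ParameterNetwork.relation = 'broader'
  )
SELECT cldf_id, cldf_name FROM ParameterTable
 WHERE cldf_id IN broader;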

We might also look for conceptsets where many of the mapped concepts have English glosses that differ from the "default" Concepticon gloss, using a query like

SELECT
    concepticon_id, concepticon_gloss, gloss, max(c) AS ngloss
FROM (
    SELECT
        p.cldf_id AS concepticon_id,
        p.cldf_name AS concepticon_gloss,
        c.cldf_name AS gloss,
        count(c.cldf_id) AS c
    FROM
        parametertable AS p,
        `concepts.csv` AS c,
        formtable AS f
    WHERE
        c.cldf_parameterreference = p.cldf_id AND
        f.concept_id = c.cldf_id AND
        f.cldf_languagereference = 'english'
    GROUP BY p.cldf_id, c.cldf_name
) AS q
WHERE
    lower(concepticon_gloss) != lower(gloss) AND
    cast(concepticon_id as int) != 0
GROUP BY concepticon_id
HAVING ngloss > 5
ORDER BY cast(concepticon_id AS int) LIMIT 5;

concepticon_id  concepticon_gloss  gloss     ngloss
2               DUST               the dust  6
12              MIDDAY             noon      39
21              TASTE (SOMETHING)  taste     15
24              PRAY               to pray   7
38              SPREAD OUT         spread    7

Dataset commands

concepticon-cldf provides advanced functionality through dataset-specific sub-commands to be run with cldfbench. To make these sub-commands available, you need to "install" the dataset as a Python package, running

pip install -e .

cldfbench concepticon.intersection

You can compute concepts shared between conceptlists with the cldfbench concepticon.intersection command. A "classical" use case would be replicating the claim that

Comrie-1977-207 = Swadesh-1952-200 + Swadesh-1955-100

First, we merge the two Swadesh lists:

$ cldfbench concepticon.intersection Swadesh-1955-100 Swadesh-1952-200 --maxdist 2 --output-union | \
  wc -l
207

Note that this took quite a while (~20 secs), because the implementation first loads the data into an in-memory SQLite database. If we specify the database we already created above, the command runs a lot quicker:

$ time cldfbench concepticon.intersection --db concepticon.sqlite Swadesh-1955-100 Swadesh-1952-200 --maxdist 2 --output-union | wc -l
207
real    0m1,512s

Then we pipe the merged Swadesh list as the first list into a second cldfbench concepticon.intersection call (using the - notation):

$ cldfbench concepticon.intersection Swadesh-1955-100 Swadesh-1952-200 --maxdist 2 --output-union | \
  cldfbench concepticon.intersection - Comrie-1977-207 --maxdist 3 --equate-instanceof --output-union --verbose | \
  wc -l
207

So the union of Swadesh-1955-100 and Swadesh-1952-200 yields 207 concepts, and merging these with the concepts of Comrie-1977-207 yields (roughly) the same 207 concepts.

Note that for this to work out, we had to involve Concepticon's conceptset relations when comparing concepts for equality, using the --maxdist and --equate-instanceof options. (Run cldfbench concepticon.intersection -h for details.)