For local development documentation, see docs/local-development.rst.
DativeBase is an application for linguistic data management. It is designed to be useful for linguists, language revitalizers, teachers, linguaphiles, and anybody who needs to manage language-focused data. DativeBase facilitates storing, searching, sharing, and analyzing linguistic data.
DativeBase is the successor of the earlier Dative/OLD project. Dative/OLD and DativeBase are both open-source software. However, there is only one significant public deployment of Dative/OLD, namely the one served at app.dative.ca. The plan is for the DativeBase rewrite to ultimately replace the Dative/OLD app at app.dative.ca.
Table of Contents:
- Authorization
- User Flows
- Data Model
- Continuous Integration & Deployment
- TODOs
- Principles
- How Immutable Data Works in DativeBase
- History of DativeBase
- Components
- Proof-of-concept Feature Brief for Read-only Offline Functionality
- Local Development
- Local SwaggerUI
- Docker
- The Online Linguistic Database (OLD)
- Usage
- Database Migrations
This section details the authorization rules in DativeBase. The guiding principle underlying the authorization design decisions described below is that anyone should be able to sign up for a free DativeBase plan and get started using DativeBase right away.
Our first important, foundational distinction is that between OLD-specific resources, like forms, and OLD-independent resources, like users, plans and OLDs.
In general, if a user has the contributor or administrator role for a given OLD, then that user will be authorized to make mutative (data-changing) requests on any resource under said OLD. A user with the viewer role under an OLD can only make read requests on that OLD.
A second salient distinction is superuser status. Each user is a superuser or a non-superuser. Most users are non-superusers. A superuser can, in general, perform any read or write action in the system. All entities must reference a user as their creator and updater. The only exception to this is the user entity itself, which must be bootstrap-able without a user.
An OLD must have an active plan in order to be usable. If an action on an OLD is not covered by the entitlements granted by the plan, then the action will be prohibited.
Users are the entrypoint to DativeBase. Anyone can create a user and then use that user to create a free plan and then a number of OLDs running under that free plan.
A new user may be created via the create-user operation, i.e., POST /users.
Authentication is not required in order to create a user in DativeBase. Such user creation is effectively signup. Anybody on the public internet should be able to sign up to DativeBase. They should be able to create a user, a free plan, and one or more OLDs (with restricted entitlements) under said plan.
Obviously, since a superuser has unlimited access, a user created without authentication may never be a superuser.
In addition, a user created without authentication must be activated before it can be used. User activation means hitting a specific endpoint with a specific, randomly generated UUID in its path. In production, this URL will be emailed to the user.
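The signup and activation rules above can be illustrated with a minimal Python sketch. The `User` class, its field names, and the `sign_up` and `activate` functions are hypothetical and are not taken from the DativeBase code base; the sketch only shows that an unauthenticated signup never yields a superuser and that activation requires the randomly generated key.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class User:
    email: str
    is_superuser: bool = False
    activated: bool = False
    # Random key embedded in the activation URL (field name is an assumption).
    registration_key: str = field(default_factory=lambda: str(uuid.uuid4()))


def sign_up(email: str, requested_superuser: bool = False) -> User:
    """Unauthenticated user creation: the result is never a superuser."""
    # Any request for superuser status is ignored during open signup.
    return User(email=email, is_superuser=False)


def activate(user: User, key_from_url: str) -> bool:
    """Activate the user only if the key in the URL matches the stored key."""
    if key_from_url == user.registration_key:
        user.activated = True
    return user.activated
```

In this sketch, a user who has not yet visited the activation URL remains unactivated, regardless of what was requested at signup.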
User update is only allowed to superusers and the target user itself. Only a superuser can update a user into a superuser. Therefore, a superuser can only be created by someone with backend access to the DativeBase system.
User deletion is prohibited. Soft deletion may be supported in the future. Lossy or non-lossy user redaction may also be supported in the future. The challenge with user deletion is that provenance is crucial to a knowledge base such as DativeBase. Therefore, full user deletion, without careful attention, would corrupt the data.
It would probably be wise to support user deactivation (as a minimal user deletion strategy) in the short term. It should be noted that a user could still exist in the system while having access to no OLDs and no plans, which in itself is a form of deactivation.
In order to reset the password of a user, the following steps must be taken.
- The user makes a GET /users/<ID>/initiate-password-reset call.
- DativeBase refreshes the user's registration key and emails this key to the email address of the user.
- The user makes a PUT /users/<ID>/reset-password call. The JSON payload contains the new password and the secret-key, whose value is the registration key that we emailed to the user. If successful, the password of the user will be set to the password supplied in this PUT request.
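The password reset steps above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: the `users` store, the `outbox` list standing in for the email service, and both function names are hypothetical, and a real server would hash the password rather than store it as plain text.

```python
import uuid

# Hypothetical in-memory user store keyed by user ID.
users = {
    "u1": {"email": "a@example.com", "password": "old", "registration_key": "k0"}
}
outbox = []  # stands in for the email service


def initiate_password_reset(user_id: str) -> None:
    """Refresh the registration key and 'email' it to the user."""
    user = users[user_id]
    user["registration_key"] = str(uuid.uuid4())
    outbox.append((user["email"], user["registration_key"]))


def reset_password(user_id: str, payload: dict) -> bool:
    """Set the new password only if secret-key matches the registration key."""
    user = users[user_id]
    if payload.get("secret-key") != user["registration_key"]:
        return False
    user["password"] = payload["password"]  # a real server would hash this
    return True
```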
Any authenticated and activated user can view (read) the set of users in DativeBase. Users need to be able to view other users in order to be able to add these users to their OLDs and/or to their plans.
On the other hand, users should be able to submit a request for access to an OLD and administrators should be able to view such requests.
Note that non-superusers receive limited user data. They are not able to view the email addresses of users, for example. A non-superuser can view their own data in full, however.
Once a non-superuser has been created, the typical next step is to create a free plan with that user. A free plan allows limited access to the DativeBase service. The details are still to be developed. However, we may provisionally assume that each free plan allows for 3 OLDs running under it, each with a maximum number of forms. Further restrictions may be enabled later.
Any user may create a new, free plan. This is accomplished via a POST /plans request.
However, each (non-superuser) user is permitted to be the manager of at most 1 plan. Given that creating a plan also entails the creator receiving a manager role on said plan, this means, in effect, that each (non-superuser) user can only create one plan. (If the user revokes their manager role over the plan, then they may create a new plan.)
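A minimal Python sketch of the one-plan-per-manager rule follows. The function name and the shape of `plan_managers` (a mapping from plan ID to the set of manager user IDs) are assumptions made for illustration only.

```python
def can_create_plan(user_id, is_superuser, plan_managers):
    """Return True if the user may create a new plan.

    plan_managers maps plan ID -> set of manager user IDs (hypothetical shape).
    """
    if is_superuser:
        return True
    # A non-superuser may manage at most one plan, and creating a plan
    # makes the creator a manager of it.
    managed = [p for p, managers in plan_managers.items() if user_id in managers]
    return len(managed) == 0
```

Once a user revokes their manager role on their existing plan, the same check permits them to create a new one.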
Plan update is not currently supported. The only property of a plan that can meaningfully be updated is the tier and upgrading the tier from free to higher requires a billing event.
A plan can be deleted by a superuser or one of the plan's managers. However, a plan cannot be deleted while it is supporting OLDs. If any OLDs are running under a plan, then these OLDs must first be removed from the plan before it can be deleted. To remove an OLD from a plan, update the OLD (PUT /olds/:id) while setting the plan ID to nil.
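The plan deletion guard can be sketched in Python as shown here. The function name and both mappings are hypothetical: `plan_managers` maps a plan ID to its manager user IDs, and `old_plans` maps an OLD slug to the ID of the plan covering it (or `None`).

```python
def can_delete_plan(plan_id, user_id, is_superuser, plan_managers, old_plans):
    """Return True if the user may delete the plan right now."""
    # Only superusers and managers of the plan are authorized.
    authorized = is_superuser or user_id in plan_managers.get(plan_id, set())
    # A plan that still covers any OLD cannot be deleted.
    supporting_olds = any(pid == plan_id for pid in old_plans.values())
    return authorized and not supporting_olds
```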
OLDs are a core resource in DativeBase. Each OLD (= Online Linguistic Database) is a data set, usually focused on a particular language, but sometimes on a research topic.
Any user may create an OLD via the POST /olds operation. Creation of an OLD automatically entails making the creating user an administrator of the newly-created OLD.
An OLD that is not covered by a plan is not usable. An OLD can be configured to be paid for under a plan during OLD creation or OLD update. In either case, the authenticated user must be a manager of the plan in question (or a superuser of the system) in order for the request to be authorized.
An OLD can be updated or deleted only by its administrators and by superusers.
All users can read the collection of OLDs (index) and get details on a specific OLD (show). Users need to be able to browse the set of OLDs in order for DativeBase to work.
Forms belong to OLDs, as do tags, corpora, files, phonologies, etc. A user's authorization to read or write OLD-specific resources depends on that user's role within the OLD.
An administrator can perform any action. A contributor can perform most write actions and all reads. A viewer can perform all read actions but no writes.
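The role rules for OLD-specific resources can be summarized in a short Python sketch. The function name and the `admin_only` flag are hypothetical. The text says a contributor can perform "most" write actions without listing the exceptions, so the sketch simply lets the caller mark a write as administrator-only.

```python
READ, WRITE = "read", "write"


def is_authorized(action, role=None, is_superuser=False, admin_only=False):
    """Decide whether a request on an OLD-specific resource is allowed.

    role is the user's role within the OLD, or None if the user has no access.
    admin_only marks a write that contributors may not perform (assumption).
    """
    if is_superuser or role == "administrator":
        return True
    if role == "contributor":
        return action == READ or (action == WRITE and not admin_only)
    if role == "viewer":
        return action == READ
    return False
```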
- Signup: person creates a DativeBase user
- Plan Creation: User creates a plan for managing OLDs.
- Grant Access: Administrator of an OLD grants access to a user to an OLD.
- Cover OLD: Administrator of a plan covers an OLD under that plan.
As a prospective user of DativeBase, I can create an account (a user) in DativeBase. As a result of signing up, a new user is created for me in DativeBase.
Implications:
- Anybody on the public internet can create a new account.
- Email verification must be required. Therefore, signup is a two-step process.
- First, the user signs up by entering their PII and desired credentials. DativeBase then emails the user a registration confirmation link containing a key, which expires.
- Then, the user visits the link, which triggers authentication. If the authentication test passes, the user is verified.
Steps to implement:
- All users must have a registration-status attribute. Its default is pending. It can transition from pending to registered.
- A pending user cannot perform any actions except verification. Once verification succeeds, the user becomes registered.
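The registration-status attribute described in these steps behaves like a two-state machine. The following Python sketch is illustrative only; the function names and the action labels are assumptions.

```python
# The only legal transition named in the steps above.
ALLOWED_TRANSITIONS = {("pending", "registered")}


def transition(current, new):
    """Return the new status, or raise if the transition is not allowed."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


def may_perform(status, action):
    """A pending user may only verify; a registered user is not restricted here."""
    return action == "verify" or status == "registered"
```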
As a user of DativeBase, I can create a plan. A plan lets me pay for and manage OLDs. If I have a plan, I can create new OLDs that are covered by that plan, insofar as the entitlements of my plan allow for this. If I have a plan, I can cover existing OLDs with that plan. I can transfer coverage of an OLD from its existing plan to my plan.
There are four basic entities:
- Users
- OLDs
- Plans
- Forms
Users have inherent roles. All users are either regular users or superusers. Superusers have unlimited access to all public APIs.
A user may have access to an OLD or not. In order for a user to have access to an OLD, there must be an active users_olds row linking said user to said OLD. The role value of this row determines the user's level of access to the OLD.
An administrator can perform all actions on an OLD. A contributor can perform
nearly all actions on an OLD. A viewer can only perform read actions on an OLD;
no writes are permitted.
A plan pays for an OLD. Every OLD must be covered by a plan. If an OLD exceeds the entitlements of its plan, then the OLD becomes non-operational. In order to re-enable the OLD, the plan must be upgraded or the OLD must be moved under another, more entitled plan.
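The relationship between plans, entitlements, and OLD usability might be modeled as in the Python sketch below. Everything here is an assumption for illustration: the `Entitlements` class, its fields, and the numbers. The limit of 3 OLDs echoes the provisional free-plan figure mentioned earlier, while the form cap of 1000 is invented, since the real value is still to be decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entitlements:
    max_olds: int
    max_forms_per_old: int


# Hypothetical free-tier limits; the form cap is a placeholder value.
FREE = Entitlements(max_olds=3, max_forms_per_old=1000)


def old_is_operational(plan: Optional[Entitlements], form_count: int,
                       olds_under_plan: int) -> bool:
    """An OLD is usable only if a plan covers it and its limits are respected."""
    if plan is None:
        return False  # every OLD must be covered by a plan
    return (olds_under_plan <= plan.max_olds
            and form_count <= plan.max_forms_per_old)
```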
- Ensure that the commands in the Docker section are working.
- I need to more clearly justify the inserted vs created distinction. Are both of these columns really necessary?
- Add stats infrastructure. See https://www.metricfire.com/blog/monitoring-your-infrastructure-with-statsd-and-graphite/.
- Add specs for database tables.
- Sustainability
- Open Data
- Immutability
DativeBase must be sustainable. That is why it is both open-source and monetizable as a service.
The source code of DativeBase is, and always will be, open-source and free. This means that even if the maintainers and developers of DativeBase change, its inner workings are always available for inspection, adoption, and future development.
Software requires maintenance and non-remunerated maintenance is almost inevitably short-lived. If DativeBase provides value to its users, then those users should be happy to pay a modest fee for its use. If a prospective user lacks the funds, they may reach out and be granted an exemption from the subscription fee.
DativeBase will never hold your data hostage. DativeBase will provide full exports of data to the owners or stewards of that data, in open formats, i.e., formats that do not require proprietary software to be read and manipulated.
DativeBase will provide standard OpenAPI-compliant HTTP REST endpoints for fetching data sets. Datasets will be available in standard, open formats: primarily JSON, .zip archives, and CSV files.
DativeBase will include local-first functionality. This may be a fully-fledged Desktop application or it may be a progressive web app that stores data locally in the browser's local storage. Whatever the case, DativeBase will give users access to the data on their own machines. DativeBase will provide seamless synchronization between local data and shared datasets on the server.
DativeBase will provide immutable data. This means data that can change while preserving its history: all previous states of all data points are retained.
This strategy facilitates synchronization between local datasets and their remote counterparts. However, it also preserves the history and provenance of data, which may itself have scientific utility.
The data in DativeBase is immutable. This means that the data changes yet its history is never lost. The effect of this is that updated or destroyed data can be restored. Another, perhaps more important, consequence is that two versions of a dataset (i.e., an OLD) can diverge and can later be merged (or synchronized).
All immutable entities have their current state stored in traditional database tables. For example, the current state of a form with ID "A" is stored in table forms.
When an entity, such as a form, is deleted, we do not actually drop the row from the database. Instead, we update its destroyed_at value, changing it from NULL to the timestamp of deletion.
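The soft-deletion behavior can be demonstrated with a small, self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL and a simplified forms table; the helper names `soft_delete` and `live_forms` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stands in for the real PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forms ("
    " id TEXT PRIMARY KEY,"
    " transcription TEXT NOT NULL,"
    " destroyed_at TEXT)"
)
conn.execute("INSERT INTO forms (id, transcription) VALUES ('A', 'a')")


def soft_delete(conn, form_id):
    """Mark the form as destroyed instead of dropping its row."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE forms SET destroyed_at = ? WHERE id = ? AND destroyed_at IS NULL",
        (now, form_id),
    )


def live_forms(conn):
    """Return the IDs of forms that have not been destroyed."""
    return conn.execute(
        "SELECT id FROM forms WHERE destroyed_at IS NULL"
    ).fetchall()
```

After `soft_delete(conn, "A")`, the row is still present in the table but no longer appears among the live forms.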
To see the database schema of the OLD server, inspect the top-level file schema.sql. Alternatively, interact with the database directly via PSQL using make db and run commands like \dt and \d+ events.
The histories of all immutable entities are stored in the events table. Every time an entity is created, updated, or deleted, we store an event in this table.
The data in the events table is (and must be) sufficient to fully reconstruct all of the data within the DativeBase instance. That is, we should be able to drop all rows from all other tables and then perfectly reconstruct the data in those tables using only the data in the events table.
The events table is an append-only log. No SQL UPDATE or DELETE operations should ever be run on this table. Only INSERT operations are permitted.
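To make the append-only and reconstruction properties concrete, here is a minimal Python sketch. It is not the DativeBase implementation: the `record` and `replay` names are hypothetical, JSON stands in for the serialization format, and only entities identified by an id are handled.

```python
import json

events = []  # append-only log: rows are only ever appended, never changed


def record(table_name, row):
    """Append an event capturing the full state of the row at this moment."""
    events.append(
        {"table_name": table_name, "row_id": row["id"], "row_data": json.dumps(row)}
    )


def replay(events):
    """Rebuild the current state of every table from the event log alone."""
    tables = {}
    for event in events:  # later events overwrite earlier states of a row
        row = json.loads(event["row_data"])
        tables.setdefault(event["table_name"], {})[event["row_id"]] = row
    return tables
```

Because each event stores the complete state of the row, replaying the log in order yields the latest state of each entity, including its destroyed_at value if it was deleted.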
In order to fully understand the events table, one must first internalize the basic relationship between users, OLDs, and OLD-internal types, prototypically forms. Every user has access to zero or more OLDs. Every OLD contains zero or more forms.
Here is the schema of the events table:
CREATE TABLE public.events (
    id uuid DEFAULT public.uuid_generate_v4() NOT NULL,
    created_at timestamp with time zone DEFAULT now(),
    old_slug text,
    table_name text NOT NULL,
    row_id uuid,
    row_data text NOT NULL,
    CONSTRAINT events_check_old_slug_or_row_id CHECK (((old_slug IS NOT NULL) OR (row_id IS NOT NULL)))
);
Details on the columns of the events table are provided below.
- id: This is the unique identifier and primary key of the event. Its value is a UUID.
- created_at: This is a (UTC) timestamp indicating when the event was created in DativeBase.
- old_slug: This is the slug (unique identifier) of the OLD to which the event applies.
  - Some entities, such as users, are not specific to a single OLD. The events of such non-OLD-specific entities will have a value of NULL in this column.
  - Other entities, such as forms, are specific to a single OLD. The events of such OLD-specific entities will have the slug of the entity's OLD in this column.
  - The OLDs themselves do have a non-null value in the events.old_slug column. This value is the slug value of the OLD itself.
- table_name: This is the name of the table where the entity's current state is held. The table defines the type of the entity. Forms, for example, are stored in the forms table and mutation events on forms have a value of "forms" in the table_name column of the events table.
- row_id: This column holds the unique ID of the entity. Typically, this is the value of the id column in the corresponding entity table, e.g., forms.id or users.id.
  - Since OLDs use slug as their ID, mutation events on OLDs have a NULL value in events.row_id.
- row_data: This column holds a serialized representation of the state of the entity at the created_at date.
  - The data in row_data is serialized using EDN.
  - Example:
    - If a new form is created with transcription "a", an event will be created where row_data contains an EDN-serialized representation of the form with transcription "a".
    - If our form is updated to have transcription "b", an event will be created where row_data contains an EDN-serialized representation of the form with transcription "b".
    - Finally, if our form is deleted, an event will be created where row_data contains an EDN-serialized representation of the form with a destroyed_at value of the timestamp of deletion.
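The rules for populating old_slug and row_id can be sketched in Python. The `make_event` function is hypothetical, JSON is used in place of EDN for row_data, and created_at is omitted for brevity. The final check mirrors the events_check_old_slug_or_row_id constraint.

```python
import json
import uuid


def make_event(table_name, row):
    """Build an event dict following the old_slug / row_id rules."""
    if table_name == "olds":
        # OLDs use their slug as ID, so row_id stays NULL.
        old_slug, row_id = row["slug"], None
    elif "old_slug" in row:
        # OLD-specific entities (e.g., forms) carry the slug of their OLD.
        old_slug, row_id = row["old_slug"], row["id"]
    else:
        # Non-OLD-specific entities (e.g., users) have a NULL old_slug.
        old_slug, row_id = None, row["id"]
    if old_slug is None and row_id is None:
        raise ValueError("either old_slug or row_id must be non-null")
    return {
        "id": str(uuid.uuid4()),
        "old_slug": old_slug,
        "table_name": table_name,
        "row_id": row_id,
        "row_data": json.dumps(row),  # JSON stands in for EDN here
    }
```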
Forms are an example of an immutable and OLD-specific entity type. Forms are stored in the forms table, whose schema is:
CREATE TABLE public.forms (
    id uuid DEFAULT public.uuid_generate_v4() NOT NULL,
    old_slug text NOT NULL,
    transcription text NOT NULL,
    inserted_at timestamp with time zone DEFAULT now() NOT NULL,
    created_at timestamp with time zone DEFAULT now() NOT NULL,
    updated_at timestamp with time zone DEFAULT now() NOT NULL,
    destroyed_at timestamp with time zone,
    created_by uuid NOT NULL
);
Each form belongs to a specific OLD. The forms.old_slug value is the olds.slug value of the OLD to which the form belongs.
The inserted_at and created_at columns are similar in that both are timestamps that default to the time of insertion. However, they are importantly different. The created_at value indicates when the form was created by the user. The created_at value should never change.
The inserted_at value is generally identical to created_at. However, when a changeset (i.e., an ordered set of events) is ingested into the OLD, the inserted_at value will be the time of insertion.
DativeBase is a complete rewrite (in Clojure & ClojureScript) of the existing Dative/OLD suite of linguistic data management tools.
Dative is already 1/3 rewritten in ClojureScript. See DativeReFrame. That project will become a submodule of this one.
The motivation behind this rewrite is twofold. First, DativeBase must be monetizable. Second, DativeBase must be a local-first application. (Third, Python is not as good as Clojure.)
- common: Common code between components: specs, OpenAPI schemata, etc.
- server: HTTP OpenAPI JSON service
  - One set of users managing multiple OLDs, each containing forms.
  - Monetization built in: plans cover the costs of OLDs. Plans have free, subscriber, and supporter tiers. Users manage plans.
- client: HTTP client conveniences for interacting with server. Can be required by desktop, synchronizer, gui, etc.
- gui: Dative ReFrame SPA
  - Uses the API to provide user-friendly access to a user's OLDs.
  - Uses the API to allow manager users to manage OLD plans.
- TODO: desktop: DativeTop: Desktop-native, or Electron-like, desktop app that interacts with local OLDs and allows synchronization.
  - Similar experience to Dative, but as a native app built on JVM CLJ-F (https://github.com/cljfx/cljfx), ClojureDart, Electron with ClojureScript, or other.
- TODO: synchronizer: library for synchronizing follower OLDs with leaders. Can be used by desktop.
- TODO: morphoparser: separate, queue-based service for morphological parser compilation, parsing, serving, etc.
Proof-of-concept feature brief:
Given DativeTopCLJ running on a local machine
And OLDCLJ running as a service on a local machine
And an OLD data set that is synced across DativeTopCLJ and OLDCLJ
When the user disconnects their wifi
Then the user can still read their OLD data set in DativeTopCLJ
Follow these detailed steps to get the server (API) running locally and to confirm that it is working as expected.
Construct the OpenAPI YAML from the OpenAPI EDN source and validate it:
$ make openapi
$ make lint-openapi
No results with a severity of 'error' found!
The first command generates the OpenAPI YAML specification file resources/public/openapi/api.yaml from the Clojure source of truth at dvb.server.http.openapi.spec/api. The second command lints the YAML file using the spectral library.
Start the PostgreSQL database in a container and create the tables:
$ docker compose up -d --build
Run the tests (optional):
$ make tests
Connect to the database via PSQL (optional):
$ make db
The default configuration for the application is in dev-config.edn.
The recommended way to run the server code while developing is from a Clojure-integrated REPL, e.g., Emacs with Cider. See the expressions in the comment block of dvb.server.repl. Executing the following expression in that code block will restart the system after reloading any code changes:
=> (component.repl/reset)
:ok
To serve the application from the command line (i.e., a fresh Java process) with the default config, the following are equivalent:
$ make run
$ clj -X:run
No matter how the app was started up, you may access the API at http://localhost:8080 and the Swagger UI at http://localhost:8080/swagger-ui/dist/index.html.
To serve the application with a different configuration file:
$ clj -X:run :config-path '"/home/joel/apps/dativebaseclj/dev-config-SECRET.edn"'
Create a user with a specified email and password (optional):
$ clj -X:init :password abc :email '"abc@bmail.com"'
{:user {:id #uuid "9af83804-2354-4884-8600-f4699794a468",
        :first_name "Anne",
        :last_name "Boleyn",
        :email "abc@bmail.com",
        :password "HASH"}}
We can also create a new user from the REPL. In the dvb.server.repl ns, search for "Create a new user, so we can login" and define a user while creating it in the database, as shown there.
Current issue: we cannot authenticate API requests because we cannot yet create a user and an API key (machine user). See above.
The following log message is emitted when we attempt an API call with an app ID that is not valid, i.e., does not exist in the DB:
Unable to locate the referenced machine-user. {:x-app-id "7ffb9182-f7f9-4a32-a931-0e9ad303e830"}
This happens when the app ID is not a valid UUID string:
Exception thrown when attempting to query machine user based on X-APP-ID {:x-app-id "def"}
This happens when one has not provided X-API-KEY (or X-APP-ID) in the request, i.e., has not "authorized" in the SwaggerUI interface:
A required API key value was not provided in the request. {:name "X-API-KEY", :in :header}
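The three failure modes above can be mimicked with a small Python sketch of credential checking. This does not reproduce the server's code: the function name, the `machine_users` mapping (app ID to API key), the order of the checks, and the final key-mismatch case are all assumptions. Only the three message strings come from the logs quoted above.

```python
import uuid

MISSING = "A required API key value was not provided in the request."
MALFORMED = "Exception thrown when attempting to query machine user based on X-APP-ID"
UNKNOWN = "Unable to locate the referenced machine-user."
MISMATCH = "invalid-api-key"  # hypothetical; not a message from the source


def check_credentials(headers, machine_users):
    """Return None if the credentials are accepted, else a failure message."""
    api_key = headers.get("X-API-KEY")
    app_id = headers.get("X-APP-ID")
    if api_key is None or app_id is None:
        return MISSING
    try:
        uuid.UUID(app_id)  # the app ID must be a valid UUID string
    except ValueError:
        return MALFORMED
    if app_id not in machine_users:
        return UNKNOWN
    if machine_users[app_id] != api_key:
        return MISMATCH
    return None
```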
If you have DativeBase running locally, you can interact with its HTTP API via the SwaggerUI at http://localhost:8080/swagger-ui/dist/index.html.
First, you must ensure that you have a valid user in the database and that you have identified an API key and ID for that server.
Build a docker image for DativeBase:
$ docker build -t dativebase .
Run DativeBase in a docker container:
$ docker run -it --rm --name my-running-dativebase dativebase
Note that the last command above currently fails because the DativeBase server is unable to make a connection to PostgreSQL at localhost:5432. TODO
The code under src/dvb/server corresponds to the Online Linguistic Database (OLD) of the original Python Dative system.
A major sub-component of the server is an HTTP REST API that conforms to the OpenAPI spec.
This project is written in Clojure. This is a rewrite of a previous project of the same name, written in Python. See TODO. When it is important to distinguish between the two projects, this one may be referred to as "OLD-CLJ".
To serve the OLD and a Swagger UI for interacting with it:
$ lein run
Now visit the Swagger UI at:
http://localhost:8080/swagger-ui/dist/index.html
Click the "Authorize" button and enter the API key "olddative".
Now click "GET /api/v1/forms", then "Try it out", then "Execute". The Swagger UI will make a request to the OLD and will receive a mock response.
To create a database migration, first create a new migration file under migrator/sql with:
$ ./scripts/create-migration.sh replace_me_with_migration_name
Then rebuild the docker images and bring up the containers in order to trigger the Flyway container migrator into creating the database schema in the postgres container:
$ docker compose up -d --build --force-recreate
Verify that the migrator exited successfully, with either of the following:
$ docker compose logs -f migrator
$ docker compose ps
Finally, write the schema to schema.sql so that the revised schema (post migration application) can be checked into version control:
$ make schema.sql
If the above works, you should see changes in the schema.sql file that reflect your migration.
In order to transition from Dative(/OLD) to DativeBase, we need to be able to ingest the OLD data into the DativeBase schema.
Keeping it simple to start, imagine we can shut down all external mutation to a given OLD. How would we migrate it?
TODO. Return here