Miguel Guimarães edited this page Jan 26, 2021 · 21 revisions

How it works

Database preservation toolkit converts from a source database format to a destination database format. Either format may be a database management system or a preservation format.

To retrieve from a source, the application uses an import module.

To write to a destination, the application uses an export module.

To perform any intermediate actions, the application may use one or more filter modules.

The conversion functionality is provided by the pair of an import module and an export module. Different modules can be combined and configured to convert between database formats.

General usage

The command-line application takes a series of arguments that can be provided in any order. These define the application's behavior.

java [properties] -jar dbptk-app-x.y.z.jar migrate <importModule> [import module options] <exportModule> [export module options] [<filterModule(s)> [filter module options]]

How to specify the parameters

The general usage command is a template and cannot be used as is. Here is a list of the modifications that must be carried out:

  • java is the Java command; its full path may also be used
  • [properties] may be omitted or replaced with special configurations that influence the conversion (more details)
  • -jar dbptk-app-x.y.z.jar tells java to execute the dbptk-app-x.y.z.jar file (the file name must be adjusted to match the one you have)
  • <importModule> should be replaced with the import module specification, e.g. -i mysql or --import=postgresql
  • <exportModule> should be replaced with the export module specification, e.g. -e mysql or --export=postgresql
  • <filterModule(s)> should be replaced with a list of filter module specifications separated by ',' with no spaces, e.g. -f external-lobs or --filter=external-lobs,external-lobs
  • [import module options] should be replaced with parameters to specify the behavior of the import module, e.g. --import-username=username --import-password="p4ssw0rd" (to specify source database username and password)
  • [export module options] should be replaced with parameters to specify the behavior of the export module, e.g. --export-file=filename.siard --export-compress --export-pretty-xml (to specify the SIARD-2 export module behavior)
  • [filter module options] should be replaced with parameters to specify the behavior of the filter module(s). As there can be multiple filters declared, these parameters should contain the index of the filter they refer to in the list (even if this list is composed of only one element), e.g. --filter1-dir /home/user/ --filter1-disable-print-header (to specify the inventory filter module behavior)
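
Putting the items above together, a MySQL to SIARD-2 conversion with the inventory filter might be assembled as in the sketch below. The jar name, host, database name and credentials are placeholder assumptions, and the script only prints the command so that it can be reviewed before it is run.

```shell
# Placeholder values (assumptions): adjust to your jar file and database.
DBPTK_JAR="dbptk-app-x.y.z.jar"

# Collect the command in the positional parameters, one argument per item.
set -- java -jar "$DBPTK_JAR" migrate \
  --import=mysql \
  --import-hostname=localhost \
  --import-database=mydatabase \
  --import-username=username \
  --import-password=p4ssw0rd \
  --export=siard-2 \
  --export-file=filename.siard \
  --export-compress \
  --export-pretty-xml \
  --filter=inventory \
  --filter1-dir /home/user/ \
  --filter1-disable-print-header

# Print the command for review; replace this line with "$@" to run it.
printf '%s\n' "$*"
```

The filter options carry the index 1 because inventory is the first (and only) entry in the filter list.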

Short/long parameter format

Parameters have two interchangeable formats: a long format for readability (e.g. --import-hostname=localhost) and a short format that is faster to type (e.g. -ih localhost). The two differ in the length of the parameter name and in the number of dashes used (one for the short format, two for the long format). Either a space character or an equals sign may be used to separate a parameter from its value.
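
As an illustration, the two commands built below use the long and the short names listed in the parameter tables for the mysql import module and the siard-2 export module, so they should describe the same conversion. The jar name, database name and credentials are placeholder assumptions, and nothing is executed.

```shell
# Placeholder values (assumptions): jar name, database name and credentials.
DBPTK_JAR="dbptk-app-x.y.z.jar"

# Long format: two dashes and the full parameter name.
set -- java -jar "$DBPTK_JAR" migrate \
  --import=mysql --import-hostname=localhost --import-database=mydatabase \
  --import-username=username --import-password=p4ssw0rd \
  --export=siard-2 --export-file=filename.siard
LONG_FORM="$*"

# Short format: one dash and the abbreviated parameter name.
set -- java -jar "$DBPTK_JAR" migrate \
  -i mysql -ih localhost -idb mydatabase \
  -iu username -ip p4ssw0rd \
  -e siard-2 -ef filename.siard
SHORT_FORM="$*"

# Print both commands for comparison; neither is executed here.
printf '%s\n%s\n' "$LONG_FORM" "$SHORT_FORM"
```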

Parameters

Available import modules, for the [import module options] part

Specify the import module with: -i <module>, --import=module

Import module: jdbc

-id,  --import-driver=value      (required) the name of the JDBC driver class. For more info about this refer to the website or the README file
-ic,  --import-connection=value  (required) the connection URL to use in the connection

Note: In order to use this module you need a JDBC driver. Please refer to this documentation on how to import your own driver.
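
A minimal sketch of the jdbc import module options, assuming a PostgreSQL source reached through the PostgreSQL JDBC driver (class org.postgresql.Driver, URL of the form jdbc:postgresql://host:port/database). The host, database name and credentials are placeholders, and the driver itself still has to be made available as described in the documentation mentioned in the note.

```shell
# Assumptions: jar name, PostgreSQL as the source, and its driver class and URL.
DBPTK_JAR="dbptk-app-x.y.z.jar"
JDBC_DRIVER="org.postgresql.Driver"
JDBC_URL="jdbc:postgresql://localhost:5432/mydatabase?user=username&password=p4ssw0rd"

# Quote the connection URL: characters such as "&" and "?" are special to the shell.
set -- java -jar "$DBPTK_JAR" migrate \
  --import=jdbc \
  "--import-driver=$JDBC_DRIVER" \
  "--import-connection=$JDBC_URL" \
  --export=siard-2 \
  --export-file=filename.siard

# Print the command for review; replace this line with "$@" to run it.
printf '%s\n' "$*"
```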

Import module: microsoft-access

-if,  --import-file=value      (required) path to the Microsoft Access file
-ip,  --import-password=value  (optional) password to the Microsoft Access file

Import module: microsoft-sql-server

-is,   --import-server-name=value     (required) the name (host name) of the server
-idb,  --import-database=value        (required) the name of the database we'll be accessing
-iu,   --import-username=value        (required) the name of the user to use in the connection
-ip,   --import-password=value        (required) the password of the user to use in the connection
-il,   --import-use-integrated-login  (optional) use Windows login; by default the SQL Server login is used
-ide,  --import-disable-encryption    (optional) use to turn off encryption in the connection
-iin,  --import-instance-name=value   (optional) the name of the instance
-ipn,  --import-port-number=value     (optional) the port number of the server instance, default is 1433

Import module: mysql

-ih,   --import-hostname=value      (required) the hostname of the MySQL server
-idb,  --import-database=value      (required) the name of the MySQL database
-iu,   --import-username=value      (required) the name of the user to use in the connection
-ip,   --import-password=value      (required) the password of the user to use in the connection
-ipn,  --import-port-number=value   (optional) the port on which the MySQL server is listening, default is 3306
-ide,  --import-disable-encryption  (optional) use to turn off encryption in the connection

Import module: oracle

-is,   --import-server-name=value  (required) the name (or IP address) of the Oracle server
-idb,  --import-database=value     (required) the name of the database to use in the connection
-iu,   --import-username=value     (required) the name of the user to use in the connection
-ip,   --import-password=value     (required) the password of the user to use in the connection
-ipn,  --import-port-number=value  (required) the port on which the Oracle server is listening, default is 1521
-ial,  --import-accept-license     (optional) declare that you accept OTN License Agreement, which is necessary to use this module

Import module: postgresql

-ih,   --import-hostname=value      (required) the name of the PostgreSQL server host (e.g. localhost)
-idb,  --import-database=value      (required) the name of the database to connect to
-iu,   --import-username=value      (required) the name of the user to use in the connection
-ip,   --import-password=value      (required) the password of the user to use in the connection
-ide,  --import-disable-encryption  (optional) use to turn off encryption in the connection
-ipn,  --import-port-number=value   (optional) the port on which the PostgreSQL server is listening, default is 5432

Import module: sybase

-ih,   --import-hostname=value      (required) the name (host name) of the server
-idb,  --import-database=value      (required) the name of the database to use in the connection
-iu,   --import-username=value      (required) the name of the user to use in the connection
-ip,   --import-password=value      (required) the password of the user to use in the connection
-ide,  --import-disable-encryption  (optional) use to turn off encryption in the connection
-ipn,  --import-port-number=value   (optional) the port on which the Sybase server is listening, default is 2638

Note: In order to use this module you need to use the proprietary driver. Please refer to this documentation on how to import your own driver.

Import module: progress-openedge

-ih,   --import-hostname=value      (required) the name (host name) of the server
-idb,  --import-database=value      (required) the name of the database to use in the connection
-iu,   --import-username=value      (required) the name of the user to use in the connection
-ip,   --import-password=value      (required) the password of the user to use in the connection
-ide,  --import-disable-encryption  (optional) use to turn off encryption in the connection
-ipn,  --import-port-number=value   (optional) the port on which the Progress OpenEdge server is listening, default is 20931

Note: In order to use this module you need to use the proprietary driver. Please refer to this documentation on how to import your own driver.

Import module: siard-1

-if,  --import-file=value  (required) Path to SIARD1 archive file

Import module: siard-2

-if,  --import-file=value  (required) Path to SIARD2 archive file

Import module: siard-dk

-if,   --import-folder=value     (required) Path to (the first) SIARDDK archive folder. Archive folder name must match the expression AVID.[A-ZÆØÅ]{2,4}.[1-9][0-9]*.1; any additional parts of the archive (e.g. with suffixes .2, .3, etc.) referenced in the tableIndex.xml will also be processed.
-ias,  --import-as-schema=value  (required) Name of the database schema to use when importing the SIARDDK archive. Suggested values: PostgreSQL:'public', MySQL:'<name of database>', MSSQL:'dbo'
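
Before starting an import, the folder name can be checked against the expression given for --import-folder. The sketch below assumes that the dots in the expression are meant literally and that the whole folder name must match; the helper name and the example paths are hypothetical. Matching the non-ASCII letters Æ, Ø and Å may depend on the locale in use.

```shell
# Expression from the option description, with the dots escaped and the whole
# name anchored (both are assumptions about how the expression is meant).
PATTERN='^AVID\.[A-ZÆØÅ]{2,4}\.[1-9][0-9]*\.1$'

# Hypothetical helper: succeeds when the last path component matches PATTERN.
is_first_siarddk_folder() {
  basename "$1" | grep -Eq "$PATTERN"
}

# Hypothetical archive folders.
for folder in /archives/AVID.SA.18000.1 /archives/AVID.SA.18000.2; do
  if is_first_siarddk_folder "$folder"; then
    echo "$folder: matches"
  else
    echo "$folder: does not match"
  fi
done
```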

Import module: import-config

-if,  --import-file=value        (required) path to the import configuration file to be read by the SIARD export module
-ip,  --import-parameters=value  (required) pairs of parameters to be resolved in the YAML configuration file. To define the pairs use this syntax: key:value;key:value;
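
The key:value;key:value; syntax can be built from a list of pairs, as in the sketch below. The keys, the values and the file name config.yaml are hypothetical; the keys that make sense depend on what the configuration file refers to.

```shell
# Hypothetical keys and values; use the ones your configuration file refers to.
PARAMS=""
for pair in "hostname:localhost" "database:mydatabase"; do
  PARAMS="${PARAMS}${pair};"   # each pair is terminated by ";"
done

# Quote the option: an unquoted ";" would end the shell command early.
set -- --import=import-config \
  --import-file=config.yaml \
  "--import-parameters=$PARAMS"

# Print the resulting options; they are not executed here.
printf '%s\n' "$*"
```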

Available export modules, for the [export module options] part

Specify the export module with: -e <module>, --export=module

Export module: jdbc

-ed,  --export-driver=value      (required) the name of the JDBC driver class. For more info about this refer to the website or the README file
-ec,  --export-connection=value  (required) the connection URL to use in the connection

Export module: microsoft-sql-server

-es,   --export-server-name=value     (required) the name (host name) of the server
-edb,  --export-database=value        (required) the name of the database we'll be accessing
-eu,   --export-username=value        (required) the name of the user to use in the connection
-ep,   --export-password=value        (required) the password of the user to use in the connection
-el,   --export-use-integrated-login  (optional) use Windows login; by default the SQL Server login is used
-ede,  --export-disable-encryption    (optional) use to turn off encryption in the connection
-ein,  --export-instance-name=value   (optional) the name of the instance
-epn,  --export-port-number=value     (optional) the port number of the server instance, default is 1433

Export module: mysql

-eh,   --export-hostname=value     (required) the hostname of the MySQL server
-edb,  --export-database=value     (required) the name of the MySQL database
-eu,   --export-username=value     (required) the name of the user to use in the connection
-ep,   --export-password=value     (required) the password of the user to use in the connection
-epn,  --export-port-number=value  (optional) the port on which the MySQL server is listening

Export module: oracle

-es,   --export-server-name=value    (required) the name (or IP address) of the Oracle server
-edb,  --export-database=value       (required) the name of the database to use in the connection
-eu,   --export-username=value       (required) the name of the user to use in the connection
-ep,   --export-password=value       (required) the password of the user to use in the connection
-epn,  --export-port-number=value    (required) the port on which the Oracle server is listening
-eal,  --export-accept-license       (optional) declare that you accept OTN License Agreement, which is necessary to use this module
-esc,  --export-source-schema=value  (optional) the name of the source schema to export to the Oracle database. A schema with this name must exist in the Oracle database and it must be the default tablespace for the specified user. If omitted, the name of the first schema will be used

Export module: postgresql

-eh,   --export-hostname=value      (required) the name of the PostgreSQL server host (e.g. localhost)
-edb,  --export-database=value      (required) the name of the database to connect to
-eu,   --export-username=value      (required) the name of the user to use in the connection
-ep,   --export-password=value      (required) the password of the user to use in the connection
-ede,  --export-disable-encryption  (optional) use to turn off encryption in the connection
-epn,  --export-port-number=value   (optional) the port on which the PostgreSQL server is listening, default is 5432

Export module: siard-1

-ef,     --export-file=value                         (required) Path to SIARD1 archive file
-ec,     --export-compress                           (optional) use to compress the SIARD1 archive file with deflate method
-ep,     --export-pretty-xml                         (optional) write human-readable XML
-emd,    --export-meta-description[=value]           (optional) SIARD descriptive metadata field: Description of database meaning and content as a whole.
-ema,    --export-meta-archiver[=value]              (optional) SIARD descriptive metadata field: Name of the person who carried out the archiving of the database.
-emac,   --export-meta-archiver-contact[=value]      (optional) SIARD descriptive metadata field: Contact details (telephone, email) of the person who carried out the archiving of the database.
-emdo,   --export-meta-data-owner[=value]            (optional) SIARD descriptive metadata field: Owner of the data in the database. The person or institution that, at the time of archiving, has the right to grant usage rights for the data and is responsible for compliance with legal obligations such as data protection guidelines.
-emdot,  --export-meta-data-origin-timespan[=value]  (optional) SIARD descriptive metadata field: Origination period of the data in the database (approximate indication in text form).
-emcm,   --export-meta-client-machine[=value]        (optional) SIARD descriptive metadata field: DNS name of the (client) computer on which the archiving was carried out.

Export module: siard-2

-ef,     --export-file=value                         (required) Path to SIARD2 archive file
-ec,     --export-compress                           (optional) use to compress the SIARD2 archive file with deflate method
-ep,     --export-pretty-xml                         (optional) write human-readable XML
-eel,    --export-external-lobs                      (optional) Saves any LOBs outside the SIARD file.
-eelpf,  --export-external-lobs-per-folder=value     (optional) The maximum number of files present in an external LOB folder. Default: 1000 files.
-eelfs,  --export-external-lobs-folder-size=value    (optional) Divide LOBs across multiple external folders with (approximately) the specified maximum size (in Megabytes). Default: do not divide.
-emd,    --export-meta-description[=value]           (optional) SIARD descriptive metadata field: Description of database meaning and content as a whole.
-ema,    --export-meta-archiver[=value]              (optional) SIARD descriptive metadata field: Name of the person who carried out the archiving of the database.
-emac,   --export-meta-archiver-contact[=value]      (optional) SIARD descriptive metadata field: Contact details (telephone, email) of the person who carried out the archiving of the database.
-emdo,   --export-meta-data-owner[=value]            (optional) SIARD descriptive metadata field: Owner of the data in the database. The person or institution that, at the time of archiving, has the right to grant usage rights for the data and is responsible for compliance with legal obligations such as data protection guidelines.
-emdot,  --export-meta-data-origin-timespan[=value]  (optional) SIARD descriptive metadata field: Origination period of the data in the database (approximate indication in text form).
-emcm,   --export-meta-client-machine[=value]        (optional) SIARD descriptive metadata field: DNS name of the (client) computer on which the archiving was carried out.
-egml,   --export-gml-directory=value                (optional) directory in which to create .gml files from tables with geometry data.
-ed,     --export-digest                             (optional) The message digest algorithm for the type of integrity information. Default: SHA-256
-efc,    --export-font-case                          (optional) Define the type of font case for the message digest. Supported font cases are: upper case and lower case. Default: lowercase
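
A sketch of a SIARD-2 export that stores LOBs outside the SIARD file and fills in two descriptive metadata fields, using only options from the table above. The PostgreSQL source, the credentials and the metadata texts are placeholder assumptions. Values that contain spaces are quoted so that each reaches the application as a single argument.

```shell
# Placeholder values (assumptions): jar name, source database and metadata text.
DBPTK_JAR="dbptk-app-x.y.z.jar"

set -- java -jar "$DBPTK_JAR" migrate \
  --import=postgresql \
  --import-hostname=localhost \
  --import-database=mydatabase \
  --import-username=username \
  --import-password=p4ssw0rd \
  --export=siard-2 \
  --export-file=filename.siard \
  --export-external-lobs \
  --export-external-lobs-per-folder=1000 \
  "--export-meta-description=Example database" \
  "--export-meta-archiver=Jane Doe"

# Print one argument per line so values containing spaces stay visible.
printf '%s\n' "$@"
```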

Export module: siard-dk

-ef,   --export-folder=value                      (required) Path to SIARDDK archive folder. Archive folder name must match the expression AVID.[A-ZÆØÅ]{2,4}.[1-9][0-9]*.[1-9][0-9]
-eai,  --export-archiveIndex=value                (optional) Path to archiveIndex.xml input file
-eci,  --export-contextDocumentationIndex=value   (optional) Path to contextDocumentationIndex.xml input file
-ecf,  --export-contextDocumentationFolder=value  (optional) Path to contextDocumentation folder which should contain the context documentation for the archive

Export module: import-config

-ef,  --export-file=value  (required) path to the import configuration file

Available filter modules, for the [filter module options] part

Specify the filter module(s) with: -f <module(s)>, --filter=module(s)

Filter module: external-lobs

Filter module: merkle-tree

Filter module: inventory

For the [properties] part

Several properties are available to modify specific conversion behaviour. You can consider them as knobs that can be turned to fine-tune the conversion.

The properties have a format like part1.part2.part3, with multiple lower-case parts separated by dots. All properties have a corresponding environment variable, like PART1_PART2_PART3 (corresponding to the previous example), with the same parts in upper-case and separated by underscores.

Properties are added to the command line like this:

... -Dpart1.part2.part3=value -Danother.property=othervalue ...

Note: on Windows, each property and value pair must be enclosed in double quotes, for example ... "-Dpart1.part2.part3=value" ...

If both the environment variable and the property are set, the property is used.

For simplicity, only the properties will be described, and the environment variables can be derived from those by using uppercased letters and replacing the dots with underscores (as described above).
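
The derivation of the environment variable name can be sketched as a small shell function. The function name property_to_env is hypothetical and not part of the toolkit; it only applies the rule described above.

```shell
# Derive the environment variable name from a property name:
# upper-case every letter, then replace each dot with an underscore.
property_to_env() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_'
}

property_to_env "dbptk.jdbc.fetchsize.default"   # prints DBPTK_JDBC_FETCHSIZE_DEFAULT
property_to_env "dbptk.memory.dir"               # prints DBPTK_MEMORY_DIR
```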

Available properties

Fetch size

Controls the number of rows that are retrieved from the database and stored in memory at once.

  • dbptk.jdbc.fetchsize.default (Integer) - the first fetch size to try (default: 0, which means "use the default value suggested/calculated by the driver")
  • dbptk.jdbc.fetchsize.small (Integer) - the second fetch size to try, in case the first one caused an issue (default: 10)
  • dbptk.jdbc.fetchsize.minimum (Integer) - the last fetch size to try, in case the second one also caused an issue. This is the last try before giving up on fetching information from this table (default: 1)

Setting dbptk.jdbc.fetchsize.default to 1 fetches one row at a time, using minimal memory during the conversion but taking longer to convert the database.
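
For example, the low-memory setting described above could be passed as in the sketch below. The property is placed between java and -jar, where the [properties] part of the general usage command goes. The jar name, modules and credentials are placeholder assumptions, and the command is only printed.

```shell
# Placeholder values (assumptions): jar name, database name and credentials.
DBPTK_JAR="dbptk-app-x.y.z.jar"

# The -D property comes before -jar, in the [properties] position.
set -- java \
  -Ddbptk.jdbc.fetchsize.default=1 \
  -jar "$DBPTK_JAR" migrate \
  --import=mysql \
  --import-hostname=localhost \
  --import-database=mydatabase \
  --import-username=username \
  --import-password=p4ssw0rd \
  --export=siard-2 \
  --export-file=filename.siard

# Print the command for review; replace this line with "$@" to run it.
printf '%s\n' "$*"
```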

For more details check https://github.com/keeps/db-preservation-toolkit/pull/292

Oracle

Controls how much LOB data is prefetched and stored in memory for each row retrieved from the database.

  • dbptk.jdbc.oracle.lobPrefetchSize (Integer) - configures how much of the LOB data is fetched the first time it is requested (default: 4000 bytes)

For more details check https://github.com/keeps/db-preservation-toolkit/issues/437

SSH port range

Controls the open port search range by defining the minimum and maximum value to search for.

  • dbptk.ssh.port.findmin (Integer) - the minimum value to include (default: 1024)
  • dbptk.ssh.port.findmax (Integer) - the maximum value to include (default: 49151)

MapDB options

Controls the directory in which the off-heap file is saved (depending on the size of the SIARD file, this off-heap file can grow substantially).

  • dbptk.memory.dir (String) - the directory path for the off-heap file storage (default: a hidden folder named dbptk under your $HOME directory)

Timezone options

Controls how Java handles timestamp fields. Thanks to @ateras

  • user.timezone=GMT - tells Java not to do any unexpected conversions when handling the timestamp fields