Use GETOPT for the command line scripts #326
pkiraly committed Oct 18, 2023
1 parent 9b1d5a7 commit 4f92edb
Showing 11 changed files with 536 additions and 52 deletions.
49 changes: 29 additions & 20 deletions README.md
@@ -317,7 +317,7 @@ export JAR=target/metadata-qa-marc-0.6.0-jar-with-dependencies.jar

Most of the analyses use the following general parameters

* `-w <type>`, `--schemaType <type>` metadata schema type. The supported types are:
* `MARC21`
* `PICA`
* `UNIMARC` (assessment of UNIMARC records is not yet supported, this
@@ -370,10 +370,10 @@ Most of the analyses use the following general parameters
* `-q`, `--fixAlephseq` sometimes ALEPH export contains '^' characters
instead of spaces in control fields (006, 007, 008). This flag replaces
them with spaces before the validation. It might occur in any input format.
* `-a`, `--fixAlma` sometimes Alma export contains '#' characters instead of
spaces in control fields (006, 007, 008). This flag replaces them with
spaces before the validation. It might occur in any input format.
* `-b`, `--fixKbr` KBR's export contains '#' characters instead of spaces in
control fields (006, 007, 008). This flag replaces them with spaces before
the validation. It might occur in any input format.
* `-f <format>`, `--marcFormat <format>` The input format. Possible values are
@@ -408,7 +408,7 @@ Most of the analyses use the following general parameters
* `STREAM`: reading from a Java data stream. It is not usable if you use the
tool from the command line, only if
you use it with its API.
* `-c <configuration>`, `--allowableRecords <configuration>` if set, criteria
  that allow analysis of records. If a record does not meet the criteria, it
  will be excluded. An individual criterion should be formed as a MarcSpec (for
  MARC21 records) or PicaFilter (for PICA records). Multiple criteria might be
@@ -419,20 +419,25 @@
of which is problematic among multiple scripts, one can apply Base64 encoding.
In this case add `base64:` prefix to the parameters, such as
`base64:"$(echo '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)' | base64 -w 0)"`.
* `-1 <type>`, `--alephseqLineType <type>` the Alephseq line type. The `type` could be
  * `WITH_L`: the records' AlephSeq lines contain an `L ` string
    (e.g. `000000002 008 L 780804s1977^^^^enk||||||b||||001^0|eng||`)
  * `WITHOUT_L`: the records' AlephSeq lines do not contain an `L ` string
    (e.g. `000000002 008 780804s1977^^^^enk||||||b||||001^0|eng||`)
* PICA related parameters
  * `-2 <path>`, `--picaIdField <path>` the record identifier
    subfield of PICA records. Default is `003@$0`.
  * `-u <char>`, `--picaSubfieldSeparator <char>` the PICA subfield separator.
    Default is `$`.
  * `-j <file>`, `--picaSchemaFile <file>` an Avram schema file, which describes
    the structure of PICA records
  * `-k <path>`, `--picaRecordType <path>` the PICA subfield which stores the
    record type information. Default is `002@$0`.
* Parameters for grouping analyses
* `-e <path>`, `--groupBy <path>` group the results by the value of this data
element (e.g. the ILN of libraries holding the item). An example: `--groupBy 001@$0`
where `001@$0` is the subfield containing the comma separated list of library ILN codes.
* `-3 <file>`, `--groupListFile <file>` the file which contains a list of ILN codes
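
The Base64 trick described for `--allowableRecords` above can be exercised end-to-end. A minimal sketch (GNU coreutils `base64` assumed; the filter string is taken from the example above):

```shell
# Encode a PICA filter so that shell quoting cannot mangle it while it is
# handed from script to script, then show how a consumer recovers it.
CRITERIA='002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]"'
ENCODED="base64:$(printf '%s' "$CRITERIA" | base64 -w 0)"

# A consumer strips the "base64:" prefix and decodes:
DECODED=$(printf '%s' "${ENCODED#base64:}" | base64 -d)
```

The point of the prefix is that the receiving side can distinguish an encoded criterion from a plain one before deciding whether to decode.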

The last argument of the commands is a list of files. It might contain any
wildcard the operating system supports ('*', '?', etc.).
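
As a side note, the two `--alephseqLineType` values listed above can usually be told apart mechanically from the first line of an export. A minimal sketch (a hypothetical helper, not part of the toolkit):

```shell
# Guess the Alephseq line type of a file: WITH_L lines have an "L" column
# between the field tag and the field content.
detect_alephseq_line_type() {
  if head -n 1 "$1" | grep -qE '^[0-9]{9} +[0-9A-Za-z]+ +L '; then
    echo WITH_L
  else
    echo WITHOUT_L
  fi
}
```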
@@ -476,26 +481,26 @@ options:

* [general parameters](#general-parameters)
* granularity of the report
* `-S`, `--summary`: creating a summary report instead of record level reports
* `-H`, `--details`: provides record level details of the issues
* output parameters:
* `-G <file>`, `--summaryFileName <file>`: the name of the summary report the
  program produces. The file provides a summary of issues, such as the
  number of instances and the number of records having the particular issue.
* `-F <file>`, `--detailsFileName <file>`: the name of the report the program
  produces. Default is `validation-report.txt`. If you use "stdout", it won't
  create a file, but will write the results to the standard output.
* `-R <format>`, `--format <format>`: format specification of the output. Possible values:
* `text` (default),
* `tab-separated` or `tsv`,
* `comma-separated` or `csv`
* `-W`, `--emptyLargeCollectors`: the output files are created during the
  process and not only at the end of it. It helps in memory management if the
  input is large and has lots of errors; on the other hand, the output file
  will be segmented, which should be handled after the process.
* `-T`, `--collectAllErrors`: collect all errors (useful only for validating
  a small number of records). Default is turned off.
* `-I <types>`, `--ignorableIssueTypes <types>`: comma separated list of issue
types not to collect. The valid values are:
* `undetectableType`: undetectable type
* `invalidLinkage`: invalid linkage
@@ -1006,10 +1011,14 @@ or
options:

* [general parameters](#general-parameters)
* `-R <format>`, `--format <format>`: format specification of the output.
Possible values are:
* `tab-separated` or `tsv`,
* `comma-separated` or `csv`,
* `text` or `txt`
* `json`
* `-V`, `--advanced`: advanced mode (not yet implemented)
* `-P`, `--onlyPackages`: only packages (not yet implemented)

Output files:

113 changes: 111 additions & 2 deletions completeness
@@ -1,4 +1,113 @@
# Calling completeness
. ./common-variables
ME=$(basename $0)

show_usage() { # display help message
cat <<EOF
QA catalogue completeness analysis

usage:
${ME} [options] <files>

options:
-m, --marcVersion <arg> MARC version ('OCLC' or 'DNB')
-h, --help display help
-n, --nolog do not display log messages
-l, --limit <arg> limit the number of records to process
-o, --offset <arg> the first record to process
-i, --id <arg> the MARC identifier (content of 001)
-d, --defaultRecordType <arg> the default record type if the record's type is undetectable
-q, --fixAlephseq fix the known issues of Alephseq format
-a, --fixAlma fix the known issues of Alma format
-b, --fixKbr fix the known issues of KBR format
-p, --alephseq the source is in Alephseq format
-x, --marcxml the source is in MARCXML format
-y, --lineSeparated the source is in line separated MARC format
-t, --outputDir <arg> output directory
-r, --trimId remove spaces from the end of record IDs
-z, --ignorableFields <arg> ignore fields from the analysis
-v, --ignorableRecords <arg> ignore records from the analysis
-f, --marcFormat <arg> MARC format (like 'ISO' or 'MARCXML')
-s, --dataSource <arg> data source (file or stream)
-g, --defaultEncoding <arg> default character encoding
-1, --alephseqLineType <arg> Alephseq line type
-2, --picaIdField <arg> PICA id field
-u, --picaSubfieldSeparator <arg> PICA subfield separator
-j, --picaSchemaFile <arg> Avram PICA schema file
-w, --schemaType <arg> metadata schema type ('MARC21', 'UNIMARC', or 'PICA')
-k, --picaRecordType <arg> the PICA subfield storing the record type
-c, --allowableRecords <arg> allow records for the analysis
-e, --groupBy <arg> group the results by the value of this data element (e.g. the ILN of library)
-3, --groupListFile <arg> the file which contains a list of ILN codes
-R, --format <arg> specify a format
-V, --advanced advanced mode (not yet implemented)
-P, --onlyPackages only packages (not yet implemented)

more info: https://github.com/pkiraly/qa-catalogue#calculating-data-element-completeness

EOF
exit 1
}

if [ $# -eq 0 ]; then
show_usage
fi

SHORT_OPTIONS="m:hnl:o:i:d:qabpxyt:rz:v:f:s:g:1:2:u:j:w:k:c:e:3:R:VP"
LONG_OPTIONS="marcVersion:,help,nolog,limit:,offset:,id:,defaultRecordType:,fixAlephseq,fixAlma,fixKbr,alephseq,marcxml,lineSeparated,outputDir:,trimId,ignorableFields:,ignorableRecords:,marcFormat:,dataSource:,defaultEncoding:,alephseqLineType:,picaIdField:,picaSubfieldSeparator:,picaSchemaFile:,schemaType:,picaRecordType:,allowableRecords:,groupBy:,groupListFile:,format:,advanced,onlyPackages"

GETOPT=$(getopt \
-o ${SHORT_OPTIONS} \
--long ${LONG_OPTIONS} \
-n ${ME} -- "$@")
eval set -- "${GETOPT}"

PARAMS=""
HELP=0
while true ; do
case "$1" in
-m|--marcVersion) PARAMS="$PARAMS --marcVersion $2" ; shift 2 ;;
-h|--help) PARAMS="$PARAMS --help" ; HELP=1; shift ;;
-n|--nolog) PARAMS="$PARAMS --nolog" ; shift ;;
-l|--limit) PARAMS="$PARAMS --limit $2" ; shift 2 ;;
-o|--offset) PARAMS="$PARAMS --offset $2" ; shift 2 ;;
-i|--id) PARAMS="$PARAMS --id $2" ; shift 2 ;;
-d|--defaultRecordType) PARAMS="$PARAMS --defaultRecordType $2" ; shift 2 ;;
-q|--fixAlephseq) PARAMS="$PARAMS --fixAlephseq" ; shift ;;
-a|--fixAlma) PARAMS="$PARAMS --fixAlma" ; shift ;;
-b|--fixKbr) PARAMS="$PARAMS --fixKbr" ; shift ;;
-p|--alephseq) PARAMS="$PARAMS --alephseq" ; shift ;;
-x|--marcxml) PARAMS="$PARAMS --marcxml" ; shift ;;
-y|--lineSeparated) PARAMS="$PARAMS --lineSeparated" ; shift ;;
-t|--outputDir) PARAMS="$PARAMS --outputDir $2" ; shift 2 ;;
-r|--trimId) PARAMS="$PARAMS --trimId" ; shift ;;
-z|--ignorableFields) PARAMS="$PARAMS --ignorableFields $2" ; shift 2 ;;
-v|--ignorableRecords) PARAMS="$PARAMS --ignorableRecords $2" ; shift 2 ;;
-f|--marcFormat) PARAMS="$PARAMS --marcFormat $2" ; shift 2 ;;
-s|--dataSource) PARAMS="$PARAMS --dataSource $2" ; shift 2 ;;
-g|--defaultEncoding) PARAMS="$PARAMS --defaultEncoding $2" ; shift 2 ;;
-1|--alephseqLineType) PARAMS="$PARAMS --alephseqLineType $2" ; shift 2 ;;
-2|--picaIdField) PARAMS="$PARAMS --picaIdField $2" ; shift 2 ;;
-u|--picaSubfieldSeparator) PARAMS="$PARAMS --picaSubfieldSeparator $2" ; shift 2 ;;
-j|--picaSchemaFile) PARAMS="$PARAMS --picaSchemaFile $2" ; shift 2 ;;
-w|--schemaType) PARAMS="$PARAMS --schemaType $2" ; shift 2 ;;
-k|--picaRecordType) PARAMS="$PARAMS --picaRecordType $2" ; shift 2 ;;
-c|--allowableRecords) PARAMS="$PARAMS --allowableRecords $2" ; shift 2 ;;
-e|--groupBy) PARAMS="$PARAMS --groupBy $2" ; shift 2 ;;
-3|--groupListFile) PARAMS="$PARAMS --groupListFile $2" ; shift 2 ;;
-R|--format) PARAMS="$PARAMS --format $2" ; shift 2 ;;
-V|--advanced) PARAMS="$PARAMS --advanced" ; shift ;;
-P|--onlyPackages) PARAMS="$PARAMS --onlyPackages" ; shift ;;
--) shift ; break ;;
*) echo "Internal error!: $1" ; exit 1 ;;
esac
done

if [[ $HELP -eq 1 ]]; then
show_usage
fi

CMD="/usr/bin/java -Xmx2g -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness"

echo $CMD $PARAMS "$@"
$CMD $PARAMS "$@"
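
One caveat of the string-based `PARAMS` accumulation above: when `$CMD $PARAMS` is expanded unquoted, option values containing spaces are split into separate words. A bash-array variant preserves them (a sketch only, not the shipped script; the option list is shortened to two options for brevity):

```shell
# Accumulate parsed options into a bash array instead of a flat string, so
# that values containing spaces survive the hand-off to the java command.
collect_params() {
  local parsed
  parsed=$(getopt -o m:t: --long marcVersion:,outputDir: -n demo -- "$@") || return 1
  eval set -- "$parsed"
  PARAMS=()   # result is left in this global array
  while true; do
    case "$1" in
      -m|--marcVersion) PARAMS+=(--marcVersion "$2") ; shift 2 ;;
      -t|--outputDir)   PARAMS+=(--outputDir "$2")   ; shift 2 ;;
      --) shift ; break ;;
      *) echo "Internal error!: $1" >&2 ; return 1 ;;
    esac
  done
  PARAMS+=("$@")   # remaining arguments are the input files
}

# usage: collect_params --outputDir "/tmp/my output" file.mrc
#        then invoke the tool with "${PARAMS[@]}"
```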
15 changes: 8 additions & 7 deletions index
@@ -57,11 +57,6 @@ if [ $# -eq 0 ]; then
show_usage
fi


DB=""
solrFieldType=mixed
defaultRecordType=BOOKS
@@ -81,21 +76,27 @@ groupBy=""
validationCore=""
outputDir=""
indexWithTokenizedField=""

GETOPT=$(getopt -o b:p:m:ws::xard:hSpv:l:i:g:A:F:f:z:J:B:t:C: \
--long db:,file-path:,file-mask:,no-delete,solrFieldType:,marcxml,alephseq,trimId,defaultRecordType,help,status,purge,marcVersion:,limit:,ignorableRecords:,defaultEncoding:,alephseqLineType:,schemaType:,marcFormat:,ignorableFields:,groupBy:,validationCore:,outputDir:,outputDir,indexWithTokenizedField \
-n ${ME} -- "$@")
eval set -- "$GETOPT"

while true ; do
case "$1" in
-b|--db) DB=$2 ; shift 2;;
-p|--file-path) FILE_PATH=$2 ; shift 2;;
-m|--file-mask) FILE_MASK=$2 ; shift 2;;
-w|--no-delete) DELETE=0 ; shift;;
-s|--solrFieldType) solrFieldType=$2 ; shift 2;;
-d|--defaultRecordType) defaultRecordType=$2 ; shift 2;;
-v|--marcVersion) marcVersion=$2 ; shift 2;;
-l|--limit) limit="--limit $2"; shift 2;;
-i|--ignorableRecords) ignorableRecords="--ignorableRecords $2"; shift 2;;
-x|--marcxml) marcxml="--marcxml" ; shift;;
-a|--alephseq) alephseq="--alephseq" ; shift;;
-r|--trimId) trimId="--trimId" ; shift;;
-g|--defaultEncoding) defaultEncoding="--defaultEncoding $2" ; shift 2;;
-A|--alephseqLineType) alephseqLineType="--alephseqLineType $2" ; shift 2;;
-F|--schemaType) schemaType="--schemaType $2" ; shift 2;;
-f|--marcFormat) marcFormat="--marcFormat $2" ; shift 2;;
6 changes: 6 additions & 0 deletions scripts/cli-generator/README.md
@@ -0,0 +1,6 @@
This directory helps generate the help and getopt parts of the CLI scripts.

Usage:
```bash
php generate <file>
```
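
The actual generator is the PHP script invoked above. The core transformation it performs — deriving getopt(1) short and long option strings from Commons CLI `addOption` calls — can be sketched in shell (an illustrative reimplementation, not the real generator):

```shell
# Read options.addOption("x", "longName", hasArg, "desc"); lines on stdin and
# print the getopt SHORT string on the first output line, LONG on the second.
build_getopt_strings() {
  awk -F'"' '/addOption/ {
    hasArg = ($5 ~ /true/) ? ":" : ""          # third addOption argument
    shortOpts = shortOpts $2 hasArg            # e.g. m:hn
    longOpts  = longOpts (longOpts == "" ? "" : ",") $4 hasArg
  } END { print shortOpts; print longOpts }'
}
```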
32 changes: 32 additions & 0 deletions scripts/cli-generator/completeness.txt
@@ -0,0 +1,32 @@
options.addOption("m", "marcVersion", true, "MARC version ('OCLC' or 'DNB')");
options.addOption("h", "help", false, "display help");
options.addOption("n", "nolog", false, "do not display log messages");
options.addOption("l", "limit", true, "limit the number of records to process");
options.addOption("o", "offset", true, "the first record to process");
options.addOption("i", "id", true, "the MARC identifier (content of 001)");
options.addOption("d", "defaultRecordType", true, "the default record type if the record's type is undetectable");
options.addOption("q", "fixAlephseq", false, "fix the known issues of Alephseq format");
options.addOption("a", "fixAlma", false, "fix the known issues of Alma format");
options.addOption("b", "fixKbr", false, "fix the known issues of KBR format");
options.addOption("p", "alephseq", false, "the source is in Alephseq format");
options.addOption("x", "marcxml", false, "the source is in MARCXML format");
options.addOption("y", "lineSeparated", false, "the source is in line separated MARC format");
options.addOption("t", "outputDir", true, "output directory");
options.addOption("r", "trimId", false, "remove spaces from the end of record IDs");
options.addOption("z", "ignorableFields", true, "ignore fields from the analysis");
options.addOption("v", "ignorableRecords", true, "ignore records from the analysis");
options.addOption("f", "marcFormat", true, "MARC format (like 'ISO' or 'MARCXML')");
options.addOption("s", "dataSource", true, "data source (file or stream)");
options.addOption("g", "defaultEncoding", true, "default character encoding");
options.addOption("1", "alephseqLineType", true, "Alephseq line type");
options.addOption("2", "picaIdField", true, "PICA id field");
options.addOption("u", "picaSubfieldSeparator", true, "PICA subfield separator");
options.addOption("j", "picaSchemaFile", true, "Avram PICA schema file");
options.addOption("w", "schemaType", true, "metadata schema type ('MARC21', 'UNIMARC', or 'PICA')");
options.addOption("k", "picaRecordType", true, "the PICA subfield storing the record type");
options.addOption("c", "allowableRecords", true, "allow records for the analysis");
options.addOption("e", "groupBy", true, "group the results by the value of this data element (e.g. the ILN of library)");
options.addOption("3", "groupListFile", true, "the file which contains a list of ILN codes");
options.addOption("R", "format", true, "specify a format");
options.addOption("V", "advanced", false, "advanced mode (not yet implemented)");
options.addOption("P", "onlyPackages", false, "only packages (not yet implemented)");
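
getopt requires each short option letter within a script to be unique, which is exactly what the single-letter remapping in this commit establishes. A quick check over lines like the ones above (a hypothetical helper, not part of the repository):

```shell
# Print any short option letter that appears more than once in a stream of
# Commons CLI addOption lines; empty output means all letters are unique.
check_unique_short_opts() {
  awk -F'"' '/addOption/ { print $2 }' | sort | uniq -d
}
```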