Use GETOPT for the command line scripts #326
pkiraly committed Oct 18, 2023
1 parent 9b1d5a7 commit 4f92edb
Showing 11 changed files with 536 additions and 52 deletions.
49 changes: 29 additions & 20 deletions README.md
@@ -317,7 +317,7 @@ export JAR=target/metadata-qa-marc-0.6.0-jar-with-dependencies.jar

Most of the analyses use the following general parameters

* `-w <type>`, `--schemaType <type>` metadata schema type. The supported types are:
* `MARC21`
* `PICA`
* `UNIMARC` (assessment of UNIMARC records is not yet supported, this
@@ -370,10 +370,10 @@ Most of the analyses use the following general parameters
* `-q`, `--fixAlephseq` sometimes ALEPH export contains '^' characters
instead of spaces in control fields (006, 007, 008). This flag replaces
them with spaces before the validation. It might occur in any input format.
* `-a`, `--fixAlma` sometimes Alma export contains '#' characters instead of
spaces in control fields (006, 007, 008). This flag replaces them with
spaces before the validation. It might occur in any input format.
* `-b`, `--fixKbr` KBR's export contains '#' characters instead of spaces in
control fields (006, 007, 008). This flag replaces them with spaces before
the validation. It might occur in any input format.
* `-f <format>`, `--marcFormat <format>` The input format. Possible values are
@@ -408,7 +408,7 @@ Most of the analyses use the following general parameters
* `STREAM`: reading from a Java data stream. It is not usable if you use the
tool from the command line, only if
you use it with its API.
* `-c <configuration>`, `--allowableRecords <configuration>` if set, criteria
  that allow analysis of records. If a record does not meet the criteria, it
  will be excluded. An individual criterion should be formed as a MarcSpec (for
  MARC21 records) or PicaFilter (for PICA records). Multiple criteria might be
@@ -419,20 +419,25 @@
of which is problematic among multiple scripts, one can apply Base64 encoding.
In this case add `base64:` prefix to the parameters, such as
`base64:"$(echo '002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]" && (002@.0 !~ "^.v" || 021A.a?)' | base64 -w 0)"`.
* `-1 <type>`, `--alephseqLineType <type>` the Alephseq line type. The `type` could be
  * `WITH_L`: the records' AlephSeq lines contain an `L ` string
    (e.g. `000000002 008 L 780804s1977^^^^enk||||||b||||001^0|eng||`)
  * `WITHOUT_L`: the records' AlephSeq lines do not contain an `L ` string
    (e.g. `000000002 008 780804s1977^^^^enk||||||b||||001^0|eng||`)
* PICA related parameters
  * `-2 <path>`, `--picaIdField <path>` the record identifier
    subfield of PICA records. Default is `003@$0`.
  * `-u <char>`, `--picaSubfieldSeparator <char>` the PICA subfield separator.
    Default is `$`.
  * `-j <file>`, `--picaSchemaFile <file>` an Avram schema file, which describes
    the structure of PICA records
  * `-k <path>`, `--picaRecordType <path>` the PICA subfield which stores the
    record type information. Default is `002@$0`.
* Parameters for grouping analyses
* `-e <path>`, `--groupBy <path>` group the results by the value of this data
element (e.g. the ILN of libraries holding the item). An example: `--groupBy 001@$0`
where `001@$0` is the subfield containing the comma separated list of library ILN codes.
* `-3 <file>`, `--groupListFile <file>` the file which contains a list of ILN codes
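
The Base64 trick described for `--allowableRecords` above can be exercised end-to-end. A minimal sketch (GNU coreutils `base64` assumed; the filter string is taken from the example above):

```shell
# Encode a PICA filter so that shell quoting cannot mangle it while it is
# handed from script to script, then show how a consumer recovers it.
CRITERIA='002@.0 !~ "^L" && 002@.0 !~ "^..[iktN]"'
ENCODED="base64:$(printf '%s' "$CRITERIA" | base64 -w 0)"

# A consumer strips the "base64:" prefix and decodes:
DECODED=$(printf '%s' "${ENCODED#base64:}" | base64 -d)
```

The point of the prefix is that the receiving side can distinguish an encoded criterion from a plain one before deciding whether to decode.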

The last argument of the commands is a list of files. It might contain any
wildcard the operating system supports ('*', '?', etc.).
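
As a side note, the two `--alephseqLineType` values listed above can usually be told apart mechanically from the first line of an export. A minimal sketch (a hypothetical helper, not part of the toolkit):

```shell
# Guess the Alephseq line type of a file: WITH_L lines have an "L" column
# between the field tag and the field content.
detect_alephseq_line_type() {
  if head -n 1 "$1" | grep -qE '^[0-9]{9} +[0-9A-Za-z]+ +L '; then
    echo WITH_L
  else
    echo WITHOUT_L
  fi
}
```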
@@ -476,26 +481,26 @@ options:

* [general parameters](#general-parameters)
* granularity of the report
* `-S`, `--summary`: creating a summary report instead of record level reports
* `-H`, `--details`: provides record level details of the issues
* output parameters:
* `-G <file>`, `--summaryFileName <file>`: the name of the summary report the
  program produces. The file provides a summary of issues, such as the
  number of instances and the number of records having the particular issue.
* `-F <file>`, `--detailsFileName <file>`: the name of the report the program
  produces. Default is `validation-report.txt`. If you use "stdout", it won't
  create a file, but will write the results to the standard output.
* `-R <format>`, `--format <format>`: format specification of the output. Possible values:
* `text` (default),
* `tab-separated` or `tsv`,
* `comma-separated` or `csv`
* `-W`, `--emptyLargeCollectors`: the output files are created during the
  process and not only at the end of it. It helps in memory management if the
  input is large and has lots of errors; on the other hand, the output file
  will be segmented, which should be handled after the process.
* `-T`, `--collectAllErrors`: collect all errors (useful only for validating
  a small number of records). Default is turned off.
* `-I <types>`, `--ignorableIssueTypes <types>`: comma separated list of issue
types not to collect. The valid values are:
* `undetectableType`: undetectable type
* `invalidLinkage`: invalid linkage
@@ -1006,10 +1011,14 @@ or
options:

* [general parameters](#general-parameters)
* `-R <format>`, `--format <format>`: format specification of the output.
Possible values are:
* `tab-separated` or `tsv`,
* `comma-separated` or `csv`,
* `text` or `txt`
* `json`
* `-V`, `--advanced`: advanced mode (not yet implemented)
* `-P`, `--onlyPackages`: only packages (not yet implemented)

Output files:

113 changes: 111 additions & 2 deletions completeness
@@ -1,4 +1,113 @@
# Calling completeness
. ./common-variables
ME=$(basename $0)

show_usage() { # display help message
cat <<EOF
QA catalogue completeness analysis

usage:
${ME} [options] <files>

options:
-m, --marcVersion <arg> MARC version ('OCLC' or 'DNB')
-h, --help display help
-n, --nolog do not display log messages
-l, --limit <arg> limit the number of records to process
-o, --offset <arg> the first record to process
-i, --id <arg> the MARC identifier (content of 001)
-d, --defaultRecordType <arg> the default record type if the record's type is undetectable
-q, --fixAlephseq fix the known issues of Alephseq format
-a, --fixAlma fix the known issues of Alma format
-b, --fixKbr fix the known issues of KBR format
-p, --alephseq the source is in Alephseq format
-x, --marcxml the source is in MARCXML format
-y, --lineSeparated the source is in line separated MARC format
-t, --outputDir <arg> output directory
-r, --trimId remove spaces from the end of record IDs
-z, --ignorableFields <arg> ignore fields from the analysis
-v, --ignorableRecords <arg> ignore records from the analysis
-f, --marcFormat <arg> MARC format (like 'ISO' or 'MARCXML')
-s, --dataSource <arg> data source (file or stream)
-g, --defaultEncoding <arg> default character encoding
-1, --alephseqLineType <arg> Alephseq line type
-2, --picaIdField <arg> PICA id field
-u, --picaSubfieldSeparator <arg> PICA subfield separator
-j, --picaSchemaFile <arg> Avram PICA schema file
-w, --schemaType <arg> metadata schema type ('MARC21', 'UNIMARC', or 'PICA')
-k, --picaRecordType <arg> the PICA subfield storing the record type
-c, --allowableRecords <arg> allow records for the analysis
-e, --groupBy <arg> group the results by the value of this data element (e.g. the ILN of library)
-3, --groupListFile <arg> the file which contains a list of ILN codes
-R, --format <arg> specify a format
-V, --advanced advanced mode (not yet implemented)
-P, --onlyPackages only packages (not yet implemented)

more info: https://github.com/pkiraly/qa-catalogue#calculating-data-element-completeness

EOF
exit 1
}

if [ $# -eq 0 ]; then
show_usage
fi

SHORT_OPTIONS="m:hnl:o:i:d:qabpxyt:rz:v:f:s:g:1:2:u:j:w:k:c:e:3:R:VP"
LONG_OPTIONS="marcVersion:,help,nolog,limit:,offset:,id:,defaultRecordType:,fixAlephseq,fixAlma,fixKbr,alephseq,marcxml,lineSeparated,outputDir:,trimId,ignorableFields:,ignorableRecords:,marcFormat:,dataSource:,defaultEncoding:,alephseqLineType:,picaIdField:,picaSubfieldSeparator:,picaSchemaFile:,schemaType:,picaRecordType:,allowableRecords:,groupBy:,groupListFile:,format:,advanced,onlyPackages"

GETOPT=$(getopt \
-o ${SHORT_OPTIONS} \
--long ${LONG_OPTIONS} \
-n ${ME} -- "$@")
eval set -- "${GETOPT}"

PARAMS=""
HELP=0
while true ; do
case "$1" in
-m|--marcVersion) PARAMS="$PARAMS --marcVersion $2" ; shift 2 ;;
-h|--help) PARAMS="$PARAMS --help" ; HELP=1; shift ;;
-n|--nolog) PARAMS="$PARAMS --nolog" ; shift ;;
-l|--limit) PARAMS="$PARAMS --limit $2" ; shift 2 ;;
-o|--offset) PARAMS="$PARAMS --offset $2" ; shift 2 ;;
-i|--id) PARAMS="$PARAMS --id $2" ; shift 2 ;;
-d|--defaultRecordType) PARAMS="$PARAMS --defaultRecordType $2" ; shift 2 ;;
-q|--fixAlephseq) PARAMS="$PARAMS --fixAlephseq" ; shift ;;
-a|--fixAlma) PARAMS="$PARAMS --fixAlma" ; shift ;;
-b|--fixKbr) PARAMS="$PARAMS --fixKbr" ; shift ;;
-p|--alephseq) PARAMS="$PARAMS --alephseq" ; shift ;;
-x|--marcxml) PARAMS="$PARAMS --marcxml" ; shift ;;
-y|--lineSeparated) PARAMS="$PARAMS --lineSeparated" ; shift ;;
-t|--outputDir) PARAMS="$PARAMS --outputDir $2" ; shift 2 ;;
-r|--trimId) PARAMS="$PARAMS --trimId" ; shift ;;
-z|--ignorableFields) PARAMS="$PARAMS --ignorableFields $2" ; shift 2 ;;
-v|--ignorableRecords) PARAMS="$PARAMS --ignorableRecords $2" ; shift 2 ;;
-f|--marcFormat) PARAMS="$PARAMS --marcFormat $2" ; shift 2 ;;
-s|--dataSource) PARAMS="$PARAMS --dataSource $2" ; shift 2 ;;
-g|--defaultEncoding) PARAMS="$PARAMS --defaultEncoding $2" ; shift 2 ;;
-1|--alephseqLineType) PARAMS="$PARAMS --alephseqLineType $2" ; shift 2 ;;
-2|--picaIdField) PARAMS="$PARAMS --picaIdField $2" ; shift 2 ;;
-u|--picaSubfieldSeparator) PARAMS="$PARAMS --picaSubfieldSeparator $2" ; shift 2 ;;
-j|--picaSchemaFile) PARAMS="$PARAMS --picaSchemaFile $2" ; shift 2 ;;
-w|--schemaType) PARAMS="$PARAMS --schemaType $2" ; shift 2 ;;
-k|--picaRecordType) PARAMS="$PARAMS --picaRecordType $2" ; shift 2 ;;
-c|--allowableRecords) PARAMS="$PARAMS --allowableRecords $2" ; shift 2 ;;
-e|--groupBy) PARAMS="$PARAMS --groupBy $2" ; shift 2 ;;
-3|--groupListFile) PARAMS="$PARAMS --groupListFile $2" ; shift 2 ;;
-R|--format) PARAMS="$PARAMS --format $2" ; shift 2 ;;
-V|--advanced) PARAMS="$PARAMS --advanced" ; shift ;;
-P|--onlyPackages) PARAMS="$PARAMS --onlyPackages" ; shift ;;
--) shift ; break ;;
*) echo "Internal error!: $1" ; exit 1 ;;
esac
done

if [[ $HELP -eq 1 ]]; then
show_usage
fi

CMD="/usr/bin/java -Xmx2g -cp $JAR de.gwdg.metadataqa.marc.cli.Completeness"

echo $CMD $PARAMS "$@"
$CMD $PARAMS "$@"
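
One caveat of the string-based `PARAMS` accumulation above: when `$CMD $PARAMS` is expanded unquoted, option values containing spaces are split into separate words. A bash-array variant preserves them (a sketch only, not the shipped script; the option list is shortened to two options for brevity):

```shell
# Accumulate parsed options into a bash array instead of a flat string, so
# that values containing spaces survive the hand-off to the java command.
collect_params() {
  local parsed
  parsed=$(getopt -o m:t: --long marcVersion:,outputDir: -n demo -- "$@") || return 1
  eval set -- "$parsed"
  PARAMS=()   # result is left in this global array
  while true; do
    case "$1" in
      -m|--marcVersion) PARAMS+=(--marcVersion "$2") ; shift 2 ;;
      -t|--outputDir)   PARAMS+=(--outputDir "$2")   ; shift 2 ;;
      --) shift ; break ;;
      *) echo "Internal error!: $1" >&2 ; return 1 ;;
    esac
  done
  PARAMS+=("$@")   # remaining arguments are the input files
}

# usage: collect_params --outputDir "/tmp/my output" file.mrc
#        then invoke the tool with "${PARAMS[@]}"
```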
15 changes: 8 additions & 7 deletions index
@@ -57,11 +57,6 @@ if [ $# -eq 0 ]; then
show_usage
fi


DB=""
solrFieldType=mixed
defaultRecordType=BOOKS
@@ -81,21 +76,27 @@ groupBy=""
validationCore=""
outputDir=""
indexWithTokenizedField=""

GETOPT=$(getopt -o b:p:m:ws::xard:hSpv:l:i:g:A:F:f:z:J:B:t:C: \
--long db:,file-path:,file-mask:,no-delete,solrFieldType:,marcxml,alephseq,trimId,defaultRecordType,help,status,purge,marcVersion:,limit:,ignorableRecords:,defaultEncoding:,alephseqLineType:,schemaType:,marcFormat:,ignorableFields:,groupBy:,validationCore:,outputDir:,outputDir,indexWithTokenizedField \
-n ${ME} -- "$@")
eval set -- "$GETOPT"

while true ; do
case "$1" in
-b|--db) DB=$2 ; shift 2;;
-p|--file-path) FILE_PATH=$2 ; shift 2;;
-m|--file-mask) FILE_MASK=$2 ; shift 2;;
-w|--no-delete) DELETE=0 ; shift;;
-s|--solrFieldType) solrFieldType=$2 ; shift 2;;
-d|--defaultRecordType) defaultRecordType=$2 ; shift 2;;
-v|--marcVersion) marcVersion=$2 ; shift 2;;
-l|--limit) limit="--limit $2"; shift 2;;
-i|--ignorableRecords) ignorableRecords="--ignorableRecords $2"; shift 2;;
-x|--marcxml) marcxml="--marcxml" ; shift;;
-a|--alephseq) alephseq="--alephseq" ; shift;;
-r|--trimId) trimId="--trimId" ; shift;;
-g|--defaultEncoding) defaultEncoding="--defaultEncoding $2" ; shift 2;;
-A|--alephseqLineType) alephseqLineType="--alephseqLineType $2" ; shift 2;;
-F|--schemaType) schemaType="--schemaType $2" ; shift 2;;
-f|--marcFormat) marcFormat="--marcFormat $2" ; shift 2;;
6 changes: 6 additions & 0 deletions scripts/cli-generator/README.md
@@ -0,0 +1,6 @@
This directory helps generate the help and getopt parts of the CLI scripts.

Usage:
```bash
php generate <file>
```
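
The actual generator is the PHP script invoked above. The core transformation it performs — deriving getopt(1) short and long option strings from Commons CLI `addOption` calls — can be sketched in shell (an illustrative reimplementation, not the real generator):

```shell
# Read options.addOption("x", "longName", hasArg, "desc"); lines on stdin and
# print the getopt SHORT string on the first output line, LONG on the second.
build_getopt_strings() {
  awk -F'"' '/addOption/ {
    hasArg = ($5 ~ /true/) ? ":" : ""          # third addOption argument
    shortOpts = shortOpts $2 hasArg            # e.g. m:hn
    longOpts  = longOpts (longOpts == "" ? "" : ",") $4 hasArg
  } END { print shortOpts; print longOpts }'
}
```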
32 changes: 32 additions & 0 deletions scripts/cli-generator/completeness.txt
@@ -0,0 +1,32 @@
options.addOption("m", "marcVersion", true, "MARC version ('OCLC' or 'DNB')");
options.addOption("h", "help", false, "display help");
options.addOption("n", "nolog", false, "do not display log messages");
options.addOption("l", "limit", true, "limit the number of records to process");
options.addOption("o", "offset", true, "the first record to process");
options.addOption("i", "id", true, "the MARC identifier (content of 001)");
options.addOption("d", "defaultRecordType", true, "the default record type if the record's type is undetectable");
options.addOption("q", "fixAlephseq", false, "fix the known issues of Alephseq format");
options.addOption("a", "fixAlma", false, "fix the known issues of Alma format");
options.addOption("b", "fixKbr", false, "fix the known issues of KBR format");
options.addOption("p", "alephseq", false, "the source is in Alephseq format");
options.addOption("x", "marcxml", false, "the source is in MARCXML format");
options.addOption("y", "lineSeparated", false, "the source is in line separated MARC format");
options.addOption("t", "outputDir", true, "output directory");
options.addOption("r", "trimId", false, "remove spaces from the end of record IDs");
options.addOption("z", "ignorableFields", true, "ignore fields from the analysis");
options.addOption("v", "ignorableRecords", true, "ignore records from the analysis");
options.addOption("f", "marcFormat", true, "MARC format (like 'ISO' or 'MARCXML')");
options.addOption("s", "dataSource", true, "data source (file or stream)");
options.addOption("g", "defaultEncoding", true, "default character encoding");
options.addOption("1", "alephseqLineType", true, "Alephseq line type");
options.addOption("2", "picaIdField", true, "PICA id field");
options.addOption("u", "picaSubfieldSeparator", true, "PICA subfield separator");
options.addOption("j", "picaSchemaFile", true, "Avram PICA schema file");
options.addOption("w", "schemaType", true, "metadata schema type ('MARC21', 'UNIMARC', or 'PICA')");
options.addOption("k", "picaRecordType", true, "the PICA subfield storing the record type");
options.addOption("c", "allowableRecords", true, "allow records for the analysis");
options.addOption("e", "groupBy", true, "group the results by the value of this data element (e.g. the ILN of library)");
options.addOption("3", "groupListFile", true, "the file which contains a list of ILN codes");
options.addOption("R", "format", true, "specify a format");
options.addOption("V", "advanced", false, "advanced mode (not yet implemented)");
options.addOption("P", "onlyPackages", false, "only packages (not yet implemented)");
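
getopt requires each short option letter within a script to be unique, which is exactly what the single-letter remapping in this commit establishes. A quick check over lines like the ones above (a hypothetical helper, not part of the repository):

```shell
# Print any short option letter that appears more than once in a stream of
# Commons CLI addOption lines; empty output means all letters are unique.
check_unique_short_opts() {
  awk -F'"' '/addOption/ { print $2 }' | sort | uniq -d
}
```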