marc-matcher - a macro for working with MARC data #4

Open
hzafar opened this issue Jul 8, 2021 · 4 comments

hzafar commented Jul 8, 2021

Please enter the bee by submitting code (or links to code) for:

  1. your macro
  2. an example use of your macro
  3. (optional) "before" code that your macro helps to improve

Thank you for your submission!

If your entry is a PR to the syntax parse examples repository, please include a link to the PR.

Macro

This is a very domain-specific macro, developed for a particular bibliographic metadata use-case. The macro definition itself is given below, and the required files containing helper definitions have been attached to this issue.

#lang racket

(require syntax/parse/define
         "marc-matcher-syntax-classes.rkt"
         "marc-matcher-helpers.rkt")

(define-syntax (marc-matcher stx)
  (syntax-parse stx
    [(_ (var:marc-var-defn ...) body:expr ...)
     (define params #'(var.name ...))
     (define regexps #'(var.re ...))
     #`(λ (input [sep "$"])
         (define args (get-subfield-data '#,regexps input sep))
         (apply (λ #,params (begin body ...)) (map simplify-groups args)))]))

This macro aims to make it easier to do regex-like matching over a structured bibliographic data format known as MARC 21. MARC records contain a sequence of fields whose data are string values that look like this:

$aCarroll, Lewis,$d1832-1898,$eauthor.

In each field, individual subfields are separated using a separator character (in this case $); the character immediately following the separator is called the subtag; and the substring up to the next separator or end-of-string is the subfield data. So in the example above, there are three subfields, $a, $d, and $e, whose data are, respectively, "Carroll, Lewis,", "1832-1898,", and "author.".
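
The subfield structure described above can be sketched in a few lines of Racket. This is only an illustration, not the code from the attached helper files: `split-subfields` is a hypothetical name, the `subfield` struct is an assumption whose field names simply mirror the `subfield-subtag` and `subfield-data` accessors used in the examples further down, and the sketch assumes every subtag is a single character and the separator never occurs inside subfield data.

```racket
#lang racket

;; Hypothetical struct: one subfield = a one-character subtag plus its data.
(struct subfield (subtag data) #:transparent)

;; split-subfields : string? [string?] -> (listof subfield?)
;; Split a field on the separator; the first character of each chunk is
;; the subtag and the rest of the chunk is the subfield data.
(define (split-subfields field [sep "$"])
  (for/list ([chunk (in-list (string-split field sep))]
             #:when (non-empty-string? chunk)) ; skip empty chunks, e.g. from "$$"
    (subfield (substring chunk 0 1) (substring chunk 1))))

(split-subfields "$aCarroll, Lewis,$d1832-1898,$eauthor.")
```

For the example field this yields three subfields with subtags a, d, and e, each keeping its trailing punctuation as part of the data.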

Parsing subfields out of this is often done using regular expressions, but it gets really difficult when trying to deal with subfield repetitions. I'll use field 264 to illustrate. This field mainly contains the following pieces of publication information: the $a subfield contains place of publication; the $b contains the entity responsible for publication; and the $c contains the date of publication. There are several possible repetition patterns for these subfields which require different semantic interpretations. To give a few examples:

  • a+bc: multiple places of publication with the same publisher
    • $aLondon ;$aNew York :$bRoutledge,$c2017.[1]
  • ab+c: multiple publishers with the same place of publication
    • $aNew York, NY :$bBarnes & Noble :$bSterling Publishing Co., Inc.,$c2012.[2]
  • (ab)+c: multiple publications, each with different places and publishers
    • $aBoston :$bLee and Shepard, publishers ;$aNew York :$bLee, Shepard, and Dillingham,$c1872.[3]

Writing a regex to intelligently parse this information out of the string is a pain, but regexes are already a popular and well-understood tool in the metadata community. Thus, marc-matcher lets users specify regular expressions that match subgroups within the field they want to parse, and bind the results of those matches to variables they can use in their code. This allows more complex kinds of processing to be done with simpler code.
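
One way such matching over subfield groups could work is to run the regex over the string of subtags rather than over the raw field. The sketch below is an assumption about the general idea, not a description of the attached helpers (get-subfield-data and simplify-groups may work quite differently); `match-groups` and the `subfield` struct are hypothetical names, and every subtag is assumed to be exactly one character.

```racket
#lang racket

;; Hypothetical struct mirroring the accessors used in the examples.
(struct subfield (subtag data) #:transparent)

;; match-groups : regexp? (listof subfield?) -> (listof (listof subfield?))
;; Concatenate the subtags into one string (e.g. "aabc"), run the regex
;; over it, and map each match's character range back to the subfields
;; at the same indexes.
(define (match-groups rx subfields)
  (define subtags (string-append* (map subfield-subtag subfields)))
  (define sf-vec (list->vector subfields))
  (for/list ([pos (in-list (regexp-match-positions* rx subtags))])
    (for/list ([i (in-range (car pos) (cdr pos))])
      (vector-ref sf-vec i))))

;; The a+bc example: two places of publication, one publisher, one date.
(define field-264
  (list (subfield "a" "London ;")
        (subfield "a" "New York :")
        (subfield "b" "Routledge,")
        (subfield "c" "2017.")))

;; One group holding both $a subfields.
(match-groups #px"a+" field-264)
;; One group holding the single $c subfield.
(match-groups #px"c" field-264)
```

Because the regex only ever sees the subtags, the user's pattern stays as short as `a+` or `ab`, and punctuation or separators inside the data cannot interfere with the match.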

Example

Illustrate one or more ways of using your macro.
Please show code and briefly describe what it does.

This example defines a lambda called parse-264 using marc-matcher:

(define parse-264
  (marc-matcher ([#px"ab" #:as place-entity-groups]
                 [#px"c" #:as date])
                (for/list ([group place-entity-groups])
                  (cons (subfield-data date) (map subfield-data group)))))

The first clause of the marc-matcher expression is a list of variable definitions, similar to a parameter list for a lambda. For example, [#px"ab" #:as place-entity-groups] defines a variable called place-entity-groups, which will be a list of all the groups (which are themselves lists of structs) consisting of a single subfield $a followed by a single subfield $b. The second clause is the computation the user wishes to do with the values extracted from the field, and can refer to the variables defined in the first clause.

The parse-264 function above can then be used as follows:

> (parse-264 "$aBoston :$bLee and Shepard, publishers ;$aNew York :$bLee, Shepard, and Dillingham,$c1872.")
'(("1872." "Boston :" "Lee and Shepard, publishers ;") ("1872." "New York :" "Lee, Shepard, and Dillingham,"))

Here is another example, using table of contents data[4]:

> ((marc-matcher ([#px"tr?" #:as title-info-groups])
               (for ([group title-info-groups])
                 (define title (first (map subfield-data
                                           (filter (λ (sf) (equal? "t" (subfield-subtag sf))) group))))
                 (define authors (map subfield-data
                                      (filter (λ (sf) (equal? "r" (subfield-subtag sf))) group)))
                 (printf "Title: ~a~a~n~n" (string-trim title #px"( /\\s*)|( --\\s*)|\\.")
                         (if (empty? authors) "" (string-append "\nAuthor: "
                                                                (string-trim (first authors)
                                                                             #px"( /\\s*)|( --\\s*)|\\."))))))               
 (string-join '("$tCaveat Lector; or how I ransacked Wikipedias across the Multiverse soley "
                "to amuse and edify readers -- $tMystery of the missing mothers / $rKristin King -- "
                "$tSecrets of Flatland / $rAnne Toole -- $tSanyo TM-300 Home-Use Time Machine / "
                "$rJeremy Sim -- $tElizabeth Burgoyne Corbett / $rL. Timmel Duchamp -- "
                "$tBiographies.") ""))
Title: Caveat Lector; or how I ransacked Wikipedias across the Multiverse soley to amuse and edify readers

Title: Mystery of the missing mothers
Author: Kristin King

Title: Secrets of Flatland
Author: Anne Toole

Title: Sanyo TM-300 Home-Use Time Machine
Author: Jeremy Sim

Title: Elizabeth Burgoyne Corbett
Author: L. Timmel Duchamp

Title: Biographies

Before and After

If you designed your macro to improve some existing code, please explain the improvements.

Use the following categories if applicable:

  • Code Cleaning : Please share the code that you used to write before creating your macro. Briefly explain how the code works.
  • Macro Engineering : Please share the old macro that you revised. Briefly explain the changes.

This would probably count as a code cleaning macro, though the before code doesn't exist (because I've not previously done this kind of metadata work in Racket).
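
For comparison, a hand-written version of the field 264 parser might look roughly like the following. This is a hypothetical sketch written for illustration, not code from the submission: `parse-264/raw` is an invented name, and it assumes "$" is the separator, that "$" never appears inside subfield data, and that the field has a $c subfield.

```racket
#lang racket

;; Hypothetical "before" code: parse field 264 with raw regexes, no macro.
(define (parse-264/raw field)
  ;; The date subfield, shared by every place/publisher pair.
  ;; regexp-match returns #f when there is no $c, so this sketch would
  ;; raise an error on such input.
  (define date (second (regexp-match #px"\\$c([^$]*)" field)))
  ;; Each match yields (list place publisher) for one adjacent $a/$b pair.
  (for/list ([pair (in-list (regexp-match* #px"\\$a([^$]*)\\$b([^$]*)" field
                                           #:match-select cdr))])
    (cons date pair)))

(parse-264/raw
 "$aBoston :$bLee and Shepard, publishers ;$aNew York :$bLee, Shepard, and Dillingham,$c1872.")
```

This handles the (ab)+c pattern, but the separator and data-matching details are now tangled into every regex, and a field following the a+bc pattern would need a different expression: given "$aLondon ;$aNew York :$bRoutledge,$c2017." this version silently drops London.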

Licence

Please confirm that you are submitting this code under the same MIT License that the Racket language uses. https://github.com/racket/racket/blob/master/racket/src/LICENSE-MIT.txt
Please confirm that the associated text is licensed under the Creative Commons Attribution 4.0 International License http://creativecommons.org/licenses/by/4.0/

I confirm that the code is under the same MIT license as the Racket language, and that the associated text is under the Creative Commons Attribution 4.0 International License.

Contact

To receive prizes and/or provide feedback please complete
the form at https://forms.gle/Z5CN2xzK13dfkBnF7 (google account not required / email optional).

@spdegabrielle
Contributor

Awesome! Now I need a z39.50 client!

@hzafar
Author

hzafar commented Jul 26, 2021

A Racket one would be nice! 😆

@spdegabrielle
Contributor

If only I had time! I switched from libraries to health about 12 years ago, so I'm into HL7 instead of MARC21 now.

There is an ASN.1 Library if it is of interest
https://docs.racket-lang.org/asn1

@spdegabrielle
Contributor

Thank you for your contribution!

If you haven’t already please take the time to fill in the form https://forms.gle/Z5CN2xzK13dfkBnF7

Bw
Stephen

bennn added a commit to syntax-objects/syntax-parse-example that referenced this issue Sep 28, 2021
bennn added a commit to syntax-objects/syntax-parse-example that referenced this issue Oct 27, 2021
bennn added a commit to syntax-objects/syntax-parse-example that referenced this issue Oct 27, 2021