Powerful platforms like Elasticsearch & Next.js make it possible for museums to easily build performant, responsive and accessible faceted searches for their online collections.
I've stopped work on this repository and replaced it with a new project:
Repository: https://github.com/derekphilipau/musefully
Website: http://musefully.org/
Musefully ingests data from various sources, including collections, content, and RSS feeds.
This project has been deployed on Vercel at https://bkm-next-search.vercel.app/
OpenAI CLIP Embeddings similarity feature is in the feature-experimental-clip branch. The embeddings were slowing down my test Elasticsearch instance, so I've taken down the Vercel deployment. You can see examples of artwork similarity here.
A typical approach for building a collections website is to periodically sync data from a backend collections management system (sometimes augmented with data from an internal CMS) into a relational database which is queried by a frontend website.
This project takes a different approach, using Elasticsearch as the primary data store and Next.js as the frontend. Note that the collections data is read-only; the authoritative data lives in the backend system. Using the last exported files, the Elasticsearch indices can be rebuilt within a handful of minutes, even with collections of 200,000 documents.
TODO: Implement the "Cloud Function Periodic Sync" using AWS Lambda or Google Cloud Functions. For the time being, Elasticsearch indices are updated manually via command line scripts.
Two methods are offered for updating indices: Update and insert.
The update method loads data from a JSONL file and updates documents in an index. If the document doesn't already exist, it is created. Document fields are updated with new information, but fields not contained in the data file are not updated. For example, the image.dominantColor field is not present in the original data file, as it is populated by another script, so that information is preserved across updates. Note that it may be necessary to force an update of calculated fields, for example if the primary image of an object has changed. Also note that this approach will fail for nested fields.
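As a minimal sketch of the update method, the lines of a bulk request could be built as below. It assumes the update is done with Elasticsearch partial updates and doc_as_upsert; the helper names and the document shape are hypothetical and not taken from this project's scripts. With a partial update, arrays (including arrays of nested objects) are replaced as a whole rather than merged.

```typescript
// Hypothetical document shape: only `id` is assumed to be present.
interface SourceDocument {
  id: string;
  [field: string]: unknown;
}

// Parse JSONL text (one JSON object per line), skipping blank lines.
function parseJsonl(text: string): SourceDocument[] {
  const docs: SourceDocument[] = [];
  for (const line of text.split('\n')) {
    if (line.trim().length > 0) {
      docs.push(JSON.parse(line) as SourceDocument);
    }
  }
  return docs;
}

// Build bulk request lines: an action line followed by a partial document.
// `doc_as_upsert: true` creates the document when it does not exist yet;
// otherwise only the supplied fields are merged into the stored document.
function buildBulkUpdateLines(index: string, docs: SourceDocument[]): object[] {
  const lines: object[] = [];
  for (const doc of docs) {
    lines.push({ update: { _index: index, _id: doc.id } });
    lines.push({ doc: doc, doc_as_upsert: true });
  }
  return lines;
}

const bulkLines = buildBulkUpdateLines(
  'collections',
  parseJsonl('{"id":"1","title":"Vase"}\n\n{"id":"2","title":"Bowl"}\n')
);
```

The resulting array alternates action and document lines, which is the shape the bulk API expects.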
The insert method completely repopulates an index from a JSONL data file. To avoid any downtime, first a timestamped index is created and populated, then the alias is pointed at the newly-created index. This will completely rebuild an index with each sync, and any calculated field values will be lost.
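The alias switch described above can be sketched as follows. The index naming scheme and both function names are assumptions for illustration; the actions array has the shape accepted by the Elasticsearch update-aliases API, which applies all actions atomically.

```typescript
// Build a timestamped index name, e.g. "collections_20240131120000" (UTC).
function timestampedIndexName(alias: string, now: Date): string {
  // "2024-01-31T12:00:00.000Z" -> "20240131120000000" -> first 14 digits
  const stamp = now.toISOString().replace(/\D/g, '').slice(0, 14);
  return `${alias}_${stamp}`;
}

interface AliasAction {
  add?: { index: string; alias: string };
  remove?: { index: string; alias: string };
}

// Detach the alias from the old indices and attach it to the new one.
// Sending all actions in one request avoids a window with no alias.
function buildAliasSwapActions(
  alias: string,
  newIndex: string,
  oldIndices: string[]
): AliasAction[] {
  const actions: AliasAction[] = [];
  for (const oldIndex of oldIndices) {
    actions.push({ remove: { index: oldIndex, alias } });
  }
  actions.push({ add: { index: newIndex, alias } });
  return actions;
}

const newIndexName = timestampedIndexName(
  'collections',
  new Date('2024-01-31T12:00:00Z')
);
const swapActions = buildAliasSwapActions('collections', newIndexName, [
  'collections_20240101000000',
]);
```

Once the alias points at the new index, the old timestamped index can be deleted.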
All data was collected via the Brooklyn Museum Open API.
The Archives data was collected using the OAI-PMH harvesting API of Brooklyn Museum's ArchivesSpace service.
The Dublin Core metadata is limited. It would be better to use ArchivesSpace's native API to index all the metadata fields like Language, Type, Names, etc.
It's often necessary to augment backend collection data with additional metadata. For example, theme tags like "Climate Justice" might be associated with artworks in a CMS rather than the backend collections management system. The "sortPriority" field allows one to prominently display specific documents by adjusting the ordering in default searches. This additional metadata is stored in data/BrooklynMuseum/additionalMetadata.jsonl, but could just as easily be exported from a CMS.
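A rough sketch of how such metadata might be merged into collection documents before indexing is shown below. The shape of a metadata line (id, sortPriority, keywords) and the function name are assumptions; the actual file format may differ.

```typescript
// Hypothetical shape of one line in the additional metadata file.
interface AdditionalMetadata {
  id: string;
  sortPriority?: number;
  keywords?: string[];
}

// Reduced, hypothetical collection document.
interface CollectionDocument {
  id: string;
  title: string;
  sortPriority?: number;
  keywords?: string[];
}

// Merge extra metadata into documents with a matching id.
// Documents without extra metadata are returned unchanged.
function applyAdditionalMetadata(
  docs: CollectionDocument[],
  extras: AdditionalMetadata[]
): CollectionDocument[] {
  const byId: Record<string, AdditionalMetadata> = {};
  for (const extra of extras) {
    byId[extra.id] = extra;
  }
  return docs.map((doc) => {
    const extra = byId[doc.id];
    if (!extra) return doc;
    return {
      ...doc,
      sortPriority:
        extra.sortPriority !== undefined ? extra.sortPriority : doc.sortPriority,
      keywords: (doc.keywords || []).concat(extra.keywords || []),
    };
  });
}

const merged = applyAdditionalMetadata(
  [
    { id: '1', title: 'Vase', keywords: ['ceramics'] },
    { id: '2', title: 'Bowl' },
  ],
  [{ id: '1', sortPriority: 10, keywords: ['Climate Justice'] }]
);
```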
ULAN XML was downloaded from Getty's website and converted to JSON using the transformUlan.ts script. When updating the terms index, the script attempts to find a matching artist name from this JSON file. If found, the ULAN artist data is added to the terms index document.
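The name matching could look roughly like the sketch below. The record shape (id, preferredName, alternateNames), the normalization rules, and the function names are all assumptions for illustration; the output of transformUlan.ts may be structured differently.

```typescript
// Hypothetical shape of one converted ULAN record.
interface UlanArtist {
  id: string;
  preferredName: string;
  alternateNames: string[];
}

// Lowercase, collapse runs of whitespace, and trim.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Index artists by normalized name. A preferred name always claims its key;
// an alternate name is only added when the key is still free.
function buildUlanLookup(artists: UlanArtist[]): Record<string, UlanArtist> {
  const lookup = Object.create(null) as Record<string, UlanArtist>;
  for (const artist of artists) {
    lookup[normalizeName(artist.preferredName)] = artist;
    for (const alternate of artist.alternateNames) {
      const key = normalizeName(alternate);
      if (!(key in lookup)) {
        lookup[key] = artist;
      }
    }
  }
  return lookup;
}

// Return the matching artist, or undefined when there is no match.
function findUlanArtist(
  lookup: Record<string, UlanArtist>,
  name: string
): UlanArtist | undefined {
  const key = normalizeName(name);
  return key in lookup ? lookup[key] : undefined;
}

const ulanLookup = buildUlanLookup([
  {
    id: 'example-1', // placeholder, not a real ULAN ID
    preferredName: 'Picasso, Pablo',
    alternateNames: ['Pablo Picasso', 'Picasso'],
  },
]);
const ulanMatch = findUlanArtist(ulanLookup, '  pablo   PICASSO ');
```

Exact matching on normalized names is deliberately conservative; fuzzy matching would find more artists at the cost of false positives.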
This project uses Elasticsearch Query DSL with the Elasticsearch JavaScript Client to manage indices and query data.
Basic Elasticsearch index, field types, analyzers, and filters are defined in util/elasticsearch/settings.ts.
Adjust number_of_shards and number_of_replicas for your use case.
- hyphenApostropheMappingFilter - replaces hyphens with spaces and removes single quotes.
- articleCharFilter - UNUSED - replaces common articles (de, van, der, etc.) with spaces. Originally intended to help with searching names containing many articles.
- enSnowball - stems words using a Snowball-generated stemmer for English language
- unaggregatedStandardAnalyzer - For common text fields that are not aggregated.
- aggregatedKeywordAnalyzer - For aggregated keyword fields
- suggestAnalyzer - For search_as_you_type fields
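To illustrate the effect of hyphenApostropheMappingFilter, here is a plain TypeScript simulation of the described behavior. It is not the Elasticsearch filter definition itself, and the function name is hypothetical.

```typescript
// Simulate the described character mapping:
// hyphens become spaces, single quotes are removed.
function applyHyphenApostropheMapping(text: string): string {
  return text.replace(/-/g, ' ').replace(/'/g, '');
}

const mappedHyphen = applyHyphenApostropheMapping('Toulouse-Lautrec');
const mappedQuote = applyHyphenApostropheMapping("O'Keeffe");
```

Assuming the same analyzer runs at index time and at search time, a query typed as "Toulouse Lautrec" or "OKeeffe" would then produce the same tokens as the original spelling.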
Most definitions are straightforward. Some search and suggest fields contain "search" and "suggest" subfields for those use cases.
Indices are defined in /util/elasticsearch/indices.ts.
Some object fields are re-used across multiple indices.
Constituent:
- id - Source-dependent ID of the constituent
- name - Name of the constituent, e.g. "Pablo Picasso"
- dates - A free-form string representing the dates of a constituent, often the birth & death of an artist, e.g. "ca. 1483–1556"
- birthYear - Birth year of the constituent
- deathYear - Death year of the constituent
- nationality - Array of nationalities of the constituent
- gender - Gender. Note that ULAN seems to only record 'Male', 'Female', and 'N/A'
- role - Role of the constituent, e.g. "Artist", "Maker", "Photographer", etc.
- source - Source of the constituent, e.g. "Brooklyn Museum", "Getty ULAN"
- sourceId - Source-dependent ID of the constituent
- wikiQid - Wikidata QID of the constituent
- ulanId - ULAN ID of the constituent
Geographical Location:
- id - Source-dependent ID of the location
- name - Name of the location, e.g. "New York, New York, United States"
- continent - Continent of the location, e.g. "North America"
- country - Country of the location, e.g. "United States"
- type - Type of location, e.g. "City", "State", "Country", etc.
Image:
- url - The URL of the image
- thumbnailUrl - The URL of the thumbnail
- alt - The alt text for the image
- dominantColors - An array of arrays of HSL colors and other information used for color search
- year - The year of the image
- view - The view of the image, e.g. "front", "back", "detail", etc.
- rank - The rank of the image, used for sorting
- embedding - Experimental Feature. CLIP image embedding for similarity & text search. Removed. See examples here.
Museum Location:
- id - Source-dependent ID of the location
- name - Name of the location, e.g. ""
- isPublic - Whether the location is public
- isFloor - Whether the location is a floor
- parentId - The ID of the parent location
The base document defines common fields for all indices; these are the fields used for cross-index search. The Elasticsearch Base Document fields are defined in indices.ts and the associated TypeScript interface is defined in /types/baseDocument.ts.
- type - The type of document, e.g. "collections", "archives", "terms"
- source - The source of the document, e.g. "Brooklyn Museum", "Getty ULAN"
- url - The URL of the document
- id - The unique ID of the document
- title - The title of the document
- description - The description of the document
- searchText - The text used for full-text search. This can be configured on a per-index basis to allow global search to include special fields like accession number.
- keywords - An array of keywords for the document
- boostedKeywords - An array of keywords that should be boosted in search results
- primaryConstituent - The primary constituent of the document, e.g. the artist of a painting.
- image - Image. The main image of the document
- date - Date the document was created, not currently used.
- formattedDate - A string representing the date, no strict format.
- startYear - An integer representing the start date year. Used for year range filtering.
- endYear - An integer representing the end date year. Used for year range filtering.
- sortPriority - Integer representing the priority or weight of a document. Allows for default search results customization.
Note on dates: Museum objects span a wide range of dates, from prehistoric BCE to contemporary CE, which ISO 8601 date strings handle poorly, hence the use of signed integers to represent years.
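As a sketch of how signed integer years can drive year range filtering, the functions below build two range clauses over startYear and endYear. The overlap semantics (an object matches when its date span intersects the requested span), the convention that negative values mean BCE, and the function names are assumptions; the project's actual filter may differ.

```typescript
// An object dated [startYear, endYear] overlaps the requested span
// [fromYear, toYear] when startYear <= toYear and endYear >= fromYear.
function buildYearRangeFilters(fromYear: number, toYear: number): object[] {
  return [
    { range: { startYear: { lte: toYear } } },
    { range: { endYear: { gte: fromYear } } },
  ];
}

// Display helper, assuming negative integers denote BCE years.
function formatYear(year: number): string {
  return year < 0 ? `${-year} BCE` : `${year} CE`;
}

const yearFilters = buildYearRangeFilters(-500, -300);
const yearLabel = formatYear(-500);
```

The two clauses would typically go into the filter section of a bool query, so they restrict results without affecting scoring.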
Includes all Base Document fields as well as:
- constituents - Constituent array. Entities associated with the document, e.g. artists, photographers, organization, etc.
- images - Image array. Images associated with the document.
- accessionNumber - The accession number.
- accessionDate - Free-form date field for accession date.
- period - The period, e.g. "Edo Period", "Middle Kingdom", etc.
- dynasty - The dynasty, e.g. "Qing Dynasty", "Mughal", etc.
- provenance - Free-text field describing provenance.
- medium - The medium, e.g. "Oil on canvas", "Woodblock print", etc.
- dimensions - The dimensions, e.g. "Sheet: 14 1/2 x 10 1/4 in. (36.8 x 26 cm)" TODO: Normalize dimensions into standardized fields.
- edition - The edition, e.g. "Edition: 23/50"
- portfolio - The portfolio, e.g. "Scenes from the Life of Saint Lawrence"
- markings - Markings on object, e.g. "Stamped on back: 'HERTER BRO'S.'"
- signed - Signature on object, e.g. "Kunichika ga 国周画"
- inscribed - Inscription on object
- creditLine - Credit line, e.g. "Dick S. Ramsay Fund"
- copyright - Copyright, e.g. "© Park McArthur"
- classification - Classification, e.g. "Print", "Sculpture", "Painting", etc.
- publicAccess - Boolean, if true is public access.
- copyrightRestricted - Boolean, if true images are restricted.
- highlight - Boolean whether or not object is highlighted. TODO: Remove, Brooklyn Museum-specific.
- section - Museum-specific gallery section, e.g. "Old Kingdom"
- museumLocation - Museum Location. Museum-specific location within museum
- onView - Whether or not the object is currently on view.
- rightsType - Specifies copyright type, e.g. "Creative Commons-BY"
- labels - Array of gallery labels. TODO: Define type & add to searchText?
- collections - An array of collections the object belongs to.
- exhibitions - An array of exhibitions the object has been in. TODO: Assumes exhibitions have unique names.
- geographicLocations - Geographical Location array. Geographic locations associated with the object.
- primaryGeographicalLocation - Geographical Location. The primary location associated with the object.
Content documents represent a web page or resource, typically from a museum's website. The fields are the same as Base Document.
Archives documents represent archival collections. The fields are the same as Base Document with the addition of:
- accessionNumber - (dc:identifier) The accession number.
- primaryConstituent - (dc:creator) Primary constituent, often the primary maker, e.g. the artist.
- subject - (dc:subject) The subject of the archival collection.
- language - (dc:language) The language of the archival collection, e.g. "en".
- publisher - (dc:publisher) The publisher of the record, e.g. "Brooklyn Museum Archives"
- format - (dc:format) e.g. "17.916 Linear Feet; 43 document boxes"
- rights - (dc:rights) e.g. "Collection is open for research; permission of archivist required..."
- relation - (dc:relation) e.g. "Office of the Director records, DIR"
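The Dublin Core mapping above could be expressed as a small transform like the following sketch. The input shape (element name mapped to an array of values, since Dublin Core elements may repeat), the choice to keep subject as an array, and the function names are assumptions; the sample record is made up.

```typescript
// Hypothetical parsed Dublin Core record: element name -> repeated values.
type DublinCoreRecord = Record<string, string[] | undefined>;

interface ArchivesFields {
  accessionNumber?: string;
  primaryConstituent?: { name: string };
  subject?: string[];
  language?: string;
  publisher?: string;
  format?: string;
  rights?: string;
  relation?: string;
}

// Return the first value of an element, or undefined when absent.
function firstValue(record: DublinCoreRecord, element: string): string | undefined {
  const values = record[element];
  return values && values.length > 0 ? values[0] : undefined;
}

// Map Dublin Core elements to the archives-specific fields listed above.
function mapDublinCore(record: DublinCoreRecord): ArchivesFields {
  const creator = firstValue(record, 'dc:creator');
  return {
    accessionNumber: firstValue(record, 'dc:identifier'),
    primaryConstituent: creator !== undefined ? { name: creator } : undefined,
    subject: record['dc:subject'],
    language: firstValue(record, 'dc:language'),
    publisher: firstValue(record, 'dc:publisher'),
    format: firstValue(record, 'dc:format'),
    rights: firstValue(record, 'dc:rights'),
    relation: firstValue(record, 'dc:relation'),
  };
}

const archivesFields = mapDublinCore({
  'dc:identifier': ['EXAMPLE-001'],
  'dc:creator': ['Example Creator'],
  'dc:subject': ['Exhibitions', 'Correspondence'],
  'dc:language': ['en'],
});
```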
Terms documents represent terms from a controlled vocabulary. These are queried for "did you mean?" searches. The fields are the same as Base Document with the addition of:
- sourceId: The ID of the term in the source vocabulary.
- sourceType: The type of the term within the source vocabulary.
- index: The index the term belongs to, e.g. "collections".
- field: The field the term belongs to, e.g. "classification", "primaryConstituent.name"
- value: The value of the term, e.g. "Painting", "Pablo Picasso", etc.
- preferred: The preferred term, e.g. "Pablo Picasso"
- alternates: An array of alternate terms, e.g. ["Picasso, Pablo", "Picasso", etc.]
- summary: A summary of the term. Deprecated, use data fields instead.
- description: A description of the term. Deprecated, use data fields instead.
- data: The raw data of the term, e.g. the JSON from the Getty ULAN.
Text queries are currently searched with multi_match using the default best_fields type. Fields can be weighted to give priority; in this case boostedKeywords is very heavily weighted for cases where you want a document to appear first if it contains an important keyword.
multi_match: {
  query: q,
  type: 'best_fields',
  operator: 'and',
  fields: [
    'boostedKeywords^20',
    'constituents^4', // TODO
    'title^2',
    'keywords^2',
    'description',
    'searchText',
    'accessionNumber',
  ],
},
How one defines object similarity will vary from institution to institution. There are a number of approaches to querying Elasticsearch for similar documents, notably more_like_this.
This project uses a custom bool query of boosted should terms. similarObjects.ts specifies which fields are used along with a boost value for each. The primary constituent (e.g. Artist, Maker, etc.) is given the most weight. These fields can be adjusted based on your institution's concept of object similarity. The current weights are:
- primaryConstituent.name - 4
- dynasty - 2
- period - 2
- classification - 1.5
- medium - 1
- collections - 1
- exhibitions - 1
- primaryGeographicalLocation.name - 1
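A bool query of boosted should terms using the weights above might be assembled as in this sketch. It is not a copy of similarObjects.ts: the function names, the must_not clause that excludes the source object, and the use of term/terms queries (which assume the fields are indexed as keywords) are assumptions.

```typescript
// Field weights as listed above.
const SIMILARITY_FIELDS: Array<{ field: string; boost: number }> = [
  { field: 'primaryConstituent.name', boost: 4 },
  { field: 'dynasty', boost: 2 },
  { field: 'period', boost: 2 },
  { field: 'classification', boost: 1.5 },
  { field: 'medium', boost: 1 },
  { field: 'collections', boost: 1 },
  { field: 'exhibitions', boost: 1 },
  { field: 'primaryGeographicalLocation.name', boost: 1 },
];

// Read a dotted path such as "primaryConstituent.name" from an object.
function getPath(source: unknown, path: string): unknown {
  let current: unknown = source;
  for (const key of path.split('.')) {
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// One boosted clause per field that has a value on the source document.
// Arrays become `terms` clauses, non-empty strings become `term` clauses.
function buildSimilarObjectsQuery(doc: Record<string, unknown>) {
  const should: object[] = [];
  for (const { field, boost } of SIMILARITY_FIELDS) {
    const value = getPath(doc, field);
    if (Array.isArray(value)) {
      if (value.length > 0) should.push({ terms: { [field]: value, boost } });
    } else if (typeof value === 'string' && value !== '') {
      should.push({ term: { [field]: { value, boost } } });
    }
  }
  return {
    bool: {
      must_not: [{ term: { id: doc.id } }], // exclude the object itself
      should,
      minimum_should_match: 1,
    },
  };
}

const similarQuery = buildSimilarObjectsQuery({
  id: '1',
  primaryConstituent: { name: 'Pablo Picasso' },
  classification: 'Painting',
  medium: '',
  collections: ['European Art'],
});
```

Because every clause is optional and boosted, objects sharing the artist rank above objects that only share a classification or collection.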
Based on https://github.com/shadcn/next-template (Website, UI Components), which is an implementation of Radix UI with Tailwind and other helpful utilities.
- Full-text search, including accession number
- API Endpoints for search & document retrieval
- Searchable filters
- Linked object properties
- Custom similarity algorithm with combined weighted terms (can be adjusted)
- Dominant color similarity using HSV color space.
- Embedded JSON-LD (Schema.org VisualArtwork) for better SEO and sharing
- Image Zoom with Openseadragon
- Image carousel with embla-carousel
- Form handling via Formspree
- Meta & OG meta tags
- lucide-react icons
- Tailwind CSS
- next-themes dark/light modes
- @next/font font loading
I've added CLIP Embeddings but there's no code in this project to add embeddings yourself. I've used the code here to add the embeddings via a Colab notebook, but it's a hack. Removed. See examples here.
Ideally, all one needs to do is export TMS data to JSON matching the format of the Elasticsearch index.
Searches can be performed against any index. Search requests are of the form:
GET http://localhost:3000/api/search/[index]?[querystring]
Querystring parameters are the same as those for the Web UI:
GET http://localhost:3000/api/search/collections?f=true&.name=George%20Bradford%20Brainerd
Document requests are of the form:
GET http://localhost:3000/api/[index]/[documentId]
For example, to get collection object #53453:
GET http://localhost:3000/api/collections/53453
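For callers of the API, the two URL forms above can be built with small helpers like these. The helper names are hypothetical, and the q parameter in the example is an assumption about the querystring; only f=true appears in the example above.

```typescript
// Build a search URL: /api/search/[index]?[querystring]
function buildSearchUrl(
  baseUrl: string,
  index: string,
  params: Record<string, string>
): string {
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  const path = `${baseUrl}/api/search/${encodeURIComponent(index)}`;
  return query.length > 0 ? `${path}?${query}` : path;
}

// Build a document URL: /api/[index]/[documentId]
function buildDocumentUrl(baseUrl: string, index: string, documentId: string): string {
  return `${baseUrl}/api/${encodeURIComponent(index)}/${encodeURIComponent(documentId)}`;
}

const searchUrl = buildSearchUrl('http://localhost:3000', 'collections', {
  f: 'true',
  q: 'George Bradford Brainerd', // `q` is an assumed parameter name
});
const documentUrl = buildDocumentUrl('http://localhost:3000', 'collections', '53453');
```

Either URL can then be passed to fetch, and the JSON response read with response.json().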
You can run Elasticsearch in a Docker container, or sign up for an Elasticsearch Cloud account. For Docker, follow the instructions here. Sign up for an Elasticsearch Cloud account here.
Once you have a running Elasticsearch service, you can add the connection details to the environment variables.
For local development, add a local .env.local file in the root directory. If ELASTICSEARCH_USE_CLOUD is "true", the Elastic Cloud vars will be used, otherwise the _HOST, _PROTOCOL, _PORT, _CA_FILE, and _API_KEY vars will be used. You may need to copy the http_ca.crt from the Elasticsearch Docker container to a local directory like ./secrets.
On Formspree you should set up a basic contact form and enter the FORMSPREE_FORM_ID env variable.
For cloud deployments (for example on Vercel), add the same variables to the Environment Variables of your deployment.
DATASET=brooklynMuseum
ELASTICSEARCH_USE_CLOUD=true
ELASTICSEARCH_CLOUD_ID=elastic-museum-test:dXMtY2VudlasfdkjfdwLmNsb3VkLmVzLmlvOjQ0MyQ5ZDhiNWQ2NDM0NTA0ODgwadslfjk;ldfksjfdlNmE2M2IwMmaslfkjfdlksj2ZTU5MzZmMg==
ELASTICSEARCH_CLOUD_USERNAME=elastic
ELASTICSEARCH_CLOUD_PASSWORD=aslflsafdkjlkjslakdfj
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PROTOCOL=https
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_CA_FILE=./secrets/http_ca.crt
ELASTICSEARCH_API_KEY=DssaSLfdsFKJidsljfakslfjfLIJEWLiMkJPQzNwSzVmQQ==
ELASTICSEARCH_BULK_LIMIT=1000
FORMSPREE_FORM_ID=mskbksar
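The switch between cloud and host settings could be resolved as in the sketch below. The function and type names are hypothetical, and the result is a plain object rather than an Elasticsearch client; passing it to the client (and reading the CA file from disk) is left to the caller.

```typescript
type Env = Record<string, string | undefined>;

interface CloudConnection {
  kind: 'cloud';
  cloudId: string;
  username: string;
  password: string;
}

interface HostConnection {
  kind: 'host';
  node: string; // e.g. "https://localhost:9200"
  caFile: string; // path to http_ca.crt; the caller reads the file
  apiKey: string;
}

// Read a variable or fail early with a clear message.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Pick cloud or host settings depending on ELASTICSEARCH_USE_CLOUD.
function resolveConnection(env: Env): CloudConnection | HostConnection {
  if (env.ELASTICSEARCH_USE_CLOUD === 'true') {
    return {
      kind: 'cloud',
      cloudId: requireVar(env, 'ELASTICSEARCH_CLOUD_ID'),
      username: requireVar(env, 'ELASTICSEARCH_CLOUD_USERNAME'),
      password: requireVar(env, 'ELASTICSEARCH_CLOUD_PASSWORD'),
    };
  }
  const protocol = requireVar(env, 'ELASTICSEARCH_PROTOCOL');
  const host = requireVar(env, 'ELASTICSEARCH_HOST');
  const port = requireVar(env, 'ELASTICSEARCH_PORT');
  return {
    kind: 'host',
    node: `${protocol}://${host}:${port}`,
    caFile: requireVar(env, 'ELASTICSEARCH_CA_FILE'),
    apiKey: requireVar(env, 'ELASTICSEARCH_API_KEY'),
  };
}

const connection = resolveConnection({
  ELASTICSEARCH_USE_CLOUD: 'false',
  ELASTICSEARCH_HOST: 'localhost',
  ELASTICSEARCH_PROTOCOL: 'https',
  ELASTICSEARCH_PORT: '9200',
  ELASTICSEARCH_CA_FILE: './secrets/http_ca.crt',
  ELASTICSEARCH_API_KEY: 'example-key',
});
```

In a Node.js process the function would be called with process.env.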
Fork/download this project and run npm i to install dependencies.
Then, run the development server with npm run dev and open http://localhost:3000 with your browser to see the result.
If you have not yet loaded the Elasticsearch data, you should see an error on the search page that the index does not exist.
From the command line, run: npm run import
The main data file with collection objects is ./data/BrooklynMuseum/collections.jsonl.gz. importDataCommand.ts will load compressed data from .jsonl.gz files in the data/BrooklynMuseum/ directory into Elasticsearch indices. Warning: This will modify Elasticsearch indices.
This command will:
- Load environment variables from .env.local
- Ask if you want to proceed with the import
- Ask if you want to import the collections index (all records)
- Ask if you want to import the content index (all records)
- Ask if you want to import the archives index (all records)
- Ask if you want to update the terms index. Queries collections index for collections, classifications, and primaryConstituent fields, then adds unique values to the terms index.
- Ask if you want to update the ULAN terms index. Queries collections index for all unique primaryConstituent values, then searches ULAN data files for each name. If a match is found, ULAN data is added to the term.
- Ask if you want to update dominant colors. This will only update colors for images which haven't already been analyzed.
The import process will take some time, as it inserts 1000 documents at a time using Elasticsearch bulk and then rests for a couple of seconds. There are about 100,000 documents in the collections dataset, 800 in content, and 31,000 in the archives dataset.
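The batch-and-pause pattern can be sketched as follows. The function names and the injected sendBatch callback are hypothetical; in a real import, sendBatch would wrap the Elasticsearch bulk call, and the batch size presumably corresponds to ELASTICSEARCH_BULK_LIMIT.

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Send documents batch by batch, pausing after each batch so the
// Elasticsearch instance is not overwhelmed. Returns the number sent.
async function importInBatches<T>(
  docs: T[],
  sendBatch: (batch: T[]) => Promise<void>,
  batchSize = 1000,
  pauseMs = 2000
): Promise<number> {
  let sent = 0;
  for (const batch of chunk(docs, batchSize)) {
    await sendBatch(batch);
    sent += batch.length;
    await sleep(pauseMs);
  }
  return sent;
}
```

With these defaults, roughly 100,000 documents mean about 100 batches, so the pauses alone add a few minutes to the import.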
Licensed under the MIT license.
One should see 100s across the board for the Lighthouse score, with a slightly lower score for performance due to relying on the Brooklyn Museum image CDN.
Light mode example:
Dark mode example:
Color search example:
Object page example: