Build reference set indices in a separate pass
To ensure correct behaviour, import now runs in three phases: core components, interim indexing, and extended components. This likely results in some duplicate indexing, but ensures correct update-in-place operation and correct handling of inactive reference set items. When the user initiates indexing, the component indices are then completely rebuilt, together with the search and reference set member indices.
wardle committed Nov 8, 2022
1 parent d934e28 commit 63af017
Showing 6 changed files with 116 additions and 72 deletions.
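The three-phase import described in the commit message can be sketched without a real store. This is a minimal illustration only: the contents of `core-components`, the component keywords, the file names and the `import!`/`index!` callbacks are all assumptions made for the example, not the actual hermes definitions.

```clojure
;; Hypothetical set of core component types; the real set is defined in hermes.
(def core-components #{:concept :description :relationship})

(defn three-phase-import!
  "Import core files, run interim indexing, then import the remaining files.
  `import!` and `index!` are caller-supplied side-effecting functions."
  [files import! index!]
  (let [{core true, extended false}
        (group-by #(contains? core-components (:component %)) files)]
    (import! core)        ; phase 1: core components and essential metadata
    (index!)              ; phase 2: interim indexing
    (import! extended)))  ; phase 3: non-core and extension files

;; Record the order of calls instead of touching a database.
(def call-log (atom []))

(three-phase-import!
  [{:component :concept       :file "concepts.txt"}
   {:component :simple-refset :file "refset.txt"}
   {:component :relationship  :file "relationships.txt"}]
  (fn [fs] (swap! call-log conj [:import (mapv :file fs)]))
  (fn [] (swap! call-log conj [:index])))
```

After the call, `@call-log` holds the two core files first, then the `:index` step, then the refset file, which mirrors the ordering the new `import-snomed` enforces for each directory.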
32 changes: 18 additions & 14 deletions README.md
@@ -101,14 +101,14 @@ My tiny i5 'NUC' machine takes 1 minute to import the UK edition of SNOMED CT an
dictionary
of medicines and devices.

4. Compact and index
4. Index and compact

```shell
clj -M:run --db snomed.db compact
clj -M:run --db snomed.db index
clj -M:run --db snomed.db compact
```

My machine takes 20 seconds to compact the database and 6 minutes to build the search indices.
My machine takes 6 minutes to build the search indices and 20 seconds to compact the database.

5. Run a server!

@@ -414,38 +414,42 @@ clj -M:run --db snomed.db import ~/Downloads/snomed-2020/
The import of both International and UK distribution files takes
a total of less than 3 minutes on my machine.

#### 2. Compact database (optional).
#### 2. Index

This reduces the file size and takes 20 seconds.
This is an optional step, but recommended.
For correct operation, indices are needed for components, search and reference
set membership.

Run

```shell
java -jar hermes.jar --db snomed.db compact
java -jar hermes.jar --db snomed.db index
```

or

```shell
clj -M:run --db snomed.db compact
clj -M:run --db snomed.db index
```

Unlike prior versions, you do not need to give java more heap.
This will build the indices; it takes about 6 minutes on my machine.

#### 3. Build search index
#### 3. Compact database (optional).

Run
This reduces the file size and takes 20 seconds.
This is an optional step, but recommended.

```shell
java -jar hermes.jar --db snomed.db index
java -jar hermes.jar --db snomed.db compact
```

or

```shell
clj -M:run --db snomed.db index
clj -M:run --db snomed.db compact
```

This will build the search indices; it takes about 6 minutes on my machine.
Unlike prior versions, you do not need to give java more heap.


#### 4. Run a REPL (optional)

4 changes: 2 additions & 2 deletions cmd/com/eldrix/hermes/cmd/core.clj
@@ -38,8 +38,8 @@
(defn build-index [{:keys [db locale]} _]
(if db
(if (str/blank? locale)
(hermes/build-search-indices db)
(hermes/build-search-indices db locale))
(hermes/index db)
(hermes/index db locale))
(log/error "no database directory specified")))

(defn compact [{:keys [db]} _]
51 changes: 31 additions & 20 deletions src/com/eldrix/hermes/core.clj
@@ -721,47 +721,58 @@

(defn import-snomed
"Import SNOMED distribution files from the directories `dirs` specified into
the database directory `root` specified. Import is performed in two phases
for each directory - firstly core components and essential metadata, and
secondly non-core and extension files. Finally, store indices are re-built"
the database directory `root` specified.
Import is performed in three phases for each directory:
1. import of core components and essential metadata, and
2. interim indexing
3. import of non-core and extension files.
Interim indexing is necessary in order to ensure correct reification in
subsequent import(s)."
[root dirs]
(let [manifest (open-manifest root true)
store-filename (get-absolute-filename root (:store manifest))]
(doseq [dir dirs]
(log-metadata dir)
(let [files (importer/importable-files dir)]
(do-import-snomed store-filename (->> files (filter #(core-components (:component %)))))
(do-import-snomed store-filename (->> files (remove #(core-components (:component %)))))
(with-open [st (store/open-store store-filename {:read-only? false})]
(log/info "Rebuilding store indices...")
(store/build-indices st)
(log/info "Rebuilding store indices... completed"))))))
(store/index st))
(do-import-snomed store-filename (->> files (remove #(core-components (:component %)))))))))

(defn compact
[root]
(let [manifest (open-manifest root false)]
(log/info "Compacting database at " root "...")
(with-open [st (store/open-store (get-absolute-filename root (:store manifest)) {:read-only? false})]
(store/compact st))
(log/info "Compacting database... complete.")))
(log/info "Compacting database... complete")))

(defn build-search-indices
([root] (build-search-indices root (.toLanguageTag (Locale/getDefault))))
(defn index
([root] (index root (.toLanguageTag (Locale/getDefault))))
([root language-priority-list]
(let [manifest (open-manifest root false)]
(log/info "Building indices" {:root root :languages language-priority-list})
(log/info "Building search index")
(search/build-search-index (get-absolute-filename root (:store manifest))
(get-absolute-filename root (:search manifest))
language-priority-list)
(let [manifest (open-manifest root false)
store-filename (get-absolute-filename root (:store manifest))
search-filename (get-absolute-filename root (:search manifest))
members-filename (get-absolute-filename root (:members manifest))]
(log/info "Indexing..." {:root root})
(log/info "Building component index")
(with-open [st (store/open-store store-filename {:read-only? false})]
(store/index st))
(log/info "Building search index" {:languages language-priority-list})
(search/build-search-index store-filename search-filename language-priority-list)
(log/info "Building members index")
(members/build-members-index (get-absolute-filename root (:store manifest))
(get-absolute-filename root (:members manifest)))
(log/info "Building indices... complete."))))
(members/build-members-index store-filename members-filename)
(log/info "Indexing... complete"))))

(def ^:deprecated build-search-indices
"DEPRECATED: Use [[index]] instead"
index)

(def ^:deprecated build-search-index
"DEPRECATED: Use [[build-search-indices]] instead"
build-search-indices)
index)

(defn get-status [root & {:keys [counts? installed-refsets?] :or {counts? false installed-refsets? true}}]
(let [manifest (open-manifest root)]
87 changes: 56 additions & 31 deletions src/com/eldrix/hermes/impl/lmdb.clj
@@ -180,10 +180,7 @@

(defn write-relationships
"Each relationship is stored as an entity in the 'relationships' db, keyed
by a relationship-id.
Each *active* relationship is referenced in the 'conceptParentRelationships'
and 'conceptChildRelationships' indices,"
by a relationship-id."
[^LmdbStore store relationships]
(with-open [txn (.txnWrite ^Env (.-coreEnv store))]
(let [db ^Dbi (.-relationships store)
@@ -198,7 +195,9 @@
(.commit txn)
(finally (.release kb) (.release vb))))))

(defn drop-relationships-index [^LmdbStore store]
(defn drop-relationships-index
"Deletes all indices relating to relationships."
[^LmdbStore store]
(with-open [txn ^Txn (.txnWrite ^Env (.-coreEnv store))]
(let [parent-idx ^Dbi (.-conceptParentRelationships store)
child-idx ^Dbi (.-conceptChildRelationships store)]
@@ -207,7 +206,9 @@
(.commit txn)))

(defn index-relationships
"Iterates all active relationships and rebuilds parent and child indices."
"Iterates all active relationships and rebuilds parent and child indices.
Each *active* relationship is referenced in the 'conceptParentRelationships'
and 'conceptChildRelationships' indices."
[^LmdbStore store]
(with-open [write-txn ^Txn (.txnWrite ^Env (.-coreEnv store))
read-txn ^Txn (.txnRead ^Env (.-coreEnv store))
@@ -247,49 +248,75 @@
"Each reference set item is stored as an entity in the 'refsetItems' db, keyed
by the UUID, a tuple of msb and lsb.
Each *active* item is indexed:
- refsetFieldNames : refset-id -- field-names (an array of strings)
- componentRefsets : referencedComponentId -- refsetId -- msb -- lsb
- associations : targetComponentId -- refsetId -- referencedComponentId - msb - lsb"
During import, an index of refset field names is created:
- refsetFieldNames : refset-id -- field-names (an array of strings)"
[^LmdbStore store headings items]
(with-open [core-txn (.txnWrite ^Env (.-coreEnv store))
refsets-txn (.txnWrite ^Env (.-refsetsEnv store))]
(let [items-db ^Dbi (.-refsetItems store)
components-db ^Dbi (.-componentRefsets store)
assocs-db ^Dbi (.-associations store)
item-kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 16) ;; a UUID - 16 bytes
vb (.directBuffer (PooledByteBufAllocator/DEFAULT) 512)
component-kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 32) ;; referencedComponentId -- refsetId -- msb -- lsb
assoc-kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 40) ;; targetComponentId -- refsetId -- referencedComponentId - msb - lsb
idx-val (.directBuffer (PooledByteBufAllocator/DEFAULT) 0)]
vb (.directBuffer (PooledByteBufAllocator/DEFAULT) 512)]
(try
(loop [items' items refset-ids #{}]
(when-let [item (first items')]
(when-not (contains? refset-ids (:refsetId item))
(write-refset-headings store refsets-txn (:refsetId item) headings))
(let [msb (.getMostSignificantBits ^UUID (:id item))
lsb (.getLeastSignificantBits ^UUID (:id item))
target-id (:targetComponentId item)]
lsb (.getLeastSignificantBits ^UUID (:id item))]
(doto item-kb .clear (.writeLong msb) (.writeLong lsb))
(doto component-kb .clear (.writeLong (:referencedComponentId item)) (.writeLong (:refsetId item)) (.writeLong msb) (.writeLong lsb))
(when (should-write-object? items-db refsets-txn item-kb 17 (:effectiveTime item)) ;; skip a 17 byte key (type-msb-lsb; type = 1 byte, msb = 8 bytes, lsb = 8 bytes)
(.clear vb)
(ser/write-refset-item vb item)
(.put items-db refsets-txn item-kb vb put-flags)
(when target-id
(doto assoc-kb .clear (.writeLong target-id) (.writeLong (:refsetId item)) (.writeLong (:referencedComponentId item)) (.writeLong msb) (.writeLong lsb)))
(if (:active item)
(do (.put components-db core-txn component-kb idx-val put-flags)
(when target-id (.put assocs-db core-txn assoc-kb idx-val put-flags)))
(do (.delete components-db core-txn component-kb)
(when target-id (.delete assocs-db core-txn assoc-kb))))))
(.put items-db refsets-txn item-kb vb put-flags)))
(recur (next items') (conj refset-ids (:refsetId item)))))
(.commit core-txn)
(.commit refsets-txn)
(finally (.release item-kb) (.release vb) (.release component-kb) (.release assoc-kb) (.release idx-val))))))
(finally (.release item-kb) (.release vb))))))

(defn drop-refset-indices
"Delete all indices relating to reference set items."
[^LmdbStore store]
(with-open [txn ^Txn (.txnWrite ^Env (.-coreEnv store))]
(let [components-db ^Dbi (.-componentRefsets store)
assocs-db ^Dbi (.-associations store)]
(.drop components-db txn)
(.drop assocs-db txn))
(.commit txn)))

(defn index-refsets
"Iterates all active reference set items and rebuilds indices.
Each *active* item is indexed:
- componentRefsets : referencedComponentId -- refsetId -- msb -- lsb
- associations : targetComponentId -- refsetId -- referencedComponentId - msb - lsb"
[^LmdbStore store]
(with-open [write-txn ^Txn (.txnWrite ^Env (.-coreEnv store))
read-txn ^Txn (.txnRead ^Env (.-refsetsEnv store))
cursor (.openCursor ^Dbi (.-refsetItems store) read-txn)]
(let [components-db ^Dbi (.-componentRefsets store)
assocs-db ^Dbi (.-associations store)
component-kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 32) ;; referencedComponentId -- refsetId -- msb -- lsb
assoc-kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 40) ;; targetComponentId -- refsetId -- referencedComponentId - msb - lsb
idx-val (.directBuffer (PooledByteBufAllocator/DEFAULT) 0)]
(try
(loop [continue? (.first cursor)]
(when continue?
(let [item (ser/read-refset-item (.val cursor))
msb (.getMostSignificantBits ^UUID (:id item))
lsb (.getLeastSignificantBits ^UUID (:id item))
target-id (:targetComponentId item)]
(when (:active item)
(doto component-kb .clear (.writeLong (:referencedComponentId item)) (.writeLong (:refsetId item)) (.writeLong msb) (.writeLong lsb))
(.put components-db write-txn component-kb idx-val put-flags)
(when target-id
(doto assoc-kb .clear (.writeLong target-id) (.writeLong (:refsetId item)) (.writeLong (:referencedComponentId item)) (.writeLong msb) (.writeLong lsb))
(.put assocs-db write-txn assoc-kb idx-val put-flags))))
(.resetReaderIndex ^ByteBuf (.val cursor))
(recur (.next cursor))))
(.commit write-txn)
(finally (.release component-kb) (.release assoc-kb) (.release idx-val))))))
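The 32-byte `componentRefsets` key built in `index-refsets` is four 8-byte longs written in sequence. The sketch below reproduces that layout with `java.nio.ByteBuffer` rather than the Netty direct buffers used in the commit; that swap, the function name `component-refset-key` and the sample identifiers are assumptions for illustration.

```clojure
(import '(java.nio ByteBuffer)
        '(java.util UUID))

(defn component-refset-key
  "Pack referencedComponentId -- refsetId -- msb -- lsb into a 32-byte array."
  [referenced-component-id refset-id ^UUID item-id]
  (let [bb (ByteBuffer/allocate 32)]   ; 4 longs x 8 bytes
    (doto bb
      (.putLong (long referenced-component-id))
      (.putLong (long refset-id))
      (.putLong (.getMostSignificantBits item-id))
      (.putLong (.getLeastSignificantBits item-id)))
    (.array bb)))

;; Sample values only; any longs and any UUID would do.
(def example-key
  (component-refset-key 24700007 734138000 (UUID/randomUUID)))
```

Because the referenced component id is written first, every key for the same component shares its leading 8 bytes, which is what lets a lookup by component be served from a contiguous range of keys. The 40-byte `associations` key follows the same idea with `targetComponentId` in the leading position.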

(defn stream-all
"Stream all values to the channel specified. "
"Stream all values from the specified dbi to the channel specified. "
([^Env env ^Dbi dbi ch read-fn]
(stream-all env dbi ch read-fn true))
([^Env env ^Dbi dbi ch read-fn close?]
@@ -303,7 +330,6 @@
(recur (.next cursor)))
(when close? (clojure.core.async/close! ch))))))))


(defn get-object [^Env env ^Dbi dbi ^long id read-fn]
(with-open [txn (.txnRead env)]
(let [kb (.directBuffer (PooledByteBufAllocator/DEFAULT) 8)]
@@ -400,7 +426,6 @@
(ser/read-refset-item vb))
(finally (.release kb)))))))


(defn get-raw-parent-relationships
"Return the parent relationships of the given concept.
Returns a list of tuples (from--type--group--to)."
6 changes: 4 additions & 2 deletions src/com/eldrix/hermes/impl/store.clj
@@ -473,9 +473,11 @@
(write-batch-one-by-one batch store))))


(defn build-indices [store]
(defn index [store]
(kv/drop-relationships-index store)
(kv/index-relationships store))
(kv/drop-refset-indices store)
(kv/index-relationships store)
(kv/index-refsets store))
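The drop-then-rebuild approach in `index` can be illustrated on plain data. This is a toy model under stated assumptions: reference set items are ordinary maps, the index is an in-memory map, and `rebuild-component-refsets` is a hypothetical name rather than anything in hermes.

```clojure
(defn rebuild-component-refsets
  "Derive referencedComponentId -> #{refsetId} from scratch,
  using only the items that are currently active."
  [items]
  (reduce (fn [idx {:keys [active referencedComponentId refsetId]}]
            (if active
              (update idx referencedComponentId (fnil conj #{}) refsetId)
              idx))
          {}
          items))

;; Item 2 is inactive, so it contributes nothing to the rebuilt index.
(def items
  [{:id 1 :active true  :referencedComponentId 100 :refsetId 900}
   {:id 2 :active false :referencedComponentId 100 :refsetId 901}
   {:id 3 :active true  :referencedComponentId 200 :refsetId 900}])

(def rebuilt (rebuild-component-refsets items))
```

Here `rebuilt` is `{100 #{900}, 200 #{900}}`. Since the index is always derived from the stored items, an item that has since become inactive simply never appears, with no need to delete individual index entries during import.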

(defmulti is-a? (fn [_store concept _parent-id] (class concept)))

8 changes: 5 additions & 3 deletions test/com/eldrix/hermes/store_test.clj
@@ -60,7 +60,7 @@
(store/write-batch {:type :info.snomed/Concept :data concepts} st)
(store/write-batch {:type :info.snomed/Description :data descriptions} st)
(store/write-batch {:type :info.snomed/Relationship :data relationships} st)
(store/build-indices st)
(store/index st)
(testing "Concept read/write"
(is (every? true? (map #(= % (store/get-concept st (:id %))) concepts))))
(testing "Concept descriptions"
@@ -85,11 +85,11 @@
r2 (gen/generate (rf2/gen-relationship {:sourceId 1089261000000101 :destinationId 213345000 :typeId 116680003 :active true :effectiveTime (java.time.LocalDate/of 2021 5 12)}))]
(with-open [st (store/open-store)]
(store/write-batch {:type :info.snomed/Relationship :data [r1 r2]} st)
(store/build-indices st)
(store/index st)
(is (= {116680003 #{213345000}} (store/get-parent-relationships st 1089261000000101))))
(with-open [st (store/open-store)]
(store/write-batch {:type :info.snomed/Relationship :data [r2 r1]} st)
(store/build-indices st)
(store/index st)
(is (= {116680003 #{213345000}} (store/get-parent-relationships st 1089261000000101))
"Different relationships with same source, target and type identifiers should result in indices deterministically, not on basis of import order"))))

@@ -105,6 +105,7 @@
(store/write-batch {:type :info.snomed/Concept :data [refset]} st)
(store/write-batch {:type :info.snomed/Concept :data concepts} st)
(store/write-batch {:type :info.snomed/SimpleRefset :data refset-items} st)
(store/index st)
(is (= #{refset-id} (store/get-installed-reference-sets st)))
(dorun (map #(is (= % (store/get-refset-item st (:id %)))) refset-items))
(is (every? true? (map #(= #{refset-id} (store/get-component-refset-ids st (:id %))) members)))
@@ -136,6 +137,7 @@
(with-open [store (store/open-store)]
(store/write-batch {:type :info.snomed/Concept :data [refset-concept]} store)
(store/write-batch {:type :info.snomed/RefsetDescriptorRefset :data [rd1 rd2]} store)
(store/index store)
(is (= rd1 (store/get-refset-item store (:id rd1))))
(is (= rd2 (store/get-refset-item store (:id rd2))))
(is (= (list rd1 rd2) (store/get-refset-descriptors store 1322291000000109)))