2640 - Add esingest package with base covid opensource dataset ingest #2655

Merged
merged 36 commits into from
Feb 6, 2024
Changes from 25 commits
Commits
6d7d082
Add es-ingest package scaffold
kbirk Feb 1, 2024
5ee9a7f
Fixes and updates
kbirk Feb 1, 2024
3ac847e
Get it running
kbirk Feb 1, 2024
c4e8509
Get ingest working
kbirk Feb 1, 2024
6246702
More fixes
kbirk Feb 1, 2024
108f701
Add some sweet backpressure
kbirk Feb 1, 2024
d2d2f97
Remove debug logs
kbirk Feb 1, 2024
dc0874a
Add things in config
kbirk Feb 1, 2024
e737385
Add gitignore
kbirk Feb 1, 2024
e72be61
Shutdown code
kbirk Feb 1, 2024
9a0a741
auto IntelliJ execute thingy
bigglesandginger Feb 1, 2024
5605426
Add blocking interface to run taskrunner tasks as a basic blocking re…
kbirk Feb 1, 2024
a188233
Merge branch '2640-opensource-ingest' of github.com:DARPA-ASKEM/terar…
kbirk Feb 1, 2024
bce9b4f
Remove unused file
kbirk Feb 1, 2024
f5c1ae8
Fixes to ingest and taskrunner
kbirk Feb 2, 2024
42f135b
Update ingest code to prevent conflicts
kbirk Feb 2, 2024
73eac2b
Fix issue with updates overwriting missing fields
kbirk Feb 2, 2024
ac6fb99
Fix knn search
kbirk Feb 2, 2024
68257b1
Add error handling
kbirk Feb 2, 2024
62473ea
Refactor, cleanup, and documentation
kbirk Feb 5, 2024
73aa84c
Revert change
kbirk Feb 5, 2024
0425053
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 5, 2024
e57663d
Disable tests
kbirk Feb 5, 2024
ac47124
Disable test
kbirk Feb 5, 2024
f6c371a
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 5, 2024
4ae29f6
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 5, 2024
0aa4c94
Fix the modelcard response wrapper.
kbirk Feb 5, 2024
a1179a8
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 6, 2024
29c1e88
Add configurable topics to ingest
kbirk Feb 6, 2024
f1502c1
Merge branch '2640-opensource-ingest' of github.com:DARPA-ASKEM/terar…
kbirk Feb 6, 2024
1dbcab9
Small updates
kbirk Feb 6, 2024
14e6707
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 6, 2024
4794f99
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 6, 2024
9ad6ead
Merge branch 'main' of github.com:DARPA-ASKEM/terarium into 2640-open…
kbirk Feb 6, 2024
0f27598
Merge branch '2640-opensource-ingest' of github.com:DARPA-ASKEM/terar…
kbirk Feb 6, 2024
ab45de7
Merge branch 'main' into 2640-opensource-ingest
kbirk Feb 6, 2024
10 changes: 10 additions & 0 deletions .vscode/launch.json
@@ -23,6 +23,16 @@
            "args": [
                "--spring.profiles.active=default,local"
            ]
        },
        {
            "type": "java",
            "name": "ElasticIngestApplication",
            "request": "launch",
            "mainClass": "software.uncharted.terarium.esingest.ElasticIngestApplication",
            "projectName": "es-ingest",
            "args": [
                "--spring.profiles.active=default,local"
            ]
        }
    ]
}
37 changes: 37 additions & 0 deletions packages/es-ingest/.gitignore
@@ -0,0 +1,37 @@
HELP.md
.gradle
build/
!gradle/wrapper/gradle-wrapper.jar
!**/src/main/**/build/
!**/src/test/**/build/

### STS ###
.apt_generated
.classpath
.factorypath
.project
.settings
.springBeans
.sts4-cache
bin/
!**/src/main/**/bin/
!**/src/test/**/bin/

### IntelliJ IDEA ###
.idea
*.iws
*.iml
*.ipr
out/
!**/src/main/**/out/
!**/src/test/**/out/

### NetBeans ###
/nbproject/private/
/nbbuild/
/dist/
/nbdist/
/.nb-gradle/

### VS Code ###
.vscode/
88 changes: 88 additions & 0 deletions packages/es-ingest/README.md
@@ -0,0 +1,88 @@
# Terarium Elasticsearch Ingest

This package quickly imports source documents, along with their embeddings, into Elasticsearch for kNN semantic search.
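
To make the goal concrete, here is a minimal brute-force sketch of what a kNN lookup over embeddings computes: rank stored vectors by similarity to a query vector and keep the top `k`. Elasticsearch's own kNN search is index-backed and approximate, so this is only an illustration of the ranking idea. The class `KnnSketch` and its methods are hypothetical and not part of this package.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the es-ingest package.
class KnnSketch {

    // Cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the ids of the k stored vectors most similar to the query,
    // most similar first.
    static List<String> nearest(Map<String, double[]> index, double[] query, int k) {
        return index.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> -cosine(e.getValue(), query)))
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, double[]> index = Map.of(
                "a", new double[] {1.0, 0.0},
                "b", new double[] {0.0, 1.0},
                "c", new double[] {0.9, 0.1});
        System.out.println(nearest(index, new double[] {1.0, 0.0}, 2));
    }
}
```

The embeddings ingested by this package play the role of the stored vectors in the sketch.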

## How to set up an ingest:

### Create the input class definitions:

An ingest requires two input classes: a document class that implements the `IInputDocument` interface and an embedding chunk class that implements the `IInputEmbeddingChunk` interface.

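```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import java.util.UUID;
import lombok.Data;

// A source document as it appears in the input files; unknown JSON fields are ignored.
@Data
@JsonIgnoreProperties(ignoreUnknown = true)
public class ExampleDocument implements IInputDocument {
    private UUID id;
    private String title;
    private String body;
}

// An embedding chunk as it appears in the input files; unknown JSON fields are ignored.
@Data
@JsonIgnoreProperties(ignoreUnknown = true)
public class ExampleEmbedding implements IInputEmbeddingChunk {
    private UUID id;
    private UUID embeddingChunkId;
    private long[] spans;
    private String title;
    private double[] embedding;
}
```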

### Create an `IElasticIngest` implementation:

Each ingest needs logic to convert the input types into the output types. Provide it by implementing the `IElasticIngest` interface:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class ExampleIngest implements IElasticIngest<ExampleDocument, Document, ExampleEmbedding, EmbeddingChunk> {

    private final ObjectMapper mapper = new ObjectMapper();

    // Convert an input document into the output document type.
    public Document processDocument(ExampleDocument input) {
        Document doc = new Document();
        doc.setId(input.getId());
        doc.setTitle(input.getTitle());
        doc.setFullText(input.getBody());
        return doc;
    }

    // Convert an input embedding chunk into the output chunk type.
    public EmbeddingChunk processEmbedding(ExampleEmbedding input) {
        Embedding embedding = new Embedding();
        embedding.setEmbeddingId(input.getEmbeddingChunkId());
        embedding.setSpans(input.getSpans());
        embedding.setVector(input.getEmbedding());
        EmbeddingChunk chunk = new EmbeddingChunk();
        chunk.setId(input.getId());
        chunk.setEmbedding(embedding);
        return chunk;
    }

    // Parse one line of the document input into an ExampleDocument.
    public ExampleDocument deserializeDocument(String line) {
        try {
            return mapper.readValue(line, ExampleDocument.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Parse one line of the embedding input into an ExampleEmbedding.
    public ExampleEmbedding deserializeEmbedding(String line) {
        try {
            return mapper.readValue(line, ExampleEmbedding.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}
```
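
How the package drives these four methods is not shown in this excerpt. As a rough sketch only, assuming a two-pass flow (documents first, then embedding chunks attached by document id), the loop might look like the following. Everything here is hypothetical: `SketchIngest` is a stand-in that mirrors only the four methods from the example above, the record types are toys, and lines are tab-separated rather than JSON so the sketch needs no libraries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for IElasticIngest; the real interface may differ.
interface SketchIngest<DI, DO, EI, EO> {
    DI deserializeDocument(String line);
    DO processDocument(DI input);
    EI deserializeEmbedding(String line);
    EO processEmbedding(EI input);
}

// Toy input and output types (assumptions, not the package's classes).
record RawDoc(String id, String title, String body) {}
record OutDoc(String id, String title, String fullText, List<OutChunk> chunks) {}
record RawChunk(String id, String chunkId, double[] vector) {}
record OutChunk(String docId, String chunkId, double[] vector) {}

// Parses tab-separated lines: "id<TAB>title<TAB>body" and "id<TAB>chunkId<TAB>v1,v2,...".
class TsvIngest implements SketchIngest<RawDoc, OutDoc, RawChunk, OutChunk> {
    public RawDoc deserializeDocument(String line) {
        String[] f = line.split("\t", 3);
        return new RawDoc(f[0], f[1], f[2]);
    }
    public OutDoc processDocument(RawDoc in) {
        return new OutDoc(in.id(), in.title(), in.body(), new ArrayList<>());
    }
    public RawChunk deserializeEmbedding(String line) {
        String[] f = line.split("\t", 3);
        String[] parts = f[2].split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return new RawChunk(f[0], f[1], v);
    }
    public OutChunk processEmbedding(RawChunk in) {
        return new OutChunk(in.id(), in.chunkId(), in.vector());
    }
}

class IngestDriverSketch {
    // Pass 1 converts every document line; pass 2 converts every embedding
    // line and attaches the chunk to the document with the same id.
    static <DI, EI> Map<String, OutDoc> run(
            SketchIngest<DI, OutDoc, EI, OutChunk> ingest,
            List<String> docLines,
            List<String> embeddingLines) {
        Map<String, OutDoc> byId = new LinkedHashMap<>();
        for (String line : docLines) {
            OutDoc doc = ingest.processDocument(ingest.deserializeDocument(line));
            byId.put(doc.id(), doc);
        }
        for (String line : embeddingLines) {
            OutChunk chunk = ingest.processEmbedding(ingest.deserializeEmbedding(line));
            OutDoc parent = byId.get(chunk.docId());
            if (parent != null) {
                parent.chunks().add(chunk); // chunks without a known document are skipped
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<String, OutDoc> out = run(
                new TsvIngest(),
                List.of("d1\tTitle one\tBody one", "d2\tTitle two\tBody two"),
                List.of("d1\tc1\t0.1,0.2", "d1\tc2\t0.3,0.4", "d9\tc3\t0.5,0.6"));
        out.values().forEach(d -> System.out.println(d.id() + " chunks=" + d.chunks().size()));
    }
}
```

Keeping deserialization and conversion as separate methods means the driver never needs to know the input format; only the ingest implementation does.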

### Configuring the ingest in `application.properties`:

Add an ingest entry to `application.properties`:

```
terarium.esingest.ingestParams[0].name=A sample ingest
terarium.esingest.ingestParams[0].inputDir=/path/to/source/dir
terarium.esingest.ingestParams[0].outputIndexRoot=example
terarium.esingest.ingestParams[0].ingestClass=software.uncharted.terarium.esingest.ingests.CovidIngest
terarium.esingest.ingestParams[0].clearBeforeIngest=true
```
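
The indexed `ingestParams[N]` keys describe a list of ingest entries. Spring Boot binds them through its own configuration machinery; the sketch below only illustrates the shape of that list using `java.util.Properties`. The names `IngestParamsSketch` and `IngestConfigSketch` are hypothetical, and the `false` default for `clearBeforeIngest` is an assumption, not something this excerpt confirms.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical holder mirroring the five keys of one ingest entry.
record IngestParamsSketch(String name, String inputDir, String outputIndexRoot,
                          String ingestClass, boolean clearBeforeIngest) {}

class IngestConfigSketch {
    static final String PREFIX = "terarium.esingest.ingestParams";

    // Collects ingestParams[0], ingestParams[1], ... and stops at the first
    // index that has no "name" key.
    static List<IngestParamsSketch> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<IngestParamsSketch> result = new ArrayList<>();
        for (int i = 0; ; i++) {
            String base = PREFIX + "[" + i + "].";
            String name = props.getProperty(base + "name");
            if (name == null) {
                break;
            }
            result.add(new IngestParamsSketch(
                    name,
                    props.getProperty(base + "inputDir"),
                    props.getProperty(base + "outputIndexRoot"),
                    props.getProperty(base + "ingestClass"),
                    // Assumed default when the key is absent.
                    Boolean.parseBoolean(props.getProperty(base + "clearBeforeIngest", "false"))));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                PREFIX + "[0].name=A sample ingest",
                PREFIX + "[0].inputDir=/path/to/source/dir",
                PREFIX + "[0].outputIndexRoot=example",
                PREFIX + "[0].ingestClass=software.uncharted.terarium.esingest.ingests.CovidIngest",
                PREFIX + "[0].clearBeforeIngest=true");
        parse(text).forEach(System.out::println);
    }
}
```

A second ingest would use index `[1]` with the same five keys.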
48 changes: 48 additions & 0 deletions packages/es-ingest/build.gradle
@@ -0,0 +1,48 @@
plugins {
    id 'java'
    id 'org.springframework.boot' version '3.1.5'
    id 'io.spring.dependency-management' version '1.1.4'
}

group = 'software.uncharted'
version = '1.0.0-SNAPSHOT'
sourceCompatibility = '17'

apply plugin: 'idea'

configurations {
    compileOnly {
        extendsFrom annotationProcessor
    }
}

project.ext {
    artifactName = 'es-ingest'
    description = 'imports models into es'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'org.springframework:spring-web'
    implementation 'co.elastic.clients:elasticsearch-java:8.8.1'
    implementation 'org.elasticsearch.client:elasticsearch-rest-high-level-client:7.17.1'
    implementation 'org.springframework.boot:spring-boot-starter'
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.14.2'
    compileOnly 'org.projectlombok:lombok'
    developmentOnly 'org.springframework.boot:spring-boot-devtools'
    annotationProcessor 'org.projectlombok:lombok'
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testAnnotationProcessor 'org.projectlombok:lombok'
    testCompileOnly 'org.projectlombok:lombok'
}

tasks.named('test') {
    useJUnitPlatform()
}

dependencyLocking {
    lockAllConfigurations()
}
Binary file not shown.
7 changes: 7 additions & 0 deletions packages/es-ingest/gradle/wrapper/gradle-wrapper.properties
@@ -0,0 +1,7 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.5-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists