Commit

Merge pull request #201 from Inist-CNRS/services/data-rapido/create-data-rapido

[data-rapido] create data-rapido
parmentf authored Nov 28, 2024
2 parents fcff985 + ca81642 commit c83b8b0
Showing 39 changed files with 2,611 additions and 1,202 deletions.
6 changes: 0 additions & 6 deletions services/data-computer/Dockerfile
@@ -7,19 +7,13 @@ USER root
RUN pip install \
gensim==4.3.2 \
spacy==3.6.1 \
spacy_lefff==0.5.1 \
en-core-web-sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0.tar.gz \
pandas==1.4.0 \
lxml==4.7.1 \
fr_core_news_sm@https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl \
scipy==1.10.1 \
prometheus-client==0.19.0

# Install CURL to get spacy_lefff model and put it at the right place
RUN apt update && apt -y install curl tar
RUN curl -L -O https://github.com/sammous/spacy-lefff-model/releases/latest/download/model.tar.gz
RUN mkdir /usr/local/lib/python3.9/site-packages/spacy_lefff/data/tagger
RUN tar -xf model.tar.gz -C /usr/local/lib/python3.9/site-packages/spacy_lefff/data/tagger
WORKDIR /app/public

ENV NUMBA_CACHE_DIR=/tmp/numba_cache
65 changes: 0 additions & 65 deletions services/data-computer/README.md
@@ -178,71 +178,6 @@ cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fe
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```

### v1/rapido

Web service for the rapido project. It takes as input a tar.gz archive containing a `data` folder that holds all the XML documents to process. It returns JSON listing the alignments the algorithm was able to make between the text and the idRef reference repository.

For example, using example-xml-rapido.tar.gz, the result is:

```json
{
"idArticle": "bch_0007-4217_2003_num_127_2_9424",
"title": "Aséa",
"sites": [],
"entite": [
{
"name": "ville basse",
"occurences": [
{
"page": "bch_0007-4217_2003_num_127_2_T1_0778_0000",
"text": " papamarinopoulos ( université de patras ) ont entrepris un projet commun de prospection géophysique dans la **ville basse** d' aséa dans le but de retrouver les sections de l' enceinte recouverte par une couche d' alluvions stériles déposées par l' alphée ."
}
],
"notice": "https://www.idref.fr/192337963.rdf",
"score": "PP(0)"
},
{
"name": "patras",
"occurences": [
{
"page": "bch_0007-4217_2003_num_127_2_T1_0778_0000",
"text": " papamarinopoulos ( université de **patras** ) ont entrepris un projet commun de prospection géophysique dans la ville basse d' aséa dans le but de retrouver les sections de l' enceinte recouverte par une couche d' alluvions stériles déposées par l' alphée ."
},
{
"page": "bch_0007-4217_2003_num_127_2_T1_0778_0000",
"text": " les données recueillies en 2002 ont été traitées au laboratoire de géophysique du département de géologie de l' université de **patras** ."
}
],
"notice": "https://www-dev.idref.fr/050189484.rdf",
"score": "PP(0)"
}
]
}
```
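
A minimal Python sketch of how a client might read this response, assuming the JSON always has the shape shown above (`entite`, `occurences`, `notice`, `score`, with the key spelled `occurences` as in the example). The function name `summarize_entities` is hypothetical and not part of the service; the sample is a shortened copy of the example with the occurrence texts elided.

```python
import json


def summarize_entities(document: dict) -> list:
    """Return (name, notice URL, number of occurrences) for each aligned entity."""
    return [
        (entity["name"], entity["notice"], len(entity.get("occurences", [])))
        for entity in document.get("entite", [])
    ]


# Shortened copy of the example response (occurrence pages and texts elided).
sample = json.loads("""
{
  "idArticle": "bch_0007-4217_2003_num_127_2_9424",
  "title": "Aséa",
  "sites": [],
  "entite": [
    {
      "name": "ville basse",
      "occurences": [{"page": "...", "text": "..."}],
      "notice": "https://www.idref.fr/192337963.rdf",
      "score": "PP(0)"
    },
    {
      "name": "patras",
      "occurences": [{"page": "...", "text": "..."}, {"page": "...", "text": "..."}],
      "notice": "https://www-dev.idref.fr/050189484.rdf",
      "score": "PP(0)"
    }
  ]
}
""")

for name, notice, count in summarize_entities(sample):
    print(f"{name}: {count} occurrence(s), {notice}")
```

Using `.get` with a default keeps the sketch tolerant of a document with no aligned entities, which is an assumption about how the service reports empty results.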

#### URL parameter(s)

| name                | description                            |
| ------------------- | -------------------------------------- |
| indent (true/false) | Indent the result returned immediately |

#### HTTP header(s)

| name   | description                                           |
| ------ | ----------------------------------------------------- |
| X-Hook | URL to call when the result is available (optional)   |

#### Command-line example

```bash
# Send data for batch processing
cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/rapido" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```
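
The same two calls could be sketched in Python with the standard library, assuming the service is reachable at the host and port used in the curl example. The helper names (`build_rapido_request`, `build_retrieve_request`, `send`, `process_archive`) are hypothetical, and the sketch does not wait for processing to finish: in practice the retrieve call only makes sense once the corpus has been processed, for instance after the `X-Hook` URL has been called.

```python
import urllib.request
from typing import Optional

# Host and port copied from the curl example; adjust for a real deployment.
BASE_URL = "http://localhost:31976"


def build_rapido_request(archive: bytes, hook_url: Optional[str] = None,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare the POST to v1/rapido; the X-Hook header is optional."""
    headers = {"X-Hook": hook_url} if hook_url else {}
    return urllib.request.Request(
        f"{base_url}/v1/rapido", data=archive, headers=headers, method="POST"
    )


def build_retrieve_request(receipt: bytes,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare the POST to v1/retrieve, sending back the JSON returned by v1/rapido."""
    return urllib.request.Request(
        f"{base_url}/v1/retrieve", data=receipt, method="POST"
    )


def send(request: urllib.request.Request) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def process_archive(input_path: str, output_path: str) -> None:
    """Submit an archive, then exchange the returned JSON for the result archive."""
    with open(input_path, "rb") as archive_file:
        receipt = send(build_rapido_request(archive_file.read()))
    # Assumption: processing has finished by the time this second call is made.
    with open(output_path, "wb") as result_file:
        result_file.write(send(build_retrieve_request(receipt)))
```

Separating request construction from sending makes it possible to inspect the URL and headers without a running service.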

### v1/corpus-similarity

Compares small documents (titles, sentences, short *abstracts*) with one another and returns, for each document, the documents that are similar to it.
10 changes: 0 additions & 10 deletions services/data-computer/examples.http
@@ -105,16 +105,6 @@ X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
< ./example-json.tar.gz


###
# @name v1Rapido
POST {{host}}/v1/rapido HTTP/1.1
Content-Type: application/x-tar
X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-xml-rapido.tar.gz


###
# @name v1Small
POST {{host}}/v1/small HTTP/1.1
1 change: 0 additions & 1 deletion services/data-computer/tests.hurl
@@ -155,7 +155,6 @@ HTTP 200


# TODO: add the two other routes (v1GraphSegment, v1Lda)
# TODO: add the rapido route

##################################### group-by ######################
POST {{host}}/v1/group-by
Binary files not shown (4 files).