Add language detection to REST API #659

UnniKohonen · 2022-12-29T12:42:07Z

This PR adds the ability to detect the language of a text to the REST API. The language detection uses the simplemma Python library.

A POST method is added to the endpoint /detect-language. It expects the request body to include a JSON object with the text whose language is to be detected and a list of candidate languages given as IETF BCP 47 codes. For example:

{
  "languages": ["en", "fi"],
  "text": "A quick brown fox jumped over the lazy dog."
}

The response is a JSON object with the format:

{
  "results": [
    {"language": "en", "score": 0.85},
    {"language": "fi", "score": 0.1},
    {"language": null, "score": 0.1} 
  ]
}

where the scores range from 0 to 1 and a null value is used for an unknown language.
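As a rough client-side sketch of the exchange described above, the snippet below builds the request body and picks the highest-scoring entry from a response of the documented shape. The helper names `build_request` and `best_language` are hypothetical and not part of Annif, and no HTTP call is made; the response is simply the example from this description.

```python
import json
from typing import Optional


def build_request(text: str, languages: list[str]) -> str:
    """Serialize the JSON body expected by POST /detect-language."""
    return json.dumps({"languages": languages, "text": text})


def best_language(response: dict) -> Optional[str]:
    """Return the language with the highest score.

    None is returned both for an empty result list and when the
    top-scoring entry is the unknown language (null in JSON).
    """
    results = response.get("results", [])
    if not results:
        return None
    return max(results, key=lambda r: r["score"])["language"]


body = build_request("A quick brown fox jumped over the lazy dog.", ["en", "fi"])

# Example response copied from the PR description, with null mapped to None.
example_response = {
    "results": [
        {"language": "en", "score": 0.85},
        {"language": "fi", "score": 0.1},
        {"language": None, "score": 0.1},
    ]
}
print(best_language(example_response))
```

A real client would send `body` to the endpoint with an HTTP library of its choice and pass the decoded JSON response to `best_language`.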

Implements REST API part of #631.

@juhoinkinen: I edited this for the latest changes (parameter name change candidates -> languages).

codecov · 2022-12-29T12:46:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.65%. Comparing base (337ee70) to head (36b479a).
Report is 19 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #659   +/-   ##
=======================================
  Coverage   99.64%   99.65%           
=======================================
  Files          91       93    +2     
  Lines        6831     6889   +58     
=======================================
+ Hits         6807     6865   +58     
  Misses         24       24


juhoinkinen · 2023-01-02T10:47:44Z

I see this is only a draft at the moment, but I took a glance, and I think it would be better if the endpoint name used a hyphen instead of an underscore (/detect_language -> /detect-language), as that seems to be the preferred convention.

osma

Looks like a good start. I gave some comments on individual code lines. In addition to those:

- Black formatting should be applied (see details)
- there should be a unit test in tests/test_rest.py which exercises the detect_language method


osma · 2023-01-16T08:26:38Z

Thanks for adding tests. A few more things:

- There should be tests for special cases, e.g. empty input, no candidate languages, unknown or malformed candidate languages...
- What if there are 20 candidate languages? How much memory will it take to handle the query? Will the memory be released afterwards? (I think not; Simplemma keeps models in memory, AFAIK.)

UnniKohonen · 2023-01-17T14:03:09Z

Right now, when making a request with no candidates or unknown candidates, the endpoint returns an empty list, and when making a request with no text, it returns { "language": null, "score": 1 }. Does this make sense? Or should it always return the unknown language with score 1 when the input is incorrect?

I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory, which was used almost completely after making the request. The memory isn't released automatically afterwards, but it is freed if the endpoint is accessed again (simplemma is run again) or Annif is restarted. Other requests also slow down a lot after running simplemma with all candidates.

osma · 2023-01-18T08:38:18Z

> Right now, when making a request with no candidates or unknown candidates, the endpoint returns an empty list

The good news is that it's not crashing! 😁

My opinions on these cases:

- With no candidates given, I think it would be good to return a 400 Bad Request status code, with a descriptive error message.
- With unknown candidates (language codes that are unrecognized/unsupported by simplemma), I think a 400 Bad Request with a descriptive error message would also be appropriate.

There should be unit tests to check that these are indeed the results.
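One way the two cases above could be checked is sketched below. This is not the code from this PR: `validate_candidates` and `SUPPORTED_LANGUAGES` are hypothetical names, and the small language set is only a stand-in for whatever simplemma actually supports.

```python
from typing import Optional

# Assumption: a tiny stand-in for the set of languages simplemma supports.
SUPPORTED_LANGUAGES = {"en", "fi", "sv"}


def validate_candidates(languages: list[str]) -> Optional[tuple[int, str]]:
    """Return (status, message) for a bad request, or None if the input is fine."""
    if not languages:
        return 400, "At least one candidate language is required"
    # Report every unrecognized code at once, in a stable order.
    unknown = sorted(set(languages) - SUPPORTED_LANGUAGES)
    if unknown:
        return 400, "Unsupported candidate language(s): " + ", ".join(unknown)
    return None


print(validate_candidates([]))
print(validate_candidates(["en", "xx"]))
print(validate_candidates(["en", "fi"]))
```

Keeping the check in a plain function like this would make the requested unit tests straightforward, since each special case maps to one assertion on the returned status.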

> and when making a request with no text, it returns { "language": null, "score": 1 }. Does this make sense? Or should it always return the unknown language with score 1 when the input is incorrect?

I think this is OK, but there should also be a unit test for this special case.

> I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory, which was used almost completely after making the request. The memory isn't released automatically afterwards, but it is freed if the endpoint is accessed again (simplemma is run again) or Annif is restarted. Other requests also slow down a lot after running simplemma with all candidates.

Great, thanks for testing! This is mostly what I suspected, although it's a surprise that accessing the endpoint again will free the memory. (Maybe this has to do with Flask running in development mode?)

This has some potential for DoS situations (intended or not), but I guess it's hard for us to avoid that given how Simplemma works. We could, however, limit the number of candidate languages per request to, say, at most 5. What do others think? @juhoinkinen ?

We could also try to work with the Simplemma maintainer if we want to change the way Simplemma allocates and releases models. For example, it could be possible to ask Simplemma to release the memory immediately or after a set period like 60 seconds after use.

juhoinkinen · 2023-01-18T09:25:13Z

> > I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory, which was used almost completely after making the request. The memory isn't released automatically afterwards, but it is freed if the endpoint is accessed again (simplemma is run again) or Annif is restarted. Other requests also slow down a lot after running simplemma with all candidates.

> Great, thanks for testing! This is mostly what I suspected, although it's a surprise that accessing the endpoint again will free the memory. (Maybe this has to do with Flask running in development mode?)

> This has some potential for DoS situations (intended or not), but I guess it's hard for us to avoid that given how Simplemma works. We could, however, limit the number of candidate languages per request to, say, at most 5. What do others think? @juhoinkinen ?

Limiting the number of candidate languages seems reasonable. If there is no simple way to make the limit configurable, 5 could be a good number for that.

> We could also try to work with the Simplemma maintainer if we want to change the way Simplemma allocates and releases models. For example, it could be possible to ask Simplemma to release the memory immediately or after a set period like 60 seconds after use.

I noticed there is an issue in the Simplemma repository about loading models into memory, which was opened just yesterday.
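To illustrate the configurable limit discussed in this comment, here is a minimal sketch. The environment variable `MAX_LANGUAGE_CANDIDATES` and both function names are hypothetical, not Annif settings; the commit list of this PR indicates the limit was eventually enforced at the schema level instead.

```python
import os
from typing import Optional

DEFAULT_MAX_CANDIDATES = 5


def max_candidates() -> int:
    """Read the limit from a hypothetical environment variable, defaulting to 5."""
    return int(os.environ.get("MAX_LANGUAGE_CANDIDATES", DEFAULT_MAX_CANDIDATES))


def within_candidate_limit(languages: list[str], limit: Optional[int] = None) -> bool:
    """Return True if the number of candidate languages does not exceed the limit."""
    if limit is None:
        limit = max_candidates()
    return len(languages) <= limit


print(within_candidate_limit(["en", "fi", "sv"], limit=5))
print(within_candidate_limit(["en", "fi", "sv", "de", "fr", "es"], limit=5))
```

A request that fails this check could then be rejected before any simplemma dictionaries are loaded, which is the point of the limit.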

sonarqubecloud · 2023-01-19T12:09:29Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

No Coverage information
0.0% Duplication

juhoinkinen · 2023-01-19T12:49:02Z

I just started wondering whether some testing could also be performed in tests/test_swagger.py. I don't remember what additional functionality, if any, the tests in test_swagger.py cover compared to those in test_rest.py.

juhoinkinen · 2023-01-19T14:24:22Z

> I just started wondering whether some testing could also be performed in tests/test_swagger.py. I don't remember what additional functionality, if any, the tests in test_swagger.py cover compared to those in test_rest.py.

Background to the question of test_swagger.py vs test_rest.py: #551 (comment)

osma · 2024-09-16T10:55:10Z

Rebased on PR #724, adapted to use annif.simplemma_util and Connexion 3 support, force-pushed.


sonarqubecloud · 2024-09-16T13:51:15Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

juhoinkinen

I tested this in a few quick ways, and it works nicely!

UnniKohonen added the enhancement label Dec 29, 2022

osma requested changes Jan 9, 2023


osma mentioned this pull request Jan 19, 2023

Refactor/better caching strategy adbar/simplemma#34

Closed

juhoinkinen linked an issue May 11, 2023 that may be closed by this pull request

Language detection method in REST API & CLI #631

Closed

osma mentioned this pull request Aug 9, 2023

Plans for simplemma 1.0 release? adbar/simplemma#110

Closed

osma added 5 commits September 16, 2024 11:44

use simplemma github main branch instead of last release 0.9.1

5d18bfa

limit number of in-memory Simplemma dictionaries to at most 5

2c3e2c3

access simplemma functionality only via annif.simplemma_util

07c2b59

upgrade to Simplemma 1.0

d0fa432

upgrade to simplemma 1.1.1

b66279e

osma mentioned this pull request Sep 16, 2024

Upgrade Simplemma & limit its memory usage #724

Merged

enable using get_language_detector with many languages + add unit test

4a69662

osma force-pushed the issue631-rest-api-language-detection branch from 34c2538 to 1cd8003 Compare September 16, 2024 10:53

osma force-pushed the issue631-rest-api-language-detection branch from 1cd8003 to 7ccbbf0 Compare September 16, 2024 12:09

osma and others added 6 commits September 16, 2024 15:14

use pytest.approx instead of comparing float values directly

61c9409

Add language detection to REST API

d5c8677

Use a json object in request body

9ef4680

Change endpoint name

5673dce

Add unit tests

c34369e

Fix OpenAPI spec

2ed2839

UnniKohonen and others added 4 commits September 16, 2024 15:18

Fix for loop

b32462e

Add error status codes

51b307e

Add unit tests for bad requests

356c88c

adapt to annif.simplemma_util and newer Connexion

065bfeb

osma force-pushed the issue631-rest-api-language-detection branch from 7ccbbf0 to 065bfeb Compare September 16, 2024 12:19

osma marked this pull request as ready for review September 16, 2024 12:19

limit number of candidate languages on schema level; change parameter…

36b479a

… name to languages

osma force-pushed the issue631-rest-api-language-detection branch from c37ee41 to 36b479a Compare September 16, 2024 13:50

osma requested a review from juhoinkinen September 16, 2024 13:59

juhoinkinen added this to the 1.2 milestone Sep 17, 2024

juhoinkinen approved these changes Sep 17, 2024


juhoinkinen merged commit c42a93f into main Sep 17, 2024
17 checks passed

juhoinkinen deleted the issue631-rest-api-language-detection branch September 17, 2024 07:57

This was referenced Sep 17, 2024

Language detection method in CLI #799

Closed

Sort language detection results by descending score #800

Merged

juhoinkinen mentioned this pull request Sep 27, 2024

Detect language of text NatLibFi/FintoAI#9

Open

juhoinkinen mentioned this pull request Nov 5, 2024

Detect language of text with web UI #815

Open
