Elasticsearch / NC24 #699

Closed
fred-gb opened this issue May 16, 2022 · 11 comments · Fixed by nextcloud/files_fulltextsearch#191

@fred-gb

fred-gb commented May 16, 2022

Hello, 👋

Everything runs in Docker:

  • Nextcloud 24.0
  • Elasticsearch 7.17.3 (with --batch ingest-attachment)
  • Tesseract in docker Nextcloud

I think everything is installed correctly.
But when I search with the "Search" app, the results are empty.

occ fulltextsearch:test --platform_delay 3

.Testing your current setup:
Creating mocked content provider. ok
Testing mocked provider: get indexable documents. (2 items) ok
Loading search platform. (Elasticsearch) ok
Testing search platform. ok
Locking process ok
Removing test. ok
Pausing 3 seconds 1 2 3 ok
Initializing index mapping. ok
Indexing generated documents. ok
Pausing 3 seconds 1 2 3 ok
Retreiving content from a big index (license). (size: 32386) ok
Comparing document with source. ok
Searching basic keywords:
 - 'test' (result: 1, expected: ["simple"]) ok
 - 'document is a simple test' (result: 2, expected: ["simple","license"]) ok
 - '"document is a test"' (result: 0, expected: []) ok
 - '"document is a simple test"' (result: 1, expected: ["simple"]) ok
 - 'document is a simple -test' (result: 1, expected: ["license"]) ok
 - 'document is a simple +test' (result: 1, expected: ["simple"]) ok
 - '-document is a simple test' (result: 0, expected: []) ok
 - 'document is a simple +test +testing' (result: 1, expected: ["simple"]) ok
 - 'document is a simple +test -testing' (result: 0, expected: []) ok
 - 'document is a +simple -test -testing' (result: 0, expected: []) ok
 - '+document is a simple -test -testing' (result: 1, expected: ["license"]) ok
 - 'document is a +simple -license +testing' (result: 1, expected: ["simple"]) ok
Updating documents access. ok
Pausing 3 seconds 1 2 3 ok
Searching with group access rights:
 - 'license' - [] -  (result: 0, expected: []) ok
 - 'license' - ["group_1"] -  (result: 1, expected: ["license"]) ok
 - 'license' - ["group_1","group_2"] -  (result: 1, expected: ["license"]) ok
 - 'license' - ["group_3","group_2"] -  (result: 1, expected: ["license"]) ok
 - 'license' - ["group_3"] -  (result: 0, expected: []) ok
Searching with share rights:
 - 'license' - notuser -  (result: 0, expected: []) ok
 - 'license' - user2 -  (result: 1, expected: ["license"]) ok
 - 'license' - user3 -  (result: 1, expected: ["license"]) ok
Removing test. ok
Unlocking process ok

occ fulltextsearch:check

Full text search 24.0.0

- Search Platform:
Elasticsearch 24.0.0 (Selected)
{
    "elastic_host": [
        "http://elastic:********@127.0.0.1:9200"
    ],
    "elastic_index": "nextcloud",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "analyzer_tokenizer": "standard"
}

- Content Providers:
Deck 1.7.0
[]
Files 24.0.0
{
    "files_local": "1",
    "files_external": "1",
    "files_group_folders": "1",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "20",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_fulltextsearch_tesseract": {
        "version": "24.0.0",
        "enabled": "1",
        "psm": "4",
        "lang": "fra,eng",
        "pdf": "1",
        "pdf_limit": "0"
    }
}

occ fulltextsearch:index

┌─ Indexing  ────
│ Action: fillDocument
│ Provider: Files                Account: XXXXXXX
│ Document: 6
│ Info: application/pdf
│ Title: Documents/Nextcloud flyer.pdf
│ Content size:
│ Chunk:     10/44
│ Progress:      0/1
└──
┌─ Results ────
│ Result:     11/11
│ Index: deck:10
│ Status: ok
│ Message: {"_index":"nextcloud","_type":"_doc","_id":"deck:10","_version":1,"
│ result":"noop","_shards":{"total":0,"successful":0,"failed":0},"_seq_no":14,"
│ _primary_term":1}
└──
┌─ Errors ────
│ Error:      0/0
│ Index:
│ Exception:
│ Message:
│
│
└──
## x:first result ## c/v:prec/next result ## b:last result
## f:first error ## h/j:prec/next error ## d:delete error ## l:last error
## q:quit ## p:pause
An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php:814
Stack trace:
#0 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(747): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(727): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /var/www/html/custom_apps/files_fulltextsearch/lib/Service/FilesService.php(618): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /var/www/html/custom_apps/files_fulltextsearch/lib/Provider/FilesProvider.php(288): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /var/www/html/custom_apps/fulltextsearch/lib/Service/IndexService.php(315): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /var/www/html/custom_apps/fulltextsearch/lib/Service/IndexService.php(195): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /var/www/html/custom_apps/fulltextsearch/lib/Command/Index.php(416): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'comacho', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /var/www/html/custom_apps/fulltextsearch/lib/Command/Index.php(279): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /var/www/html/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /var/www/html/core/Command/Base.php(168): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /var/www/html/3rdparty/symfony/console/Application.php(1009): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/3rdparty/symfony/console/Application.php(273): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/3rdparty/symfony/console/Application.php(149): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/lib/private/Console/Application.php(211): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/console.php(99): OC\Console\Application->run()
#15 /var/www/html/occ(11): require_once('/var/www/html/c...')

When I search from the search page, I get no results for files. 😢

Can you help me?

Thanks!

@aaronriedel

aaronriedel commented May 19, 2022

I get the same error while running the same setup:

    Nextcloud 24.0
    Elasticsearch 7.17.3 (with --batch ingest-attachment)
    Tesseract in docker Nextcloud

So it is definitely reproducible.

@fred-gb
Author

fred-gb commented May 19, 2022

Hello,

A few seconds ago, I finished trying something new.

I built a Nextcloud Docker image with Tesseract 5.

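FROM nextcloud:24.0.0

# Base tools, including what the extra repository setup needs (wget, gnupg2, ca-certificates)
RUN apt-get update -qq && \
      apt-get install -yqq --no-install-recommends \
      ca-certificates \
      lsb-release \
      software-properties-common \
      wget \
      curl \
      vim \
      libmagickcore-6.q16-6-extra \
      less \
      zip \
      smbclient \
      ghostscript \
      gnupg2

# Add the third-party Tesseract 5 repository for Debian bullseye
RUN echo "deb https://notesalexp.org/tesseract-ocr5/bullseye/ bullseye main" | tee -a /etc/apt/sources.list

# Import the repository signing key, then refresh the package lists
RUN wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && apt-get update -qq

RUN apt-get install -y tesseract-ocr \
      imagemagick

# Remove the downloaded package lists
RUN apt-get purge -yqq && rm -rf /var/lib/apt/lists/*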

I no longer get the error, but there is a new problem when I launch indexing.

┌─ Indexing  ────
│ Action: compareWithCurrentIndex
│ Provider: Files                Account: USER
│ Document: 10
│ Info: /USER/files/Talk
│ Title:
│ Content size:
│ Chunk:      5/5
│ Progress:    all/1
└──
┌─ Results ────
│ Result:     11/11
│ Index: deck:10
│ Status: ok
│ Message: {"_index":"index","_type":"_doc","_id":"deck:10","_version":1,"resu
│ lt":"noop","_shards":{"total":0,"successful":0,"failed":0},"_seq_no":9,"_prim
│ ary_term":1}
└──
┌─ Errors ────
│ Error:      4/4
│ Index: files:2991
│ Exception: Elasticsearch\Common\Exceptions\NoNodesAvailableException
│ Message: unknown error
│
│
└──

It reports 4/4 errors, with a NoNodesAvailableException.

But when I query Elasticsearch with curl:

curl http://127.0.0.1:9200
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   546  100   546    0     0  27300      0 --:--:-- --:--:-- --:--:-- 28736
{
  "name" : "a358e612799e",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "bsxhTjKSSu2EvbsYrmBOSg",
  "version" : {
    "number" : "7.17.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "5ad023604c8d7416c9eb6c0eadb62b14e766caff",
    "build_date" : "2022-04-19T08:11:19.070913226Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

I get a response from Elasticsearch.
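
To repeat this check from the environment that actually runs occ (for example inside the Nextcloud container), the same request can be scripted. The following is a minimal Python sketch, not part of Nextcloud or the full text search apps: the function names `parse_es_version` and `fetch_es_version` are hypothetical, and it assumes an endpoint that answers without authentication, as the curl call above did.

```python
import json
import urllib.request


def parse_es_version(body: str) -> tuple[str, int]:
    """Return (version string, major version) from the JSON served at the Elasticsearch root URL."""
    info = json.loads(body)
    number = info["version"]["number"]  # e.g. "7.17.3" in the reply above
    major = int(number.split(".")[0])
    return number, major


def fetch_es_version(base_url: str, timeout: float = 5.0) -> tuple[str, int]:
    """Fetch the root document from Elasticsearch and parse its version.

    Assumption: base_url carries no credentials; this sketch does not handle authentication.
    """
    with urllib.request.urlopen(base_url, timeout=timeout) as resp:
        return parse_es_version(resp.read().decode("utf-8"))
```

For example, `fetch_es_version("http://127.0.0.1:9200")` should return `("7.17.3", 7)` for the node shown above; a connection error at this point would suggest a reachability problem from that environment rather than a problem with the index itself.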

In the Search tab of the Nextcloud web interface, I now get results.

But do these errors and the NoNodesAvailableException cause problems in the background?

Thanks

@fred-gb
Author

fred-gb commented May 19, 2022

😞

I uploaded a new PDF to Nextcloud and the initial error came back...

😭

@aaronriedel

Same error with Elasticsearch 8.2.0:

An unhandled exception has been thrown:
Error: Call to a member function getContent() on string in /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php:814
Stack trace:
#0 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(747): OCA\Files_FullTextSearch\Service\FilesService->updateContentFromFile('*** sensitive p...', Object(OC\Files\Node\File))
#1 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(727): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocumentFromFile(Object(OCA\Files_FullTextSearch\Model\FilesDocument), Object(OC\Files\Node\File))
#2 /var/www/html/apps/files_fulltextsearch/lib/Service/FilesService.php(618): OCA\Files_FullTextSearch\Service\FilesService->updateFilesDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#3 /var/www/html/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(288): OCA\Files_FullTextSearch\Service\FilesService->generateDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#4 /var/www/html/apps/fulltextsearch/lib/Service/IndexService.php(315): OCA\Files_FullTextSearch\Provider\FilesProvider->fillIndexDocument(Object(OCA\Files_FullTextSearch\Model\FilesDocument))
#5 /var/www/html/apps/fulltextsearch/lib/Service/IndexService.php(195): OCA\FullTextSearch\Service\IndexService->indexDocuments(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Array, Object(OCA\FullTextSearch\Model\IndexOptions))
#6 /var/www/html/apps/fulltextsearch/lib/Command/Index.php(416): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_Elasticsearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'aaron', Object(OCA\FullTextSearch\Model\IndexOptions))
#7 /var/www/html/apps/fulltextsearch/lib/Command/Index.php(279): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#8 /var/www/html/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /var/www/html/core/Command/Base.php(168): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /var/www/html/3rdparty/symfony/console/Application.php(1009): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /var/www/html/3rdparty/symfony/console/Application.php(273): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /var/www/html/3rdparty/symfony/console/Application.php(149): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /var/www/html/lib/private/Console/Application.php(211): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#14 /var/www/html/console.php(99): OC\Console\Application->run()
#15 /var/www/html/occ(11): require_once('/var/www/html/c...')
#16 {main}

Maybe Tesseract is the issue?

@fred-gb
Author

fred-gb commented May 19, 2022

Elasticsearch 8 and Nextcloud are not compatible, as I have seen in another thread.

It seems to be a Tesseract issue.

@aaronriedel

I disabled Tesseract in Nextcloud and indexing works. I did not know that Elasticsearch 8 is not compatible. I downgraded to 7 again, but file content is still saved as seemingly useless gibberish; that is a problem for another issue.

@martadinata666

Similar to #654?

@fred-gb
Author

fred-gb commented May 23, 2022

Hello,
Yes, similar to that. Maybe I should explain.

When I scan a physical document and drop it into NC, OCR is needed for it to be added to the Search app.

So I need Tesseract OCR.

Do you have another solution for OCR?

Thanks

@apg1980

apg1980 commented May 31, 2022

Please also look here: #702 (comment)

@ArtificialOwl
Member

It seems the issue comes from files_fulltextsearch_tesseract.
Unfortunately, I cannot reproduce the issue.
If anyone has a non-private PDF document that triggers this getContent() error, please send it to me: maxence@nextcloud.com

@ArtificialOwl
Member

I was not able to reproduce the issue using the PDF that was sent. However, I added a check on the validity of the item returned by files_fts_tesseract. While it does not make the document indexable, it will at least no longer break the process.

nextcloud/files_fulltextsearch#191
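
The kind of check described above can be illustrated with a short sketch. It is written in Python for illustration only and is not the PHP change in nextcloud/files_fulltextsearch#191; `ExtractedDocument` and `content_or_none` are hypothetical names. The idea is to verify the type of whatever the OCR step returns before calling a method on it, so that an unexpected string is skipped instead of raising an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedDocument:
    """Hypothetical stand-in for the object an OCR step is expected to return."""
    content: str

    def get_content(self) -> str:
        return self.content


def content_or_none(item: object) -> Optional[str]:
    """Return the extracted text, or None when the OCR step returned something unusable."""
    if not isinstance(item, ExtractedDocument):
        # A plain string or None came back instead of a document object:
        # skip the content rather than calling get_content() on it.
        return None
    return item.get_content()
```

With such a guard, a file whose OCR result is unusable is left without extracted content, while the rest of the indexing run can continue.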
