add khmer lang #2012

Merged: 12 commits, Apr 5, 2023


xshadowlegendx (Contributor) commented Mar 16, 2023

This PR contains:

  • an update to Solr in docker-compose.yml to enable the analysis-extras module for the ICU tokenizer
  • the Khmer language, added according to the guide, with the step for adding a test skipped
  • an update to the Solr section of the docs at https://docspell.org/docs/install/prereq/, noting that the analysis-extras module is also needed for text analysis to work for Khmer
  • an update to joex.dockerfile that adds a Khmer font and the Tesseract Khmer model

xshadowlegendx (Contributor Author) commented Mar 16, 2023

I am currently trying to run the tests with sbt testOnly docspell.analysis.date.DateFindTest, but I am hitting timeout errors in the migration tests, although sbt Test/compile completed without errors.

2023.03.16 23:26:21:0000 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to rollback transaction
2023.03.16 23:26:21:0001 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to restore autocommit to original value for connection
2023.03.16 23:26:21:0002 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to rollback transaction
2023.03.16 23:26:21:0003 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to restore autocommit to original value for connection
2023.03.16 23:26:28:0000 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to rollback transaction
2023.03.16 23:26:28:0001 [io-comp...] [ERROR] org.flywaydb.core.internal.jdbc.TransactionalExecutionTemplate - Unable to restore autocommit to original value for connection
2023.03.16 23:26:28:0002 [io-comp...] [ERROR] org.flywaydb.database.mysql.MySQLNamedLockTemplate - Unable to release MySQL named lock: Flyway--562162009
==> X docspell.store.migrate.MigrateTest.postgres empty schema migration  31.126s java.util.concurrent.TimeoutException: Future timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait0(Promise.scala:248)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:261)
    at scala.concurrent.Await$.$anonfun$result$1(package.scala:201)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:62)
    at scala.concurrent.Await$.result(package.scala:124)
    at munit.internal.PlatformCompat$.$anonfun$waitAtMost$1(PlatformCompat.scala:21)
    at scala.util.Try$.apply(Try.scala:210)
    at munit.internal.PlatformCompat$.waitAtMost(PlatformCompat.scala:21)
    at munit.FunSuite.waitForCompletion(FunSuite.scala:51)
    at munit.FunSuite.$anonfun$test$1(FunSuite.scala:37)
    at munit.MUnitRunner.$anonfun$runTestBody$1(MUnitRunner.scala:296)
==> X docspell.store.migrate.MigrateTest.mariadb empty schema migration  57.6s java.util.concurrent.TimeoutException: Future timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait0(Promise.scala:248)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:261)
    at scala.concurrent.Await$.$anonfun$result$1(package.scala:201)
    at cats.effect.unsafe.WorkerThread.blockOn(WorkerThread.scala:676)
    at scala.concurrent.Await$.result(package.scala:124)
    at munit.internal.PlatformCompat$.$anonfun$waitAtMost$1(PlatformCompat.scala:21)
    at scala.util.Try$.apply(Try.scala:210)
    at munit.internal.PlatformCompat$.waitAtMost(PlatformCompat.scala:21)
    at munit.FunSuite.waitForCompletion(FunSuite.scala:51)
    at munit.FunSuite.$anonfun$test$1(FunSuite.scala:37)
    at munit.MUnitRunner.$anonfun$runTestBody$1(MUnitRunner.scala:296)

Any ideas on where to increase the timeout, or on whether something went wrong to make it take that long?

eikek (Owner) commented Mar 16, 2023

Hi @xshadowlegendx thank you very much! I don't have time to take a deeper look until the weekend, but it looks well done! I don't know why the tests can't get a lock for mariadb… really strange. But that is not related to your change, so it's fine.

eikek linked an issue Mar 16, 2023 that may be closed by this pull request

eikek (Owner) commented Mar 16, 2023

For the scala-format errors, you can run sbt fix to reformat everything.

xshadowlegendx (Contributor Author) commented Mar 17, 2023

Hello @eikek, OK. I will run sbt fix to reformat and run the tests again. I will also test with Solr, using its text analysis to make sure it tokenizes and segments Khmer words properly, and check that queries work as well. Then I will submit the PR for review.

xshadowlegendx (Contributor Author) commented

Hello @eikek, I ran the tests again. I think the timeout was caused by my computer running slowly: after a restart the timeout no longer occurs, but now I get the error shown at the end of this output:

❯ sbt testOnly docspell.analysis.date.DateFindTest
[info] welcome to sbt 1.8.2 (Eclipse Adoptium Java 17.0.2)
[info] loading settings for project docspell-build from build.sbt,plugins.sbt ...
[info] loading project definition from /Users/xanonx/experiments/docspell/project
[info] loading settings for project root from build.sbt,version.sbt ...
[info] resolving key references (37056 settings) ...
[info] set current project to docspell-root (in build file:/Users/xanonx/experiments/docspell/)
[info] Compiling css stylesheets…
[info] Copy webjar resources from 1 files/directories.
[info] Running npx postcss /Users/xanonx/experiments/docspell/modules/webapp/src/main/styles/index.css -o /Users/xanonx/experiments/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.41.0-SNAPSHOT/css/styles.css --env development
[info] Compile elm files ...
[info] Running elm make --debug --output /Users/xanonx/experiments/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.41.0-SNAPSHOT/docspell-app.js /Users/xanonx/experiments/docspell/modules/webapp/src/main/elm/Main.elm
[info] NerModels: Filtering artifacts...
[info] Produced query js file: /Users/xanonx/experiments/docspell/modules/query/js/target/scala-2.13/docspell-query-opt.js
[info] Copy webjar resources from 1 files/directories.
docspell.logging.LazyMapTest:
  + updated value lazy 0.118s
  + get doesn't evaluate value 0.01s
docspell.logging.CapturedLoggerTest:
  + capture data 0.538s
docspell.totp.KeyTest:
  + generate and read in key 0.147s
  + generate key 0.438s
  + encode/decode json 0.086s
docspell.jsonminiq.FormatTest:
  + field selects 0.088s
  + array select 0.025s
  + anyMatch / allMatch 0.948s
  + and / or 0.004s
docspell.jsonminiq.JsonMiniQueryTest:
  + combine values on same level 0.312s
  + combine values from different levels 0.002s
  + filter single value 0.529s
  + combine filters 0.006s
  + combine fields and filter 0.01s
  + thenAny combine via or 0.003s
  + thenAll combine via and (1) 0.001s
  + thenAll combine via and (2) 0.006s
  + test for null/not null 0.208s
  + more real expressions 0.003s
  + examples 0.014s
docspell.jsonminiq.ParserTest:
  + field selects 1.08s
  + array select 0.027s
  + values 0.015s
  + anyMatch / allMatch 0.005s
  + and / or 0.003s
docspell.totp.TotpTest:
  + generate password 0.003s
  + generate stream 0.742s
  + generate HmacSHA1 with 6 characters 0.004s
  + generate HmacSHA1 with 8 characters 0.005s
  + generate HmacSHA256 with 6 characters 0.002s
  + generate HmacSHA256 with 8 characters 0.001s
  + generate HmacSHA512 with 6 characters 0.006s
  + generate HmacSHA512 with 8 characters 0.001s
  + check password at same time 0.003s
  + check password 15s later 0.001s
  + check password 29s later 0.001s
  + check password 31s later (too late) 0.001s
[info] Compiling ...
[info]
[info] Success!
[info]
[info]     Main ───> /Users/xanonx/experiments/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.41.0-SNAPSHOT/docspell-app.js
docspell.query.internal.AttrParserTest:
  + string attributes 0.517s
  + date attributes 0.002s
  + all attributes parser 0.004s
docspell.query.internal.ExprParserTest:
  + simple expr 0.733s
  + and 0.003s
  + or 0.001s
  + tag list inside and/or 0.002s
  + nest and/ with simple expr 0.002s
docspell.query.internal.ItemQueryParserTest:
  + reduce ands 0.745s
  + reduce ors 0.002s
  + reduce and/or 0.001s
  + reduce inner and/or 0.003s
  + omit and-parens around root structure 0.001s
  + throw if query is empty 0.003s
  + splice inner and nodes 0.002s
  + splice inner or nodes 0.003s
  + f.id:name=value 0.001s
docspell.query.internal.MacroParserTest:
  + recognize names shortcut 0.161s
docspell.query.internal.BasicParserTest:
  + single string values 0.005s
  + string list values 0.033s
  + stringvalue 0.002s
docspell.query.internal.OperatorParserTest:
  + operator values 0.002s
  + tag operators 0.0s
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
docspell.query.FulltextExtractTest:
  + find fulltext as root 0.006s
  + find no fulltext 0.003s
  + find fulltext within and 0.008s
  + too many fulltext searches 0.003s
  + wrong fulltext search position 0.002s
[info] No tests to run for Test / testOnly
docspell.query.internal.DateParserTest:
  + local date string 0.791s
  + local date millis 0.004s
  + local date 0.001s
  + local partial date 0.01s
  + date calcs 0.003s
  + period 0.007s
docspell.query.internal.SimpleExprParserTest:
  + string expr 0.746s
  + date expr 0.027s
  + exists expr 0.16s
  + fulltext expr 0.001s
  + category expr 0.001s
  + custom field 0.0s
  + tag id expr 0.002s
  + simple expr 0.004s
[info] Passed: Total 15, Failed 0, Errors 0, Passed 15
[info] Passed: Total 3, Failed 0, Errors 0, Passed 3
[info] Passed: Total 20, Failed 0, Errors 0, Passed 20
docspell.query.internal.ExprStringTest:
  + macro: name 0.32s
  + macro: year 0.004s
  + macro: daterange 0.011s
  + generate expr and parse it 0.96s
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for loggingScribe / Test / testOnly
[info] Passed: Total 46, Failed 0, Errors 0, Passed 46
docspell.common.util.SignUtilTest:
  + create and validate 0.024s
docspell.common.NerLabelSpanTest:
  + build 0.009s
docspell.common.UrlMatcherTest:
  + it should match patterns 0.839s
docspell.common.MetaProposalListTest:
  + flatten retains order of candidates 0.84s
  + sort by weights 0.002s
  + sort by weights: unset is last 0.001s
  + insert second 0.004s
  + insert second, remove duplicates 0.001s
docspell.common.GlobTest:
  + literals 0.67s
  + single wildcards 1 0.01s
  + single wildcards 2 0.004s
  + single parsing 0.005s
  + with splitting 0.017s
  + asString 0.004s
  + simple matches 0.003s
  + matchFilenameOrPath 0.009s
  + anyglob 0.011s
  + case insensitive 0.011s
docspell.common.FileNameTest:
  + make filename 0.007s
  + with part 0.003s
  + with extension 0.002s
docspell.common.LenientUriTest:
  + do not throw on invalid hex decoding 0.002s
  + percent-decode invalid codes 0.0s
  + percent-decode valid codes 0.043s
  + parse with trailing slash 0.01s
[info] Copy webjar resources from 1 files/directories.
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for pubsubApi / Test / testOnly
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for notificationApi / Test / testOnly
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for ftsclient / Test / testOnly
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for queryJS / Test / testOnly
docspell.common.util.DirectoryTest:
  + unwrap directory when non empty 0.447s
  + unwrap directory when not empty repeat 0.075s
  + unwrap nested directory 0.061s
  + do nothing on empty directory 0.017s
  + do nothing when directory contains more than one entry 0.02s
  + do nothing when directory contains more than one entry (2) 0.016s
docspell.common.bc.BackendCommandTest:
  + encode json 1.409s
  + decode case insensitive keys 0.204s
docspell.common.MimeTypeTest:
  + asString 0.017s
  + parse without params 0.021s
  + parse with charset 0.002s
  + parse with charset and more params 0.001s
  + parse without charset but params 0.001s
  + parse some stranger values 0.002s
  + parse invalid mime types 0.002s
  + read own asString 2.557s
[info] Passed: Total 41, Failed 0, Errors 0, Passed 41
docspell.oidc.StateParamTest:
  + generate 1.474s
  + fromString 0.01s
docspell.scheduler.CountingSchemeSpec:
  + counting 0.028s
docspell.files.TikaMimetypeTest:
  + detect text/plain 2.36s
  + detect image/jpeg 0.012s
  + detect image/png 0.005s
  + detect application/json 0.005s
  + detect application/json-1 0.024s
  + detect image/jpeg wrong advertised 0.008s
  + just filename 0.003s
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2
docspell.files.ImageSizeTest:
  + get sizes from input-stream 0.304s
  + get sizes from stream 0.178s
docspell.files.ZipTest:
  + unzip 1.461s
  + unzip directories and files 0.047s
docspell.convert.ConversionTest:
  + convert to pdf 0.031s
  + convert image to pdf and txt 0.022s
  + do not convert image bombs 0.029s
docspell.addons.AddonRunnerTest:
  + firstSuccessful must stop on first success 1.207s
docspell.addons.AddonExecutorTest:
  + firstSuccessful must try with next on error 0.046s
  + do not retry on decoding errors 0.008s
  + try on errors but stop on decoding error 0.009s
docspell.convert.RemovePdfEncryptionTest:
Right(AddonOutput(List(),List(),List()))
Right(AddonOutput(List(),List(ItemFile(Ident(qZDnyGIAJsXr),Map(HPFvIDib6eA -> HPFvIDib6eA.txt),Map(HPFvIDib6eA -> HPFvIDib6eA.pdf),Map(),List())),List()))
docspell.addons.AddonOutputTest:
  + decode empty object 0.259s
  + decode sample output 0.126s
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for ftssolr / Test / testOnly
docspell.addons.AddonMetaTest:
docspell.addons.AddonArchiveTest:
  + read meta from zip file 0.706s
  + read meta from directory 0.135s
  + Read archive from directory 0.705s
  + Read archive from zip 0.03s
  + Read generated addon from path 0.144s
  + Read generated addon from zip 0.039s
  + Read minimal addon from path 0.034s
  + Read minimal addon from zip 0.017s
  + Read archive from zip file 0.023s
2023.03.17 11:30:03:0000 [io-comp...] [ERROR] docspell.addons.runner.RunnerUtil.runAddonCommand:127 - Addon addon1-1.0 returned non-zero: 1 (addon-name: "addon1", addon-version: "1.0")
2023.03.17 11:30:03:0000 [io-comp...] [ERROR] docspell.addons.runner.RunnerUtil.runAddonCommand:127 - Addon addon1-1.0 returned non-zero: 1 (addon-name: "addon1", addon-version: "1.0")
  + select docker if Dockerfile exists 0.109s
  + select nix-flake if flake.nix exists 0.011s
  + select nix-flake and docker 0.007s
  + fail early if configured so 1.0s
  + do not stop after failing addons 0.213s
  + combine outputs 0.161s
docspell.extract.ocr.TextExtractionSuite:
==> i docspell.extract.ocr.TextExtractionSuite.extract english pdf ignored 0.001s
==> i docspell.extract.ocr.TextExtractionSuite.extract german pdf ignored 0.001s
  + have encrypted pdfs 0.916s
  + decrypt pdf 0.718s
  + decrypt pdf with multiple passwords 0.361s
  + remove protection 0.192s
  + read unprotected pdf 0.112s
  + decrypt with multiple passwords, stop on first 0.155s
  + return input stream if nothing helps 0.056s
docspell.extract.pdfbox.PdfMetaDataTest:
  + split keywords on comma 0.032s
  + split keywords on semicolon 0.001s
  + split keywords on comma and semicolon 0.001s
docspell.extract.odf.OdfExtractTest:
  + test extract from odt 0.843s
[info] Passed: Total 1, Failed 0, Errors 0, Passed 1
[info] Passed: Total 11, Failed 0, Errors 0, Passed 11
[info] Passed: Total 21, Failed 0, Errors 0, Passed 21
docspell.extract.rtf.RtfExtractTest:
  + extract text from rtf using java input-stream 1.451s
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for joexapi / Test / testOnly
[info] No tests to run for restapi / Test / testOnly
docspell.extract.pdfbox.PdfboxExtractTest:
  + extract text from text PDFs by inputstream 1.383s
  + extract text from text PDFs via Stream 0.095s
  + extract text from image PDFs 0.017s
  + extract metadata from pdf 0.178s
docspell.convert.extern.ExternConvTest:
  + convert html to pdf 0.39s
  + convert office to pdf 0.027s
  + convert image to pdf 7.469s
docspell.extract.poi.PoiExtractTest:
  + extract text from ms office files 4.638s
docspell.store.qb.impl.DSLTest:
  + delete 1.689s
  + and 0.001s
docspell.store.qb.impl.SelectBuilderTest:
  + basic fragment 1.74s
docspell.store.qb.impl.ConditionBuilderTest:
  + reduce ands 0.004s
  + reduce ors 0.004s
  + mixed and / or 0.001s
  + reduce double not 0.0s
  + reduce triple not 0.001s
  + reduce not to unit 0.001s
  + remove units in and/or 0.0s
  + unwrap single and/ors 0.002s
  + reduce empty and/or 0.001s
docspell.store.queries.QueryWildcardTest:
  + replace prefix 0.002s
  + replace suffix 0.001s
  + replace both sides 0.001s
  + do not use multiple wildcards 0.002s
docspell.store.qb.QueryBuilderTest:
  + simple 0.013s
docspell.store.generator.ItemQueryGeneratorTest:
  + basic test 0.566s
  + !conc:* 0.008s
  + attach.id with wildcard 0.011s
  + attach.id with equals 0.001s
[error]
[error] warn - The `@variants` directive has been deprecated in Tailwind CSS v3.0.
[error] warn - Use `@layer utilities` or `@layer components` instead.
[error] warn - https://tailwindcss.com/docs/upgrade-guide#replace-variants-with-layer
docspell.extract.pdfbox.PdfboxPreviewTest:
  + extract first page image from PDFs 7.038s
[info] Styles built at /Users/xanonx/experiments/docspell/modules/webapp/target/scala-2.13/resource_managed/main/META-INF/resources/webjars/docspell-webapp/0.41.0-SNAPSHOT
[info] Passed: Total 13, Failed 0, Errors 0, Passed 13
docspell.notification.impl.context.TagsChangedCtxTest:
  + create tags changed message 0.688s
  + create tags changed message-1 0.028s
[info] Passed: Total 13, Failed 0, Errors 0, Passed 11, Skipped 2
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2
docspell.backend.auth.AuthTokenTest:
  + validate 0.79s
  + signature 0.03s
docspell.config.EnvConfigTest:
  + convert underscores 0.034s
  + insert docspell keys 0.387s
  + find default values from reference.conf 0.002s
  + discard non docspell keys 0.004s
docspell.config.ValidationTest:
  + thread value through validations 0.96s
  + fail if there is at least one error 0.322s
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2
[info] Passed: Total 6, Failed 0, Errors 0, Passed 6
[info] compiling 1 Scala source to /Users/xanonx/experiments/docspell/modules/restserver/target/scala-2.13/classes ...
2023.03.17 11:30:15:747 pool-6-thread-10 INFO org.testcontainers.utility.ImageNameSubstitutor
    Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor')
2023.03.17 11:30:16:108 pool-6-thread-10 INFO org.testcontainers.dockerclient.DockerClientProviderStrategy
    Loaded org.testcontainers.dockerclient.UnixSocketClientProviderStrategy from ~/.testcontainers.properties, will try it first
[info] compiling 1 Scala source to /Users/xanonx/experiments/docspell/modules/joex/target/scala-2.13/classes ...
2023.03.17 11:30:19:591 pool-6-thread-10 INFO org.testcontainers.dockerclient.DockerClientProviderStrategy
    Found Docker environment with local Unix socket (unix:///var/run/docker.sock)
2023.03.17 11:30:19:604 pool-6-thread-10 INFO org.testcontainers.DockerClientFactory
    Docker host IP address is localhost
    Connected to docker:
      Server Version: 20.10.14
      API Version: 1.41
      Operating System: Docker Desktop
      Total Memory: 3934 MB
2023.03.17 11:30:20:123 pool-6-thread-10 INFO 🐳 [testcontainers/ryuk:0.3.4]
    Creating container for image: testcontainers/ryuk:0.3.4
2023.03.17 11:30:20:416 pool-6-thread-10 INFO org.testcontainers.utility.RegistryAuthLocator
    Credential helper/store (docker-credential-desktop) does not have credentials for https://index.docker.io/v1/
2023.03.17 11:30:22:067 pool-6-thread-10 INFO 🐳 [testcontainers/ryuk:0.3.4]
    Container testcontainers/ryuk:0.3.4 is starting: 5d78a0f8e0bf32c0392ccb43e9a1d43d30613c8282d072c33c535cdd3544d4fa
2023.03.17 11:30:23:971 pool-6-thread-10 INFO 🐳 [testcontainers/ryuk:0.3.4]
    Container testcontainers/ryuk:0.3.4 started in PT4.17453S
2023.03.17 11:30:24:035 pool-6-thread-10 INFO org.testcontainers.utility.RyukResourceReaper
    Ryuk started - will monitor and terminate Testcontainers containers on JVM exit
2023.03.17 11:30:24:037 pool-6-thread-10 INFO org.testcontainers.DockerClientFactory
    Checking the system...
    ✔︎ Docker server version should be at least 1.6.0
2023.03.17 11:30:24:042 pool-6-thread-11 INFO 🐳 [postgres:14]
    Creating container for image: postgres:14
2023.03.17 11:30:24:042 pool-6-thread-10 INFO 🐳 [postgres:14]
    Creating container for image: postgres:14
    Container postgres:14 is starting: 92ae70eb91bff4cdbff67681cf952269bdb1a8f098f0d32c8377db117eefff63
2023.03.17 11:30:24:484 pool-6-thread-11 INFO 🐳 [postgres:14]
    Container postgres:14 is starting: 5342298e7feac2c228205d93190fadd75d1b0930cfc5b245cad9f5750cba2df6
docspell.pubsub.naive.NaivePubSubTest:
  + local publish receives message 9.568s
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for webapp / Test / testOnly
2023.03.17 11:30:39:327 pool-6-thread-11 INFO 🐳 [postgres:14]
    Container postgres:14 started in PT15.285896S
docspell.analysis.date.DateFindTest:
  + find simple dates 6.387s
  + skip invalid dates 0.017s
  + different date formats 0.014s
  + more english variants 0.015s
  + find latvian dates 0.028s
  + find japanese dates 0.02s
  + find spanish dates 0.02s
  + find lithuanian dates 0.048s
  + find polish dates 0.029s
  + find estonian dates 0.034s
  + find ukrainian dates 0.088s
docspell.analysis.contact.ContactAnnotateSpec:
  + find email 0.054s
docspell.analysis.classifier.StanfordTextClassifierSuite:
  + learn from data 0.519s
  + run classifier 0.026s
docspell.analysis.split.TestSplitterSpec:
  + simple splitting 0.01s
docspell.ftspsql.MigrationTest:
  + create schema 3.947s
docspell.ftspsql.PsqlFtsClientTest:
  + local publish to different topic doesn't receive 4.638s
  + receive messages remotely 5.129s
  + send messages remotely 4.522s
  + do not receive remote message from other topic 3.094s
[info] Passed: Total 5, Failed 0, Errors 0, Passed 5
docspell.joex.updatecheck.UpdateCheckTest:
  + parse example response 1.443s
  + snapshot is matches 0.008s
  + newer version does not match  0.002s
  + same version matches 0.001s
docspell.joex.analysis.NerFileTest:
  + create valid case insensitive patterns 1.879s
docspell.joex.extract.JsoupSanitizerTest:
  + keep interesting tags and attributes 0.564s
[info] Passed: Total 6, Failed 0, Errors 0, Passed 6
  + insert data into index 4.24s
  + clear index 0.688s
  + clear index by collective 0.745s
  + search by query 0.869s
[info] Passed: Total 5, Failed 0, Errors 0, Passed 5
docspell.restserver.http4s.ContentDispositionTest:
  + allow rfc2231 parameters with charset 6.132s
  + allow rfc2231 parameters with charset and language 0.002s
  + allow rfc2231 parameters without charset and language 0.003s
  + allow rfc2231 parameters with quoted strings 0.003s
  + allow utf8 bytes in filename 0.008s
  + unicode in filename with original header impl and filename* 0.037s
  + allow simple values 0.006s
[info] Passed: Total 7, Failed 0, Errors 0, Passed 7
docspell.store.migrate.MigrateTest:
  + postgres empty schema migration 17.863s
docspell.analysis.nlp.BaseCRFAnnotatorSuite:
  + find english ner labels 10.806s
  + find german ner labels 23.556s
docspell.analysis.nlp.StanfordNerAnnotatorSuite:
  + find english ner labels 13.318s
  + find german ner labels 28.063s
  + regexner-only annotator 0.263s
[info] Passed: Total 20, Failed 0, Errors 0, Passed 20
docspell.store.fts.TempFtsOpsTest:
  + create temporary table 0.436s
  + mariadb empty schema migration 23.667s
  + h2 empty schema migration 2.017s
  + h2 upgrade db from 0.24.0 2.088s
docspell.scheduler.impl.QJobTest:
  + selectNextGroup on empty table (PostgreSQL) 0.162s
  + query items sql 5.317s
[info] Passed: Total 27, Failed 0, Errors 0, Passed 27
  + selectNextGroup on empty table (MariaDB) 0.036s
  + selectNextGroup on empty table (H2) 0.058s
  + set group must insert or update (PostgreSQL) 0.027s
  + set group must insert or update (MariaDB) 0.021s
  + set group must insert or update (H2) 0.016s
  + selectNextGroup should return first group on initial state (PostgreSQL) 0.111s
  + selectNextGroup should return first group on initial state (MariaDB) 0.063s
  + selectNextGroup should return first group on initial state (H2) 0.035s
  + selectNextGroup should return second group on subsequent call (PostgreSQL) 0.075s
  + selectNextGroup should return second group on subsequent call (MariaDB) 0.068s
  + selectNextGroup should return second group on subsequent call (H2) 0.021s
  + selectNextGroup should return first group on subsequent call (PostgreSQL) 0.068s
  + selectNextGroup should return first group on subsequent call (MariaDB) 0.063s
  + selectNextGroup should return first group on subsequent call (H2) 0.024s
[info] Passed: Total 15, Failed 0, Errors 0, Passed 15
[success] Total time: 123 s (02:03), completed Mar 17, 2023, 11:31:49 AM
[error] Expected ID character
[error] Not a valid command: docspell (similar: shell, oldshell)
[error] Expected project ID
[error] Expected configuration
[error] Expected ':'
[error] Expected key
[error] Not a valid key: docspell (similar: doc, daemonShell)
[error] docspell.analysis.date.DateFindTest
[error]

xshadowlegendx (Contributor Author) commented Mar 17, 2023

Hello @eikek, I ran PLATFORMS=linux/amd64 ./build.sh 0.40.0 and it was successful, but how do I test the changes I have made? Does the command only test the Docker image build? The Dockerfiles seem to fetch joex and restserver from the upstream releases. So, to build the images with my changes, I would have to either create releases in the forked repo and set joex_url, restserver_url and version accordingly, or build the zips and modify the Dockerfiles to use the local zips when building the images for testing on my machine.

So the question is: how can I test my changes locally, either using Docker or by building docspell-restserver-x.zip and docspell-joex-x.zip?

edit0: Oh sorry, I forgot to read this page, which explains how to make those zips.

edit1: Updated the comment to include the approach of building the zips locally and then modifying the Dockerfiles to build locally for testing.

xshadowlegendx (Contributor Author) commented Mar 17, 2023

Hello @eikek, I have tested the changes by

  • logging into the web app, going to settings and setting the Khmer language, then seeing ocrmypdf correctly pass -l khm when processing documents
  • going into Solr to look at content_kh and seeing the ICU tokenizer applied to the type, then checking text analysis in Solr to make sure Khmer text is tokenized and segmented correctly using the content_kh type
  • searching for some Khmer text to see documents getting filtered in the web app

[Screenshots: check-ocrmypdf-correct-supply-khm-lang-flag-to-tesseract, test-icu-tokenizer-being-used-on-content-kh, simple-khmer-searching-test, before-searching]

edit0: The documents used for testing were a docx, a text PDF, and a PDF with scanned images.

xshadowlegendx (Contributor Author) commented

But why does this docx show broken characters, while searches show the content being parsed correctly from the document?

[Screenshots: 2023-03-17 at 19 56 26, 2023-03-17 at 19 56 41]

eikek (Owner) commented Mar 18, 2023

Thank you very much for this comprehensive testing @xshadowlegendx !

> ❯ sbt testOnly docspell.analysis.date.DateFindTest

The issue here is that you need to pass it to sbt as one single argument, so just quote it: sbt "testOnly docspell.analysis.date.DateFindTest". Alternatively, drop into an sbt shell by running sbt without any commands and then run testOnly ….
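To illustrate why the unquoted form fails, here is a minimal shell sketch. The helper count_args is hypothetical and only counts the arguments a program would receive; it does not invoke sbt. Without quotes the shell hands over two separate arguments, which matches the log above: testOnly ran without a test name, and docspell.analysis.date.DateFindTest was then rejected as "Not a valid command".

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() {
  echo "$#"
}

# Unquoted: the shell splits on whitespace, so two arguments are passed.
count_args testOnly docspell.analysis.date.DateFindTest

# Quoted: the whole string is passed as a single argument.
count_args "testOnly docspell.analysis.date.DateFindTest"
```

The first call prints 2 and the second prints 1, which is the difference sbt sees between the two invocations.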

> PLATFORMS=linux/amd64 ./build.sh 0.40.0

This command builds new Docker images, but it downloads the prebuilt zip files from GitHub using the given version (here 0.40.0). I guess you found that out already. The easiest way to test is to build the zip files and modify the Dockerfiles to use them. If you just want to test docspell without Docker, you can drop into an sbt shell and run reStart, which starts docspell from the sources.

> But why does this docx show broken characters, while searches show the content being parsed correctly from the document?

I suspect there is a font missing on the system. This is always a bit tricky to find out.

SOLR:
I saw you added something to the solr command in the dockerfiles: -f -Dsolr.modules=analysis-extras. Could you explain a bit what this does? Does this mean people not using Docker also need to use this for their SOLR?

xshadowlegendx (Contributor Author) commented

Hello @eikek, the -f flag starts Solr in foreground mode. -Dsolr.modules=analysis-extras enables the analysis-extras module, which is required for the ICU tokenizer to work; without it, Khmer tokenization and segmentation will not work. The ICU tokenizer for the Khmer language is documented in the official Solr docs mentioned here (Solr 9 at the time of writing).
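For a Solr installation outside Docker, the same system property could presumably be passed when starting Solr. The sketch below is only a dry run: it assembles and prints a command line instead of starting Solr. The function name solr_start_cmd and its arguments are assumptions for illustration, and the exact launcher invocation may differ between Solr versions and installations; the -f and -Dsolr.modules=analysis-extras arguments are the ones discussed above.

```shell
# Hypothetical helper: prints a Solr start command instead of running it.
# $1: path to the solr launcher script (assumed default: "solr" on the PATH)
# $2: comma-separated list of modules to enable (default: analysis-extras)
solr_start_cmd() {
  solr_bin="${1:-solr}"
  modules="${2:-analysis-extras}"
  # -f keeps Solr in the foreground; -Dsolr.modules enables optional modules.
  echo "$solr_bin start -f -Dsolr.modules=$modules"
}

solr_start_cmd
```

Printing the command first makes it easy to compare against the command used in docker-compose.yml before changing a running installation.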

xshadowlegendx (Contributor Author) commented

As for the missing font, I will try to investigate it.

eikek (Owner) commented Mar 18, 2023

> Hello @eikek, the -f flag starts Solr in foreground mode. -Dsolr.modules=analysis-extras enables the analysis-extras module, which is required for the ICU tokenizer to work; without it, Khmer tokenization and segmentation will not work. The ICU tokenizer for the Khmer language is documented in the official Solr docs mentioned here (Solr 9 at the time of writing).

Thank you! I suspected it would add optional code to SOLR. The consequence is now that this affects every user that runs SOLR not via docker. Perhaps it would be good to mention it in the docs, maybe here.

xshadowlegendx (Contributor Author) commented

Yes, I will help update the docs as well.

xshadowlegendx (Contributor Author) commented

Hello @eikek, I have edited the docs, but how do I see my changes locally?

eikek (Owner) commented Mar 20, 2023

> Hello @eikek, I have edited the docs, but how do I see my changes locally?

There is a bit in the website readme here. The scripts folder in the website dir has some shortcuts. You can use run-elm.sh to build the js and run-styles.sh to build the css. Then do a zola serve to show the page at localhost:1111

xshadowlegendx (Contributor Author) commented

OK, thanks @eikek, I will try to run it.

xshadowlegendx (Contributor Author) commented Mar 21, 2023

Hello @eikek, running zola serve returns this:

Error: Failed to serve the site
Error: Failed to render content of /Users/xanonx/github/docspell/website/site/content/docs/tools/smtpgateway.md
Error: Reason: Failed to render incl_conf shortcode
Error: Reason: Failed to render 'shortcodes/incl_conf.md'
Error: Reason: Function call 'load_data' failed
Error: Reason: `load_data`: /Users/xanonx/github/docspell/website/site/templates/shortcodes/sample-exim.conf doesn't exist

and zola build gives me this:

Error: Failed to build the site
Error: Failed to render content of /Users/xanonx/experiments/docspell/website/site/content/docs/configure/defaults.md
Error: Reason: Failed to render incl_conf shortcode
Error: Reason: Failed to render 'shortcodes/incl_conf.md'
Error: Reason: Function call 'load_data' failed
Error: Reason: `load_data`: /Users/xanonx/experiments/docspell/website/site/templates/shortcodes/server.conf doesn't exist

@eikek
Owner

eikek commented Mar 22, 2023

@xshadowlegendx You might need to use an older version; 0.14.1 should work (that is the one used by CI).

@xshadowlegendx
Contributor Author

hello @eikek, is that the elm version?

@eikek
Owner

eikek commented Mar 24, 2023

Sorry for the delay! I meant the version of zola. The errors are coming from zola, if I understand correctly. Newer versions sometimes add checks or other changes that require revisiting the pages. Using 0.14.1 should work.

@xshadowlegendx
Contributor Author

xshadowlegendx commented Mar 27, 2023

I tried 0.14.1 and it still outputs the same error. I then tried running nix-shell --run 'cd site && zola serve', but that fails with an error about inotify-tools-3.21.9.5 not being supported on x86_64-darwin. So I installed zola directly, went into the site dir, and ran zola serve, which throws the same error I asked about earlier:

Error: Failed to serve the site
Error: Failed to render content of /Users/xanonx/github/docspell/website/site/content/docs/tools/smtpgateway.md
Error: Reason: Failed to render incl_conf shortcode
Error: Reason: Failed to render 'shortcodes/incl_conf.md'
Error: Reason: Function call 'load_data' failed
Error: Reason: `load_data`: /Users/xanonx/github/docspell/website/site/templates/shortcodes/sample-exim.conf doesn't exist

I think the error is about missing files? I'm not sure, but the output seems to indicate so.

@xshadowlegendx
Contributor Author

So I checked for the files the program expects and they weren't there. They seem to be config files, so I copied tools/exim/exim.conf to /Users/xanonx/github/docspell/website/site/templates/shortcodes/sample-exim.conf and ran zola serve again. Then errors about missing server.conf and joex.conf appeared, so I copied modules/restserver/src/main/resources/reference.conf and modules/joex/src/main/resources/reference.conf to those names respectively and ran zola serve again. Now a different file is reported missing; currently it is this one, referenced from writing.md:

❯ zola serve
Building site...
Error: Failed to render content of /Users/xanonx/experiments/docspell/website/site/content/docs/addons/writing.md
Reason: Failed to render incl_json shortcode
Reason: Failed to render 'shortcodes/incl_json.md'
Reason: Function call 'load_data' failed
Reason: `load_data`: /Users/xanonx/experiments/docspell/website/site/templates/shortcodes/addon-output doesn't exist

Do I have to get those files from somewhere?

@eikek
Owner

eikek commented Mar 28, 2023

@xshadowlegendx Yes, these files are in the source root because they are used within the application. The build tool copies them to the appropriate place when creating the website. I think the easiest way is to run sbt make-website (only needed once, to copy the necessary files).
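
A quick way to see whether that copy step has happened is to look for the files zola complained about. A rough sketch: the file names are taken from the error messages in this thread and the list may well be incomplete, the default directory assumes the command is run from the repository root, and missing_shortcode_files and SHORTCODES_DIR are made-up names:

```shell
#!/bin/sh
# Hypothetical check: print the shortcode data files that are not present yet.
# The names come from the zola errors in this thread; the list may be incomplete.
missing_shortcode_files() {
  dir="$1"   # e.g. website/site/templates/shortcodes
  for f in sample-exim.conf server.conf joex.conf addon-output; do
    [ -e "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# Default location relative to the repository root (an assumption).
SHORTCODES_DIR="${SHORTCODES_DIR:-website/site/templates/shortcodes}"
missing="$(missing_shortcode_files "$SHORTCODES_DIR")"
if [ -n "$missing" ]; then
  echo "missing in $SHORTCODES_DIR:"
  echo "$missing"
  echo "run 'sbt make-website' once to copy them"
fi
```

If nothing is printed, the files are already in place and zola serve should get past the load_data errors shown above.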

@xshadowlegendx
Contributor Author

@eikek ok thanks, I will try it again

@xshadowlegendx
Contributor Author

hello @eikek what do you think about this

Screenshot 2023-03-29 at 15 09 43

@xshadowlegendx
Contributor Author

xshadowlegendx commented Mar 29, 2023

Regarding this, I have confirmed that it is a missing font issue. I exec'd into the joex container and ran the unoconv command, which gave me ???? symbols in the converted pdf. After installing a Khmer font with apk add font-noto-khmer and running the convert command again, the text now shows correctly.
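
To check for the font before converting, something like the following could be run inside the container. This is a sketch only: it assumes fontconfig's fc-list is available and matches fonts by name, which is a heuristic; has_khmer_font is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the font list on stdin mentions a Khmer font.
# Matching on the name is a heuristic, not a real glyph coverage check.
has_khmer_font() {
  grep -qi 'khmer'
}

# fc-list (fontconfig) prints the installed fonts; skip the check if it is not available.
if command -v fc-list >/dev/null 2>&1; then
  if fc-list | has_khmer_font; then
    echo "Khmer font found"
  else
    echo "no Khmer font found; in the joex container: apk add font-noto-khmer"
  fi
fi
```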

but why does this docx show broken characters? Searches show the content being parsed correctly from the doc.

Screenshot 2023-03-17 at 19 56 26 Screenshot 2023-03-17 at 19 56 41

after the font is installed

Screenshot 2023-03-29 at 16 16 36

edit0: fixed grammar

@eikek
Owner

eikek commented Mar 30, 2023

Thank you @xshadowlegendx for all your efforts!

@xshadowlegendx
Contributor Author

hello @eikek, thank you as well for your amazing work on this project. Do you think the docs are good to be pushed?

@eikek
Owner

eikek commented Mar 31, 2023

> hello @eikek, thank you as well for your amazing work on this project. Do you think the docs are good to be pushed?

Yes sure, that is good! I left a small "thumbs up" 😄 You don't need to bother about the dead link in the docs. Sadly, the site it points to has been taken offline.

@xshadowlegendx
Contributor Author

@eikek oh haha, sorry, I did not notice the "thumbs up". Thank you very much!

@xshadowlegendx xshadowlegendx marked this pull request as ready for review April 1, 2023 05:18
@eikek eikek merged commit fd6b7ce into eikek:master Apr 5, 2023

Successfully merging this pull request may close these issues.

add khmer language