[BUG] model deployment fails -- Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager #3207

Open
jovanovic-milos opened this issue Nov 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jovanovic-milos

jovanovic-milos commented Nov 9, 2024

What is the bug?
Deploying the model fails with what appears to be an exception in ml-commons.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Export the multilingual-e5-large model with optimum export (https://huggingface.co/intfloat/multilingual-e5-large)
  2. ZIP the model directory
  3. Register the model with OpenSearch via the API
  4. Deploy the model
  5. Check the OpenSearch logs (sometimes a "connection timed out" error appears too; in that case I just try to deploy the model again)
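
Steps 2 and 3 can be sketched in Python as follows. This is only a minimal sketch, not the reporter's actual procedure: the function names are hypothetical, it assumes the exported files should sit at the root of the archive, and it assumes the `model_content_hash_value` field of the register request is the SHA-256 digest of the ZIP file.

```python
import hashlib
import zipfile
from pathlib import Path


def zip_model_dir(model_dir: Path, zip_path: Path) -> Path:
    """Zip every file under model_dir, storing paths relative to model_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(model_dir.rglob("*")):
            # Skip directories, and skip the archive itself if it is
            # being written inside model_dir.
            if not file.is_file() or file.resolve() == zip_path.resolve():
                continue
            zf.write(file, file.relative_to(model_dir).as_posix())
    return zip_path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (paths are placeholders):
#   archive = zip_model_dir(Path("multilingual-e5-large_onnx"),
#                           Path("multilingual-e5-large.zip"))
#   print(sha256_of_file(archive))
```

If the digest printed here differs from the value sent at registration, the archive was probably rebuilt after the hash was computed.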

What is the expected behavior?
Successful deployment of the model

What is your host/environment?
OpenSearch 2.18 running in Docker

Do you have any additional context?
org.opensearch.ml.common.exception.MLException: Failed to deploy model w1BJEpMBbOORGaoAR7h5
2024-11-09T19:29:46.547698532Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:300) ~[?:?]
2024-11-09T19:29:46.547704056Z at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
2024-11-09T19:29:46.547708040Z at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) ~[?:?]
2024-11-09T19:29:46.547723453Z at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) ~[?:?]
2024-11-09T19:29:46.547727230Z at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
2024-11-09T19:29:46.547730758Z at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1083) ~[?:?]
2024-11-09T19:29:46.547734525Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547738193Z at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$73(MLModelManager.java:1703) [opensearch-ml-2.17.0.0.jar:2.17.0.0]
2024-11-09T19:29:46.547741754Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547745270Z at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547748852Z at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547752467Z at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547755951Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2024-11-09T19:29:46.547759414Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2024-11-09T19:29:46.547762898Z at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-11-09T19:29:46.547766339Z Caused by: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager
2024-11-09T19:29:46.547769823Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547773286Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547779006Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547782609Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547786115Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547789621Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547795624Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547801105Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547804633Z ... 14 more
2024-11-09T19:29:46.547808106Z Caused by: java.lang.ExceptionInInitializerError: Exception ai.djl.engine.EngineException: Failed to save pytorch index file [in thread "opensearch[opensearch-node][opensearch_ml_deploy][T#7]"]
2024-11-09T19:29:46.547813577Z at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[?:?]
2024-11-09T19:29:46.547822391Z at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[?:?]
2024-11-09T19:29:46.547826200Z at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[?:?]
2024-11-09T19:29:46.547829717Z at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[?:?]
2024-11-09T19:29:46.547833234Z at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
2024-11-09T19:29:46.547836783Z at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[?:?]
2024-11-09T19:29:46.547840279Z at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[?:?]
2024-11-09T19:29:46.547843698Z at ai.djl.engine.Engine.getInstance(Engine.java:145) ~[?:?]
2024-11-09T19:29:46.547847149Z at ai.djl.onnxruntime.engine.OrtEngine.getAlternativeEngine(OrtEngine.java:75) ~[?:?]
2024-11-09T19:29:46.547850623Z at ai.djl.ndarray.BaseNDManager.<init>(BaseNDManager.java:64) ~[?:?]
2024-11-09T19:29:46.547854324Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:42) ~[?:?]
2024-11-09T19:29:46.547858210Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:35) ~[?:?]
2024-11-09T19:29:46.547861911Z at ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>(OrtNDManager.java:177) ~[?:?]
2024-11-09T19:29:46.547865450Z at ai.djl.onnxruntime.engine.OrtNDManager.<clinit>(OrtNDManager.java:37) ~[?:?]
2024-11-09T19:29:46.547869043Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547872635Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547876120Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547879582Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547884022Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547887604Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547891131Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547894789Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547898415Z ... 14 more
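
Read from the bottom up, the trace suggests that the `NoClassDefFoundError` for `OrtNDManager` is a consequence rather than the root cause: the innermost `Caused by` reports `Failed to save pytorch index file` from `ai.djl.pytorch.jni.LibUtils.downloadPyTorch`, reached via `OrtEngine.getAlternativeEngine` inside the static initializer of `OrtNDManager`. A small sketch for pulling that cause chain out of a Docker-timestamped log like the one above is shown here; the helper names are hypothetical, and the timestamp pattern is an assumption based only on this log.

```python
import re

# Docker log prefix as it appears in this trace, e.g. 2024-11-09T19:29:46.547698532Z
TIMESTAMP = re.compile(r"\s*\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\s*")
CAUSED_BY = "Caused by: "


def split_entries(log_text: str) -> list[str]:
    """Split a log into entries, using the timestamps as separators."""
    return [entry.strip() for entry in TIMESTAMP.split(log_text) if entry.strip()]


def cause_chain(log_text: str) -> list[str]:
    """Return the 'Caused by' messages in order; the last one is the innermost."""
    return [
        entry[len(CAUSED_BY):]
        for entry in split_entries(log_text)
        if entry.startswith(CAUSED_BY)
    ]
```

Because the timestamps act as separators, the same helper works whether the log is one frame per line or has been flattened into a single line.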

@jovanovic-milos jovanovic-milos added bug Something isn't working untriaged labels Nov 9, 2024
@jovanovic-milos jovanovic-milos changed the title [BUG] [BUG] model deployment fails -- Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager Nov 9, 2024
@mingshl
Collaborator

mingshl commented Nov 19, 2024

@jovanovic-milos can you please share the command you used to register the model? We need it to reproduce the issue. Please also let us know the model type that you used. Thanks.

@mingshl mingshl removed the untriaged label Nov 19, 2024
@mingshl mingshl moved this to On-deck in ml-commons projects Nov 19, 2024
@jovanovic-milos
Author

jovanovic-milos commented Nov 25, 2024

Hey @mingshl,

thanks for replying! I haven't been able to reproduce the issue since last week, and deployment now seems to be working again. I didn't change anything in my project, and I'm still using the latest Docker image of OpenSearch. But just in case you want to try it out:

POST https://localhost:9200/_plugins/_ml/models/_register

{
  "name": "intfloat/multilingual-e5-large",
  "version": 1,
  "description": "Multilingual E5-Large",
  "model_format": "ONNX",
  "model_group_id": "{{MODEL_GROUP_ID}}",
  "model_content_hash_value": "82b21651057d4a0560b7538ad0425a0e5e1533acfc3ae5d8c04f61d5f8ace048",
  "model_task_type": "TEXT_EMBEDDING",
  "model_config": {
    "model_type": "xlmroberta",
    "embedding_dimension": 1024,
    "framework_type": "sentence_transformers",
    "pooling_mode": "MEAN",
    "normalize_result": true,
    "all_config": "{\"_attn_implementation_autoset\":true,\"_name_or_path\":\"intfloat\/multilingual-e5-large\",\"architectures\":[\"XLMRobertaModel\"],\"attention_probs_dropout_prob\":0.1,\"bos_token_id\":0,\"classifier_dropout\":null,\"eos_token_id\":2,\"export_model_type\":\"transformer\",\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":1024,\"initializer_range\":0.02,\"intermediate_size\":4096,\"layer_norm_eps\":0.00001,\"max_position_embeddings\":514,\"model_type\":\"xlm-roberta\",\"num_attention_heads\":16,\"num_hidden_layers\":24,\"output_past\":true,\"pad_token_id\":1,\"position_embedding_type\":\"absolute\",\"torch_dtype\":\"float32\",\"transformers_version\":\"4.46.2\",\"type_vocab_size\":1,\"use_cache\":true,\"vocab_size\":250002}"
  },
  "url": "file:///usr/share/opensearch/models/multilingual-e5-large_onnx/multilingual-e5-large.zip"
}

After the registration was finished, I simply called:

POST https://localhost:9200/_plugins/_ml/models/_sl0A5MB6BicdzaxZO3v/_deploy
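
For anyone scripting the same sequence, the two calls above could be built roughly as follows. This is a sketch under assumptions, not the reporter's tooling: the function names are hypothetical, authentication and TLS settings are left out, and the task-status endpoint (`GET /_plugins/_ml/tasks/<task_id>`) does not appear in this thread and is included on the assumption that registration is asynchronous and returns a task ID. The functions only construct the requests; sending them is left to the caller.

```python
import json
import urllib.request

# Base URL as used in the requests above; adjust for your cluster.
BASE_URL = "https://localhost:9200"


def register_request(body: dict) -> urllib.request.Request:
    """Build the POST request that registers a model from a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/_plugins/_ml/models/_register",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def task_request(task_id: str) -> urllib.request.Request:
    """Build the GET request for a task's status (assumed endpoint, see above)."""
    return urllib.request.Request(
        f"{BASE_URL}/_plugins/_ml/tasks/{task_id}", method="GET"
    )


def deploy_request(model_id: str) -> urllib.request.Request:
    """Build the POST request that deploys an already registered model."""
    return urllib.request.Request(
        f"{BASE_URL}/_plugins/_ml/models/{model_id}/_deploy", method="POST"
    )


# Sending a request (needs a reachable cluster plus credentials/TLS setup):
#   with urllib.request.urlopen(deploy_request("<model_id>")) as resp:
#       print(json.load(resp))
```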
