
endpoint for embeddings #814

Open

pseudotensor opened this issue Sep 7, 2023 · 5 comments

pseudotensor (Collaborator) commented Sep 7, 2023

From the Hugging Face write-up on scaling with gunicorn (https://medium.com/huggingface/scaling-a-massive-state-of-the-art-deep-learning-model-in-production-8277c5652d5f):

> We used [falcon](https://falconframework.org/) for the web servers (any other HTTP framework would have worked too) in conjunction with [gunicorn](https://gunicorn.org/) to run our instances and balance the load. Our own [GPT-2 Pytorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) is the backbone of this project. We have a few examples in our [examples directory](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples) if you're interested in doing something similar.
>
> Gunicorn sets up "workers" which independently run the application, efficiently balancing the load across the different workers. You can check exactly how they work in the [official gunicorn documentation](http://docs.gunicorn.org/en/stable/design.html).
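As a rough sketch of the worker model: gunicorn just forks N processes that each serve the same WSGI callable, so the app itself stays trivial. The endpoint below is a hypothetical placeholder (the route, response shape, and dummy vector are assumptions, not from any project here), shown only to illustrate what gunicorn's workers would run:

```python
import json

def app(environ, start_response):
    # Hypothetical embeddings endpoint: in a real server this is where the
    # model would be invoked; here we return a fixed-size dummy vector.
    body = json.dumps({"embedding": [0.0] * 4}).encode()
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

# Load-balance across several independent worker processes, e.g.:
#   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
```

Each worker holds its own copy of the model, so worker count trades memory for throughput.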

HF-supported server: https://localai.io/features/embeddings/index.html

Others:
https://python.langchain.com/docs/integrations/text_embedding/xinference
https://python.langchain.com/docs/integrations/text_embedding/localai
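Servers like LocalAI expose an OpenAI-compatible `/v1/embeddings` route, so a client only needs to POST `{"model": ..., "input": ...}` and read back `data[i].embedding`. A minimal stdlib-only client sketch (the base URL and model name are placeholders; the exact response fields beyond `data[*].embedding` may vary by server):

```python
import json
import urllib.request

def build_request(base_url, model, texts):
    # Build a POST against the OpenAI-compatible embeddings route.
    payload = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(base_url, model, texts):
    # Expected response shape: {"data": [{"embedding": [...], "index": 0}, ...]}
    with urllib.request.urlopen(build_request(base_url, model, texts)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

# Usage (assumes a server is actually listening on localhost:8080):
#   vectors = embed("http://localhost:8080", "text-embedding-ada-002", ["hello"])
```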


Far0n commented Jan 9, 2024

@pseudotensor I checked https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings:

```shell
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    pip3 install \".[GPU]\" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
    convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
    --backend tensorrt onnx \
    --task embedding \
    --seq-len 16 128 128"
```

after that I'm getting:

```
[01/09/2024-13:11:01] [TRT] [E] 3: [builderConfig.cpp::validatePool::313] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::validatePool::313, condition: false. Setting DLA memory pool size on TensorRT build with DLA disabled.
)
[01/09/2024-13:11:01] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/09/2024-13:11:01] [TRT] [W] building engine. depending on model size this may take a while
[01/09/2024-13:11:02] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
[01/09/2024-13:11:02] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 494, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 311, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f6c7de46170>, None
free(): invalid pointer
```

Overall, not a good first impression.
