Further LLM Support #4732

vdwals · 2024-12-19T21:21:52Z

vdwals
Dec 19, 2024

Hi all,

I greatly appreciate the recent transformations added for LLM support. I started a similar implementation for customer projects and created this plugin: GitHub Link. Especially the "semantic search" was a very useful feature.

In my implementation, I introduced three new metadata types specifically tailored for LLM use cases. Now, I’m considering whether it would be beneficial to integrate my work into the main codebase and align the metadata usage, or if it’s better to continue developing it as a standalone plugin.

I’d love to hear your thoughts on this.

Best regards,
Dennis

bamaer · 2024-12-19T21:26:49Z

bamaer
Dec 19, 2024
Collaborator

sounds great, the more functionality the better IMO. Can you check your link? It returns a 404, could it be that your repository is set to private?

1 reply

vdwals Dec 20, 2024
Author

Apologies for the mix-up! It was set to private as I intended to finalize it before publishing. However, the latest transformations in Hop might present an exciting opportunity to revisit and enhance it further.

ep9io · 2025-01-02T10:54:04Z

ep9io
Jan 2, 2025

Hi @vdwals,

I contributed to the language model chat plugin. 😊

I had a look through your GitHub project and was pleased to see you've implemented a plugin for embeddings. I initially started working on embeddings (for vector searching) and language model APIs (for prompting). My project required more focus on the prompting side, so I managed to complete that plugin, but I left the embeddings plugin unfinished. In my prototype, I used it to compute embeddings using BERT (hard coded). Your version looks far more complete!

Regarding "embeddingmodels", as @bamaer said, the more functionality the better. Langchain4J's embeddings would be an excellent addition, and I'd love to try it out someday.

Regarding "languagemodels", I noticed that there is some overlap with the existing plugin (in HOP). As your plugin seems to currently support only Ollama, it might be best to integrate any new functionality that your code has into the existing plugin that supports more than just Ollama. I’d be happy to assist with that if needed!

Regarding "embeddingstores", that complements the embeddingmodels from what I can see and it has support for in-memory and neo4j, which is great.

Regarding "semanticsearch", from a quick glance, that seems to be the plugin that reuses the "embeddingsmodel" meta for obtaining the model to create embeddings from text and the "embeddingstores" for storing those embeddings (vectors). Then, submitting the search request to the store and outputting the relevant matches along with their scores. I do have some follow up questions, but I can ask them in a separate comment.

Regarding "tableextraction", I only glanced at it quickly and it appears to be a domain specific task, but has potential to be turned into something others can use. It prompts a model and returns a result, with the result being some form of translation from json extracts into value types. Sounds like something @bjackson-ep was interested in or working with, from what he discussed with me at one stage.

Thanks for sharing your repository and the work you've been doing, it's all interesting.

0 replies

usbrandon · 2025-01-02T14:59:24Z

usbrandon
Jan 2, 2025
Collaborator

Hi @vdwals (Dennis),

I am happy to see you and Tristan contributing more innovation and thoughts in this area. I am super interested in your work and furthering developments in these areas. I agree with Tristan to try to make any action/step as friendly as possible to using the popular APIs & Ollama. I will check out your github repo and try what you've created in the latest Hop.

As far as embeddings go, I am very interested in the semantic search nature of creating and using embeddings. Right now it seems like there are two separate tasks. Creating embeddings and then applying one as a search query for some kind of top n responses of closest matchs, like the minimum cosine difference of the responses. Since it may not just be a matter of putting a single approach to work, what could we do to help people have a fuller exploratory cycle so they can understand what is happening? For example, the way we can us t-SNE to reduce the dimensionality of the embeddings to put them on a 2d or 3d plot. https://www.datacamp.com/tutorial/introduction-t-sne
Being someone new in an applied sense, in my own repo, I had to create singular examples to try to break down concepts to understand how to apply them in a system. Hop, generally, also breaks activities down into singular actions that make composable and reusable blocks.

Areas where I am trying to apply these in the real world are for mastering data. Say you have rows from several systems representing a person or product, how can we automatically steward those to come to a reasonable conclusion that they are similar enough to be called 'a person' or 'a product'? That way we can generate durable (master keys) for them, and the data become joinable across different systems.

To do things like that, but also referring back to embeddings, I am curious how knowledge graphs help that along. The embeddings themselves we know are in high dimensional space, but the graphs, as I currently know, they may represent an ontology, but I am not sure how at present others encode the semantic meaning that exist due to the shape of the graph into an embedding that models will understand.

My examples are super simple, but perhaps they will inspire some ideas. Probably over this next week I will start populating more and more ideas for the data quality approaches. My first inclination there was to use the function calling abilities of the LLM to provide guarantees about the layout of the outputs. That way we could bring questions into the model and get a predictable output JSON structure and convert that to a tabular form Hop understands.
https://github.com/usbrandon/gptplayground

What kinds of things are you trying to solve using these technologies?

I will take a deeper look at your work this week and reply back more concretely on the plugins. Again. Thank you for putting so much thought and effort into these things. We are shaping the future for that experience for many users who will encounter Hop so the thought and time is worth it.

Warmly,

Brandon

0 replies

ep9io · 2025-01-02T16:28:50Z

ep9io
Jan 2, 2025

I can share the approach I was taking with the use of LangChain4J's embeddings and search, and see if it makes sense or helps with a direction.

The overall idea is to split the tasks into independent plugins so that they can be reused more broadly and leverage other plugins within the HOP's ecosystem.

Create embeddings
Persist embeddings.
Retrieve embeddings
Compare/search embeddings

Plugin 1: Create the embeddings. In other words, transform text into number arrays.

Inputs: text field and model details
Outputs: array of numbers

This plugin selects an input field as the input value (e.g. text value) and passes it through the model (typically an encoder like bert). What you get back is a dense set (embeddings). They might also be called feature vectors, word vectors, latent space, and to some extent logits. They can look like this:

[0.12, -0.34, 0.56, 0.78, -0.91] This could represent a word in a 5-dimensional embedding space.

Your plugin already does this in the line highlighted below:

private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues)
...
      TextSegment segment = TextSegment.from(textValue);
  >>> Embedding embedding = data.embeddingModel.embed(segment).content();  <<<
      String keyString = String.valueOf(keyValue);
      data.embeddingStore.add(keyString, embedding);
...
  }

These numbers on their own are quite useful and the output of this plugin is only to produce these arrays and nothing else. Think of it as a pre-processing step.

Attaching a parquet/csv/json output file after this plugin has many applications. This plugin alone can be used to create a dataset that can be used or reused further down the workflow. In other words, once you create the embeddings, you don't need to be recreating them every time you are running a task. The pipelines or external tasks further down the workflow could be reading a parquet file that contains these embeddings.

Plugin 2 (concept): Store the embeddings

Inputs: array of numbers
Output: Not sure, perhaps saving results or the IDs of the entries.

Your plugin implements this concept:

private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues)
...
      TextSegment segment = TextSegment.from(textValue);
     Embedding embedding = data.embeddingModel.embed(segment).content();
      String keyString = String.valueOf(keyValue);
   >>> data.embeddingStore.add(keyString, embedding);  <<<
...
  }

Saving to a text file or parquet is one option that's already out of the box in HOP, and quite useful.
However, those are handy for backups, troubleshooting, or further analysis.
A more practical method is to store them into a database that's designed for vectors, as your plugin does already.
From what I can see in your plugin, it uses in-memory and neo4j. The plugin I was working on was going to save to Milvus.

This plugin would be similar to existing output HOP transforms in that it receives inputs (vectors) and persists them somewhere designed for vectors. The in-memory option probably won't make much sense here, but it can in plugin 4.

Plugin 3 (general concept): The inverse of plugin 2, the input transform.

Inputs: file location or address, depending on the database type.
Outputs: array of numbers.

Similar to HOP's Text File, json, parquet, DBs, it reads from a source and outputs the values from that source. I haven't used Neo4J's option so not sure if that can be reused, but for other sources such as Milvus (or whatever vector database), something would need to be created. All that is not out of the box in HOP (e.g. parquet, text, etc.) would fall under this plugin.

Plugin 4: Search/Similarity/etc.

Inputs: embedding ids, Array of numbers, model details, and perhaps searching parameters.
Outputs: Generated records containing the relevant matches along with their scores.

This is pretty much what your code does:

 private List<Object[]> getFromVector(Object[] keyRow) throws HopValueException {
...
    String lookupValueString = getInputRowMeta().getString(keyRow, data.indexOfMainField);

    Embedding queryEmbedding = data.embeddingModel.embed(lookupValueString).content();
    EmbeddingSearchRequest esr = new EmbeddingSearchRequest(queryEmbedding, data.maxResults, null, null);
    EmbeddingSearchResult<TextSegment> relevant = data.embeddingStore.search(esr);

    List<Object[]> retList = new ArrayList<Object[]>(data.maxResults);

    for (EmbeddingMatch<TextSegment> embeddingMatch : relevant.matches()) {
      String key = embeddingMatch.embeddingId();
      Double score = embeddingMatch.score();

      Object[] matchedRow = data.look.get(key);
      Object resultKey = matchedRow[SemanticSearchData.KEY_INDEX];
      String value = (String) matchedRow[SemanticSearchData.VALUE_INDEX];
      ...
  }

As for cosine similarity and t-SNE, as @usbrandon mentioned, I would leave those out. Linear methods such as PCA and non-linear like t-SNE etc. are other forms of (pre/post) processing steps (e.g. analysis, dimensionality reduction) and perhaps left for another HOP plugin or an external process such as python to handle. For example, those can be done against the embeddings dataset that was produced by plugin 1 via python. Cosine similarity is one of the most common use functions in that it measures the direction of the vectors. However, there are other similarity search functions (euclidean, dot product, etc.) and I would leave that up to the vector database or external tool to handle. They'll do a better job than something implemented in HOP. For example, they might use the GPU for better performance (e.g. Meta's Faiss). However, plugin 4 could allow for extra parameters that are passed to the database/tool (e.g. to use a different function).

Regarding in-memory, it can be tricky for it to scale if it's large, but still serves a useful purpose. Your plugin already has it and there are other forms as well. I was trying to fit Meta's Faiss into HOP, but abandoned it and instead used it as an in-memory similarity search (using euclidean distance) via Python. Plugin 4 could allow to read an input stream into memory and then do the searching using the values from another input stream. Think HOP's stream lookup plugin does something like this in that it reads into memory one stream and then uses that to process the values (lookups) of another stream.

2 replies

usbrandon Jan 2, 2025
Collaborator

I agree with you here. What an incredible response. What we may need to ask @hansva and @bamaer about is what methods should we use to represent those feature vectors in Hop. I have seen them materialized as JSON with an array of floating point numbers. We might need to provide options. For example, a present bug/limitation is a JSON type was recently added to Hop, but even the JSON steps do not behave correctly with it because they were created before the base type was, and we have to therefore use Strings for JSON. Where I am going with that is if we introduce an output to a vector database we need to be able to convert between formats so we can feed the vector database what it expects, but also retrieve from it and output to a format humans can deal with, which is probably JSON.
At least all of the Python scripts out there seem to focus on loading JSON into lists or dictionaries for further inspection. Those lists and dictionaries and in turn be given to Pandas and scikit-learn for deeper inspection by the user.

I love your idea of splitting out the different activities into their own plugins/steps within a single plugin.

vdwals Jan 2, 2025
Author

Hi all,

First of all, thank you for your detailed and expressive response.

As background to my plugin: We had an actual request from a customer that I wanted to address using HOP to demonstrate its benefits. The request was to identify ingredients in recipes and check for allergies, among other things. Since the same ingredient can have multiple names, and the free sources available compile them in various languages, the most practical solution was to use semantic search with LangChain and Ollama as a local model—primarily due to GDPR concerns, as usual.

To achieve this, I started by cloning the stream lookup and tackled the task holistically. Once the initial implementation was successful, I began exploring additional scenarios where LLM features could benefit my customers. These included:

Similar Search
Structured Data Extraction from Text (e.g., Table Extraction): Here I faced scraping several web sites and extracting like service names and ports from different pages. I discovered that LLMs are great at this.
Classification: Classifying texts based on a predefined set of classes from a database, another stream, or constants.

With these use cases in mind, and after satisfying my customer's requirements, I began restructuring the plugin to make it more reusable and flexible. This included extracting metadata into specific, expandable types. However, I encountered challenges, particularly since vector storages lacked native support. Integrating Neo4J as a vector storage took more effort than anticipated, as I wanted to avoid duplicating metadata and implementation. Additionally, making the metadata types easily extensible required considerable effort.

I was also thinking about separating embedding creation, storage, and retrieval to better handle data scaling and support asynchronous processing for embedding and lookups. This is an area I’d love to focus on further.

ep9io · 2025-01-02T17:17:10Z

ep9io
Jan 2, 2025

@usbrandon Only to extend on the cosine similarity, that could fit into the existing HOP's calculator step? It already has other distance measuring calculations in there such as hamming and levenshtein. Cosine similarity could be added to that list, might not need to be created in another plugin.

Below is a pseudo-code of the function. Of course, it might not perform as fast as a database that has it better implemented or in C++/Rust.

FUNCTION cosine_similarity(vector1, vector2):
    dot_product = 0
    magnitude1 = 0
    magnitude2 = 0
    
    FOR i FROM 0 TO LENGTH(vector1) - 1:
        dot_product += vector1[i] * vector2[i]
        magnitude1 += vector1[i]^2
        magnitude2 += vector2[i]^2
    
    magnitude1 = SQRT(magnitude1)
    magnitude2 = SQRT(magnitude2)
    
    IF magnitude1 == 0 OR magnitude2 == 0:
        RETURN 0  # Avoid division by zero
    
    RETURN dot_product / (magnitude1 * magnitude2)

The similarity collection would then look like this (see below). Not sure where to fit this into HOP existing features. The whole lot of it can be done within the javascript/groovy step, but not ideal.

# Store similarities
similarities = []

FOR data_point IN data_points:
    similarity = cosine_similarity(query, data_point)
    ADD similarity TO similarities

# Find the most similar
max_similarity = -1
most_similar_index = -1

FOR i FROM 0 TO LENGTH(similarities) - 1:
    IF similarities[i] > max_similarity:
        max_similarity = similarities[i]
        most_similar_index = i

Perhaps java's Vector API (if enabled) could make this into something more practical within HOP and thus relying less on external software.

1 reply

usbrandon Jan 2, 2025
Collaborator

Yes, I think adding these difference calculations to the Calculator step make sense. Another step that does calculation is the "Formula" step that we could also consider. The issue of what are the incoming data types and outgoing data types are the issue. Having a separate query step let's the user control how many rows would return too, which implies that the query step would have whatever the vector database is doing those similarity / difference calculations to determine what feature vectors to return.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Further LLM Support #4732

{{title}}

Replies: 5 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Further LLM Support #4732

vdwals Dec 19, 2024

Replies: 5 comments · 4 replies

bamaer Dec 19, 2024 Collaborator

vdwals Dec 20, 2024 Author

ep9io Jan 2, 2025

usbrandon Jan 2, 2025 Collaborator

ep9io Jan 2, 2025

Plugin 1: Create the embeddings. In other words, transform text into number arrays.

Plugin 2 (concept): Store the embeddings

Plugin 3 (general concept): The inverse of plugin 2, the input transform.

Plugin 4: Search/Similarity/etc.

usbrandon Jan 2, 2025 Collaborator

vdwals Jan 2, 2025 Author

ep9io Jan 2, 2025

usbrandon Jan 2, 2025 Collaborator

vdwals
Dec 19, 2024

Replies: 5 comments 4 replies

bamaer
Dec 19, 2024
Collaborator

vdwals Dec 20, 2024
Author

ep9io
Jan 2, 2025

usbrandon
Jan 2, 2025
Collaborator

ep9io
Jan 2, 2025

usbrandon Jan 2, 2025
Collaborator

vdwals Jan 2, 2025
Author

ep9io
Jan 2, 2025

usbrandon Jan 2, 2025
Collaborator