Further LLM Support #4732
Replies: 5 comments 4 replies
-
Sounds great, the more functionality the better IMO. Can you check your link? It returns a 404; could it be that your repository is set to private?
-
Hi @vdwals, I contributed to the language model chat plugin. 😊 I had a look through your GitHub project and was pleased to see you've implemented a plugin for embeddings. I initially started working on embeddings (for vector searching) and language model APIs (for prompting). My project required more focus on the prompting side, so I managed to complete that plugin, but I left the embeddings plugin unfinished. In my prototype, I used it to compute embeddings using BERT (hard-coded). Your version looks far more complete!
Regarding "embeddingmodels": as @bamaer said, the more functionality the better. Langchain4J's embeddings would be an excellent addition, and I'd love to try it out someday.
Regarding "languagemodels": I noticed that there is some overlap with the existing plugin (in HOP). As your plugin seems to currently support only Ollama, it might be best to integrate any new functionality from your code into the existing plugin, which supports more than just Ollama. I'd be happy to assist with that if needed!
Regarding "embeddingstores": that complements the embeddingmodels from what I can see, and it has support for in-memory and Neo4j, which is great.
Regarding "semanticsearch": from a quick glance, that seems to be the plugin that reuses the "embeddingmodels" meta for obtaining the model that creates embeddings from text, and the "embeddingstores" meta for storing those embeddings (vectors), then submits the search request to the store and outputs the relevant matches along with their scores. I do have some follow-up questions, but I can ask them in a separate comment.
Regarding "tableextraction": I only glanced at it quickly and it appears to be a domain-specific task, but it has the potential to be turned into something others can use. It prompts a model and returns a result, the result being some form of translation from JSON extracts into value types. Sounds like something @bjackson-ep was interested in or working with, from what he discussed with me at one stage.
Thanks for sharing your repository and the work you've been doing, it's all interesting.
-
Hi @vdwals (Dennis), I am happy to see you and Tristan contributing more innovation and thought in this area. I am super interested in your work and in furthering developments in these areas. I agree with Tristan that we should try to make any action/step as friendly as possible to the popular APIs and Ollama. I will check out your GitHub repo and try what you've created in the latest Hop.
As far as embeddings go, I am very interested in the semantic search nature of creating and using embeddings. Right now it seems like there are two separate tasks: creating embeddings, and then applying one as a search query to get some kind of top-n set of closest matches, for example by minimum cosine distance. Since it may not just be a matter of putting a single approach to work, what could we do to help people have a fuller exploratory cycle so they can understand what is happening? For example, the way we can use t-SNE to reduce the dimensionality of the embeddings to put them on a 2D or 3D plot: https://www.datacamp.com/tutorial/introduction-t-sne
Areas where I am trying to apply these in the real world are for mastering data. Say you have rows from several systems representing a person or product: how can we automatically steward those to come to a reasonable conclusion that they are similar enough to be called 'a person' or 'a product'? That way we can generate durable master keys for them, and the data becomes joinable across different systems. To do things like that, and referring back to embeddings, I am curious how knowledge graphs help that along. The embeddings themselves, we know, live in high-dimensional space; the graphs, as I currently understand them, may represent an ontology, but I am not sure at present how others encode the semantic meaning that exists in the shape of the graph into an embedding that models will understand. My examples are super simple, but perhaps they will inspire some ideas.
Over this next week I will probably start populating more and more ideas for the data quality approaches. My first inclination there was to use the function-calling abilities of the LLM to provide guarantees about the layout of the outputs. That way we could bring questions to the model, get a predictable JSON output structure, and convert that to a tabular form Hop understands. What kinds of things are you trying to solve using these technologies?
I will take a deeper look at your work this week and reply back more concretely on the plugins. Again, thank you for putting so much thought and effort into these things. We are shaping that experience for the many users who will encounter Hop, so the thought and time are worth it. Warmly, Brandon
-
I can share the approach I was taking with LangChain4J's embeddings and search, and see if it makes sense or helps with a direction. The overall idea is to split the tasks into independent plugins so that they can be reused more broadly and can leverage other plugins within HOP's ecosystem.
Plugin 1: Create the embeddings. In other words, transform text into number arrays.
Inputs: text field and model details.
This plugin selects an input field as the input value (e.g. a text value) and passes it through the model (typically an encoder like BERT). What you get back is a dense vector (the embedding). These might also be called feature vectors, word vectors, latent vectors, and to some extent logits. They can look like this:
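As an illustrative stand-in (this is not the plugin's code, and a real encoder such as BERT would produce hundreds of dimensions), a toy hashing "embedder" shows the shape of the output: a fixed-length, L2-normalised float array.

```java
import java.util.Arrays;

// Toy stand-in for an encoder model: hashes tokens into a small fixed-length
// vector purely to show the output shape. A real model (e.g. BERT) would
// produce a learned dense vector of e.g. 384 or 768 dimensions.
public class ToyEmbedder {
    static float[] embed(String text, int dims) {
        float[] v = new float[dims];
        for (String token : text.toLowerCase().split("\\s+")) {
            v[Math.floorMod(token.hashCode(), dims)] += 1.0f;
        }
        // L2-normalise so cosine similarity can be applied directly
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) for (int i = 0; i < dims; i++) v[i] /= norm;
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(embed("hello world hello hop", 8)));
    }
}
```

The point is only the shape of the result: one row of text in, one fixed-length array of numbers out.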
Your plugin already does this in the line highlighted below:
These numbers on their own are quite useful, and the output of this plugin is only to produce these arrays and nothing else. Think of it as a pre-processing step. Attaching a parquet/csv/json output file after this plugin has many applications: this plugin alone can be used to create a dataset that can be used or reused further down the workflow. In other words, once you create the embeddings, you don't need to recreate them every time you run a task. The pipelines or external tasks further down the workflow could read a parquet file that contains these embeddings.
Plugin 2 (concept): Store the embeddings.
Inputs: array of numbers.
Your plugin implements this concept:
Saving to a text file or parquet is one option that's already available out of the box in HOP, and quite useful. This plugin would be similar to existing HOP output transforms in that it receives inputs (vectors) and persists them somewhere designed for vectors. The in-memory option probably won't make much sense here, but it can in plugin 4.
Plugin 3 (general concept): The inverse of plugin 2, the input transform.
Inputs: file location or address, depending on the database type.
Similar to HOP's text file, JSON, parquet, and database inputs, it reads from a source and outputs the values from that source. I haven't used Neo4j's option, so I'm not sure if that can be reused, but for other sources such as Milvus (or whatever vector database), something would need to be created. Anything that is not out of the box in HOP (e.g. parquet, text, etc.) would fall under this plugin.
Plugin 4: Search/Similarity/etc.
Inputs: embedding ids, array of numbers, model details, and perhaps search parameters.
This is pretty much what your code does:
As for cosine similarity and t-SNE, as @usbrandon mentioned, I would leave those out. Linear methods such as PCA and non-linear ones like t-SNE are other forms of (pre/post) processing steps (e.g. analysis, dimensionality reduction), perhaps left for another HOP plugin or an external process such as Python to handle. For example, those can be run against the embeddings dataset produced by plugin 1 via Python.
Cosine similarity is one of the most commonly used functions, in that it measures the direction of the vectors. However, there are other similarity functions (Euclidean, dot product, etc.) and I would leave those up to the vector database or external tool to handle. They'll do a better job than something implemented in HOP; for example, they might use the GPU for better performance (e.g. Meta's Faiss). However, plugin 4 could allow extra parameters to be passed to the database/tool (e.g. to use a different function).
Regarding in-memory, it can be tricky to scale when the data is large, but it still serves a useful purpose. Your plugin already has it, and there are other forms as well. I was trying to fit Meta's Faiss into HOP, but abandoned that and instead used it as an in-memory similarity search (using Euclidean distance) via Python. Plugin 4 could read one input stream into memory and then do the searching using the values from another input stream. I think HOP's stream lookup plugin does something like this, in that it reads one stream into memory and then uses it to process the values (lookups) of another stream.
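A minimal sketch of that stream-lookup-style idea, assuming plain Euclidean distance and an index small enough to fit in memory (the class and method names here are hypothetical, not the plugin's API):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of plugin 4's in-memory mode: load id->vector pairs
// from one "stream" into memory, then answer lookups from another stream
// with the top-n nearest vectors by Euclidean distance, much like HOP's
// stream lookup keeps one stream in memory to serve the other.
public class InMemoryVectorLookup {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the ids of the n closest stored vectors, best match first.
    static List<String> topN(Map<String, double[]> index, double[] query, int n) {
        return index.entrySet().stream()
            .sorted(Comparator.comparingDouble(e -> euclidean(e.getValue(), query)))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> index = new LinkedHashMap<>();
        index.put("a", new double[]{1.0, 0.0});
        index.put("b", new double[]{0.0, 1.0});
        index.put("c", new double[]{0.9, 0.1});
        System.out.println(topN(index, new double[]{1.0, 0.0}, 2)); // prints [a, c]
    }
}
```

In a real transform, the index could come from plugin 3's input stream, with n and the distance function exposed as transform options.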
-
@usbrandon Just to extend on cosine similarity: could that fit into HOP's existing calculator step? It already has other distance calculations in there, such as Hamming and Levenshtein. Cosine similarity could be added to that list, so it might not need to be created as another plugin. Below is pseudo-code of the function. Of course, it might not perform as fast as a database that implements it better, or as C++/Rust.
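Since the pseudo-code itself isn't shown in the thread, here is a hedged reconstruction in plain Java of what such a calculator function could look like:

```java
// Plain cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Assumes both arrays have the same length
// and neither is the zero vector (a real calculator step would guard both).
public class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];    // numerator: dot product
            normA += a[i] * a[i];  // squared magnitude of a
            normB += b[i] * b[i];  // squared magnitude of b
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {2.0, 4.0, 6.0};
        System.out.println(cosine(a, b)); // parallel vectors -> ~1.0
    }
}
```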
The similarity collection would then look like this (see below). I'm not sure where to fit this into HOP's existing features. The whole lot could be done within the JavaScript/Groovy step, but that's not ideal.
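For illustration only (the ids and scores below are invented), the collected results could be held as id/score pairs sorted best-first:

```java
import java.util.*;

// Hypothetical shape of the similarity collection: each match id paired
// with its cosine score, ranked from best match downwards.
public class SimilarityCollection {
    static List<Map.Entry<String, Double>> rank(Map<String, Double> scores, int n) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked.subList(0, Math.min(n, ranked.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("doc-17", 0.92, "doc-04", 0.87, "doc-42", 0.51);
        for (Map.Entry<String, Double> e : rank(scores, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints doc-17 -> 0.92, then doc-04 -> 0.87
    }
}
```

Each entry maps naturally onto a HOP output row with an id field and a score field.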
Perhaps Java's Vector API (if enabled) could make this into something more practical within HOP, relying less on external software.
-
Hi all,
I greatly appreciate the recent transformations added for LLM support. I started a similar implementation for customer projects and created this plugin: GitHub Link. The "semantic search" in particular has been a very useful feature.
In my implementation, I introduced three new metadata types specifically tailored for LLM use cases. Now, I’m considering whether it would be beneficial to integrate my work into the main codebase and align the metadata usage, or if it’s better to continue developing it as a standalone plugin.
I’d love to hear your thoughts on this.
Best regards,
Dennis