The text in Fondant datasets is stored as strings, but we often need to render it in a form that is readable for the user:
Properly show basic formatting such as newlines
Show HTML documents
Show PDF documents
Show JSON
We can do some basic detection on the text. By default it is treated as raw text, which can be rendered as a PDF; otherwise, we can check whether it's an HTML document or JSON. The user can click on a dataset row, or maybe even search by id, to find the document. An example of how this could look:
Screencast.2023-11-15.17.19.18.mp4
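The basic detection described above could be sketched roughly as follows. This is only an illustration, not the viewer's actual implementation: the function name `detect_format`, the returned labels, and the HTML heuristic (looking for a doctype, `<html>` or `<body>` tag near the start) are all assumptions.

```python
import json
import re

# Hypothetical heuristic: treat the text as HTML if a doctype, <html> or
# <body> tag shows up near the start of the string.
_HTML_HINT = re.compile(r"<!doctype\s+html|<html[\s>]|<body[\s>]", re.IGNORECASE)


def detect_format(text: str) -> str:
    """Guess how a stored string should be rendered: 'json', 'html' or 'text'."""
    stripped = text.strip()

    # Only try to parse JSON when the text looks like an object or array,
    # so that plain numbers or quoted words are not reported as JSON.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass

    # Inspect only the beginning of the document to keep the check cheap.
    if _HTML_HINT.search(stripped[:2048]):
        return "html"

    # Default: raw text.
    return "text"
```

The viewer could then choose a renderer per row based on the returned label, falling back to raw text whenever neither check matches.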
We can display the documents either in the same tab or in a separate tab. Another thing to test is the ability to properly reconstruct a PDF from a string. For that, we would probably have to develop a PDF/document loader component first, to determine which representation is needed when serializing the document.
We can render the original PDFs back in the viewer from either:
The raw text extracted by the LlamaIndex loaders. The advantage is that it's lightweight and requires no extra work, though most formatting is lost (document structure, colors, tables, ...).
A base64 string: we encode the document in the document loading component and store it in an extra column, which lets us render it back. The disadvantage is that it's less lightweight, since we store the data twice in different formats (llama doc + base64).
For now we can support both options (we can check whether the string is an encoded PDF bytestring).
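A minimal sketch of that check, assuming the extra column holds standard base64 of the original file: every PDF starts with the bytes `%PDF-`, so decoding only the first few characters is enough to tell the two options apart. The function name `is_base64_pdf` is hypothetical.

```python
import base64
import binascii

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def is_base64_pdf(value: str) -> bool:
    """Return True if `value` looks like a base64-encoded PDF."""
    # 8 base64 characters decode to 6 bytes, enough to cover the 5-byte header
    # without decoding the whole document.
    head = value.strip()[:8]
    try:
        prefix = base64.b64decode(head, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (e.g. raw extracted text containing spaces).
        return False
    return prefix.startswith(PDF_MAGIC)
```

If the check passes, the viewer could decode the full string with `base64.b64decode` and render the PDF; otherwise it would fall back to showing the raw extracted text.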