Enable better view of documents #631

PhilippeMoussalli · 2023-11-15T10:05:15Z

The text in Fondant datasets is stored as a string, but we often need to render it in a format that is readable for the user:

Properly show basic formatting such as newlines
Show HTML documents
Show PDF documents
Show JSON

We can do some basic detection on the text: by default it would be treated as raw text that can be rendered to a PDF; otherwise, we can check whether it is an HTML document or JSON. The user can click on a dataset row, or maybe even search by ID, to find the document. An example of how this could look:

Screencast.2023-11-15.17.19.18.mp4
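The basic detection described above could be sketched roughly as follows. This is only an illustration, not the explorer's actual code: the function name `detect_format`, the returned labels, and the list of HTML tags used as a hint are all assumptions.

```python
import json
import re

# Hypothetical heuristic: a handful of common tags is taken as a sign of HTML.
_HTML_HINT = re.compile(
    r"<\s*(!doctype\s+html|html|head|body|div|p|span|table)\b",
    re.IGNORECASE,
)


def detect_format(text: str) -> str:
    """Guess how a stored string should be rendered: 'json', 'html' or 'text'."""
    stripped = text.strip()

    # Only try to parse JSON when the string looks like an object or array,
    # so that bare numbers or quoted words still count as raw text.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # not valid JSON, fall through to the other checks

    # Look for a known tag near the start of the document.
    if _HTML_HINT.search(stripped[:1000]):
        return "html"

    # Default: raw text (newlines and other basic formatting kept as-is).
    return "text"
```

The viewer could then pick a renderer per label, with raw text as the fallback.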

We can display the documents either in the same tab or in a separate tab. Another thing to test would be the ability to properly reconstruct a PDF from a string. For that, we would probably first have to develop a PDF/document loader component to check which representation is needed when serializing the document.

PhilippeMoussalli · 2023-11-16T15:52:31Z

We can render the original PDFs back in the viewer from either:

The raw text extracted by the llama index loaders. The advantage is that it is lightweight and requires no extra work, though most formatting will be lost (document structure, colors, tables, ...).
A base64 string: we encode the PDF in the document loading component and store it in an extra column, which would enable us to render it back. The disadvantage is that this is less lightweight, as we store the data twice in different formats (llama doc + base64).

For now we can support both options (we can check whether the string is an encoded PDF bytestring).
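A minimal sketch of the base64 option and the accompanying check, assuming the stored column holds the plain base64 encoding of the raw PDF bytes. The helper names `encode_pdf` and `is_base64_pdf` are hypothetical, and the check only inspects the file header, so it does not validate the rest of the document.

```python
import base64
import binascii

# Every PDF file starts with this header.
PDF_MAGIC = b"%PDF-"


def encode_pdf(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 string, e.g. for an extra column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def is_base64_pdf(value: str) -> bool:
    """Return True if the string looks like a base64-encoded PDF."""
    # 8 base64 characters decode to 6 bytes, enough to cover the header.
    head = value.strip()[:8]
    try:
        decoded = base64.b64decode(head, validate=True)
    except (binascii.Error, ValueError):
        # Not base64 (e.g. ordinary text with spaces or non-ASCII characters).
        return False
    return decoded.startswith(PDF_MAGIC)
```

A string that fails this check would fall back to being rendered from the raw extracted text.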

RobbeSneyders · 2023-11-30T22:56:25Z

Fixed in #666

PhilippeMoussalli mentioned this issue Nov 15, 2023

Extend data explorer for document-based data #568

Closed

github-project-automation bot added this to Fondant development Nov 15, 2023

github-project-automation bot moved this to Backlog in Fondant development Nov 15, 2023

PhilippeMoussalli self-assigned this Nov 22, 2023

janvanlooyml6 moved this from Backlog to In Progress in Fondant development Nov 24, 2023

PhilippeMoussalli moved this from In Progress to Done in Fondant development Nov 24, 2023

RobbeSneyders closed this as completed Nov 30, 2023
