Enable better view of documents #631

Closed
Tracked by #568
PhilippeMoussalli opened this issue Nov 15, 2023 · 2 comments

@PhilippeMoussalli
Contributor

PhilippeMoussalli commented Nov 15, 2023

Text in Fondant datasets is stored as plain strings, but we often need to render it in a format that is readable for the user:

  • Properly show basic formatting such as newlines
  • Show HTML documents
  • Show PDF documents
  • Show JSON

We can run some basic detection on the text: check whether it is an HTML document or JSON, and otherwise fall back to treating it as raw text that can be rendered as a PDF. The user can click on a dataset row, or maybe even search by ID, to find the document. An example of how this could look:

Screencast.2023-11-15.17.19.18.mp4
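
The basic detection described above could be sketched roughly as follows. This is only an illustration: `detect_document_type` is a hypothetical helper, not an existing Fondant function, and the heuristics (a JSON parse attempt, then a doctype or `<html>` tag near the start) are assumptions about what "basic detection" might mean here.

```python
import json


def detect_document_type(text: str) -> str:
    """Return "json", "html", or "text" for a stored string (heuristic only)."""
    stripped = text.strip()

    # JSON: only treat objects and arrays as documents, not bare numbers or strings.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass

    # HTML: look for a doctype or an <html> tag near the start of the string.
    head = stripped[:1024].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"

    # Default: raw text.
    return "text"
```

The viewer could then pick a renderer based on the returned label, with raw text as the fallback. HTML fragments without a doctype or `<html>` tag would be treated as raw text by this sketch.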

We can either display the documents in the same tab or in a separate tab. Another thing to test is the ability to properly reconstruct a PDF from a string. For that, we would probably have to develop a PDF/document loader component first, to determine which representation is required when serializing the document.

@PhilippeMoussalli
Contributor Author

PhilippeMoussalli commented Nov 16, 2023

We can render the original PDFs back in the viewer from either:

  • The raw text extracted by the LlamaIndex loaders. The advantage is that it's lightweight and does not require extra work, though most formatting will be lost (document structure, colors, tables, ...).
  • A base64 string: we encode the PDF in the document loading component and store it in an extra column, which enables us to render it back. The disadvantage is that it's less lightweight, as we're storing the data twice in different formats (llama doc + base64).

For now we can support both options (we can check whether the string is a PDF-encoded bytestring).
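
A minimal sketch of that check, assuming the PDF bytes are stored with standard base64 encoding. The names `encode_pdf` and `is_base64_pdf` are hypothetical, not part of Fondant. The check relies on the fact that PDF files start with the `%PDF-` header, so decoding only the first few characters is enough.

```python
import base64
import binascii

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def encode_pdf(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 string suitable for a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def is_base64_pdf(value: str) -> bool:
    """Return True if the string looks like a base64-encoded PDF."""
    try:
        # 8 base64 characters decode to 6 bytes, enough to cover the header.
        prefix = base64.b64decode(value[:8], validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 (e.g. raw text with spaces or non-ASCII characters).
        return False
    return prefix.startswith(PDF_MAGIC)
```

If `is_base64_pdf` returns `True`, the viewer can decode the full string with `base64.b64decode` and hand the bytes to a PDF renderer; otherwise it falls back to displaying the raw text.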

PhilippeMoussalli self-assigned this Nov 22, 2023
janvanlooyml6 moved this from Backlog to In Progress in Fondant development Nov 24, 2023
PhilippeMoussalli moved this from In Progress to Done in Fondant development Nov 24, 2023
@RobbeSneyders
Member

Fixed in #666
