PDF Loader for JVM based on Apache pdf-box #74

Merged
raulraja merged 3 commits into main from pdf-loader
May 18, 2023

Conversation

@raulraja (Contributor) commented May 17, 2023

This PR includes several changes focused on refactoring the BaseLoader interface and adding support for PDF document loading.

Refactoring of the BaseLoader interface: The loadAndSplit method in the BaseLoader interface now has a default implementation that loads the documents and then splits them. Classes implementing BaseLoader no longer need to override loadAndSplit themselves.

Removal of loadAndSplit method in ScrapeURLTextLoader and TextLoader: Following the refactoring of the BaseLoader interface, the loadAndSplit method has been removed from the ScrapeURLTextLoader and TextLoader classes as it is no longer necessary.
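
A minimal sketch of the default-method pattern described above, for readers who have not looked at the diff. The names TextSplitter, InMemoryLoader and sentenceSplitter are hypothetical and used only for illustration, documents are modelled as plain strings, and the functions are non-suspending to keep the sketch dependency-free. The actual xef signatures may differ.

```kotlin
// Hypothetical stand-in for a text splitter; not the xef API.
fun interface TextSplitter {
    fun split(text: String): List<String>
}

interface BaseLoader {
    // Each loader only has to say how its documents are loaded.
    fun load(): List<String>

    // Default implementation: load first, then split every document.
    // Implementing classes inherit this and need no override.
    fun loadAndSplit(splitter: TextSplitter): List<String> =
        load().flatMap { splitter.split(it) }
}

// Illustrative loader that returns documents already held in memory.
class InMemoryLoader(private val docs: List<String>) : BaseLoader {
    override fun load(): List<String> = docs
}

// Naive splitter that cuts on ". " and drops empty chunks.
val sentenceSplitter = TextSplitter { text ->
    text.split(". ").map { it.trim() }.filter { it.isNotEmpty() }
}

fun main() {
    val loader = InMemoryLoader(listOf("First page. Still first page", "Second page"))
    // prints [First page, Still first page, Second page]
    println(loader.loadAndSplit(sentenceSplitter))
}
```

With this shape, a loader such as the in-memory one above implements only load and gets loadAndSplit for free, which is the simplification the refactoring aims for.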

Addition of PDF support: A new module, xef-pdf, has been added to the project. It includes the PDFLoader class, which implements the BaseLoader interface for loading PDF documents using Apache PDFBox. The pdf function creates a ParameterlessAgent that loads and splits the content of a PDF file.

Example of PDF document loading: An example of using the new PDF document loading functionality has been added to the example module. The example loads a PDF document and uses an AI model to answer questions about the content of the document.

package com.xebia.functional.xef.auto

import com.xebia.functional.xef.pdf.pdf
import kotlinx.serialization.Serializable
import java.io.File

// Shape of the structured answer requested from the model.
@Serializable
data class AIResponse(val answer: String, val source: String)

suspend fun main() = ai {
  // Resolve the sample PDF bundled in the example module's resources.
  val file = AIResponse::class.java.getResource("/documents/doc.pdf").file
  // Load and split the PDF, then answer questions within that context.
  contextScope(pdf(file = File(file))) {
    while (true) {
      print("Enter your question: ")
      // Stop when the input stream ends.
      val line = readlnOrNull() ?: break
      val response: AIResponse = prompt(line)
      println("${response.answer}\n---\n${response.source}\n---\n")
    }
  }
}.getOrThrow()

Addition of a PDF document to the example module resources: A PDF document has been added to the resources of the example module for use in the new example of PDF document loading.

The PDF used in the example is https://xebia.com/wp-content/uploads/2022/01/Document-Post-Interview-Filip-Chyla-small.pdf, in which Filip Chyla talks about why he likes working at Xebia.

23:00:35.867 [main] DEBUG AutoAI -- [Get PDF content] Running
23:00:37.771 [DefaultDispatcher-worker-12] DEBUG AutoAI -- [Get PDF content] Found and memorized 17 docs
Enter your question: what is this content about?
The content is about Filip Chyla's experience as a new hire at Xebia, including the recruitment process, the company's guiding principles, and the freedom to develop in the direction of his choice.
---
Xebia Security website
---

Enter your question: How does Filip feel about working at Xebia?
Working here gives me the freedom to develop in the direction of my choice.
---
Filip Chyla, Xebia Security
---

Enter your question: What is Xebia?
Xebia is a company that values principles such as People First, Sharing Knowledge, Quality Without Compromise, and Customer Intimacy. They prioritize growth and development of their employees by allowing them to select their own assignments and work in the direction of their choice.
---
Filip Chyla's experience at Xebia
---

@raulraja (Contributor, Author) commented

@xebia-functional/team-ai ready for review

@franciscodr (Contributor) left a comment

Minor comment. Thanks, @raulraja

Co-authored-by: Francisco Diaz <francisco.d@47deg.com>
@raulraja raulraja merged commit 37c122d into main May 18, 2023
@raulraja raulraja deleted the pdf-loader branch May 18, 2023 11:32