Skip to content

A thin C and Rust wrappers over `mutool convert` that extract text from pdf into in-memory buffer.

Notifications You must be signed in to change notification settings

kkew3/muconvert_rust

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

muconvert: A thin C and Rust wrapper over mutool convert that extracts text from pdf

Introduction

mutool convert from https://mupdf.com/ seems to always write extracted text to a file. So I adapt its code a bit to read pdf text into an unsigned char buffer. I further add a thin Rust wrapper so that the same can be done from Rust side. All C and Rust sum up to less than 300 lines of codes.

Why another pdf extractor

I'm aware of a number of awesome Rust pdf text extractors like pdf-extract. There's also a rust binding of mupdf here. However, these libraries appear too large for my use case. That's the main reason why I start this project.

Build C

Please refer to the Makefile.

Build Rust binding

Not a perfect procedure, but basically:

  1. Install mupdf (e.g., on macOS, brew install mupdf)
  2. Adjust the Makefile to your platform (mine is macOS). You may also need to adjust build.rs.
  3. Run make dylib. This should produce libmuconvert.dylib on macOS.
  4. Run cargo build.

C usage

See muconvert-cli.c as an example.

Rust usage

let filename = "hello.pdf";
// Assign a large enough buffer for the pdf (NOTE below),
// and you'll be fine.
// A possible heuristic for the buffer size is the file
// size in bytes.
let buf: Vec<u8> = vec![0; 103977368];
let text = muconvert_rust::pdftotext(filename, false, true, buf)?;

NOTE: I know this is a bit awkward. Currently, in case of a buffer too small error, here is a possible solution:

use muconvert_rust::{Error, pdftotext};

let filename = "hello.pdf";
// A very small buffer.
let buf: Vec<u8> = vec![0; 100];
match pdftotext(filename, false, true, buf) {
    // Do something with the extracted text.
    Ok(text) => (),
    // In case of the buffer too small error,
    Err(Error::BufferTooSmall(len, mut buf)) => {
        // Extend the buffer.
        buf.resize(len, 0);
        // Retry, and do something with the text.
        let text = pdftotext(filename, false, true, buf).unwrap();
    }
    // Handle other errors.
    _ => (),
}

But I haven't yet found a way to keep the wrapper simple while avoiding such issue. Suggestions are welcome!

About

A thin C and Rust wrappers over `mutool convert` that extract text from pdf into in-memory buffer.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published