From 1a58bb8faf7e7f523cbec26a570203de76d684f8 Mon Sep 17 00:00:00 2001
From: "Thomas J. Leeper"
Date: Fri, 3 Jun 2016 11:19:12 +0100
Subject: [PATCH] completely redo locate_areas() (#8)

---
 DESCRIPTION           |   8 +--
 NEWS                  |   4 ++
 R/extract_tables.R    |   8 ++-
 R/locate_area.R       | 161 +++++++++++++++++++++++++++++-------------
 R/utils.R             |   6 +-
 man/extract_areas.Rd  |  17 +++--
 man/extract_tables.Rd |   8 ++-
 7 files changed, 149 insertions(+), 63 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 21a0303..c37de93 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: tabulizer
 Type: Package
 Title: Bindings for Tabula PDF Table Extractor Library
-Version: 0.1.17
-Date: 2016-06-01
+Version: 0.1.18
+Date: 2016-06-03
 Authors@R: c(person("Thomas J.", "Leeper", role = c("aut", "cre"),
                     email = "thosjleeper@gmail.com"),
              person("David", "Gohel", role = "ctb",
@@ -14,14 +14,14 @@ License: MIT + file LICENSE
 URL: https://github.com/leeper/tabulizer
 BugReports: https://github.com/leeper/tabulizer/issues
 Imports:
-    graphics,
-    grDevices,
     utils,
     tools,
     tabulizerjars,
     rJava,
     png
 Suggests:
+    graphics,
+    grDevices,
     testthat,
     knitr
 RoxygenNote: 5.0.1
diff --git a/NEWS b/NEWS
index d09b3f9..272a23a 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,7 @@
+# CHANGES TO tabulizer 0.1.18 #
+
+* Completely rewrite the `locate_areas()` interface to rely on graphics device event handling where possible. This may behave differently across platforms or in RStudio. (#8)
+
 # CHANGES TO tabulizer 0.1.17 #
 
 * Fixed a bug in `extract_tables()` such that when no tables are found, an empty list is returned (for `method` values with list response structures). (h/t Lincoln Mullen)
diff --git a/R/extract_tables.R b/R/extract_tables.R
index 491212c..2b92607 100644
--- a/R/extract_tables.R
+++ b/R/extract_tables.R
@@ -10,7 +10,13 @@
 #' @param password Optionally, a character string containing a user password to access a secured PDF.
 #' @param encoding Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of \code{\link[base]{Encoding}}.
 #' @param \dots These are additional arguments passed to the internal functions dispatched by \code{method}.
-#' @details This function mimics the behavior of the Tabula command line utility. It returns a list of R character matrices containing tables extracted from a file by default. This response behavior can be changed by using the following options. \code{method = "character"} returns a list of single-element character vectors, where each vector is a tab-delimited, line-separate string of concatenated table cells. \code{method = "data.frame"} attempts to coerce the structure returned by \code{method = "character"} into a list of data.frames and returns character strings where this fails. \code{method = "csv"} writes the tables to comma-separated (CSV) files using Tabula's CSVWriter method in the same directory as the original PDF. \code{method = "tsv"} does the same but with tab-separated (TSV) files using Tabula's TSVWriter and \code{method = "json"} does the same using Tabula's JSONWriter method. The previous three methods all return the path to the directory containing the extract table files. \code{method = "asis"} returns the Java object reference, which can be useful for debugging or for writing a custom parser.
+#' @details This function mimics the behavior of the Tabula command line utility. It returns a list of R character matrices containing tables extracted from a file by default. This response behavior can be changed by using the following options.
+#' \itemize{
+#' \item \code{method = "character"} returns a list of single-element character vectors, where each vector is a tab-delimited, line-separate string of concatenated table cells.
+#' \item \code{method = "data.frame"} attempts to coerce the structure returned by \code{method = "character"} into a list of data.frames and returns character strings where this fails.
+#' \item \code{method = "csv"} writes the tables to comma-separated (CSV) files using Tabula's CSVWriter method in the same directory as the original PDF. \code{method = "tsv"} does the same but with tab-separated (TSV) files using Tabula's TSVWriter and \code{method = "json"} does the same using Tabula's JSONWriter method. Any of these three methods return the path to the directory containing the extract table files.
+#' \item \code{method = "asis"} returns the Java object reference, which can be useful for debugging or for writing a custom parser.
+#' }
 #' \code{\link{extract_areas}} implements this functionality in an interactive mode allowing the user to specify extraction areas for each page.
 #' @return By default, a list of character matrices. This can be changed by specifying an alternative value of \code{method} (see Details).
 #' @references \href{http://tabula.technology/}{Tabula}
diff --git a/R/locate_area.R b/R/locate_area.R
index e73301e..0c44865 100644
--- a/R/locate_area.R
+++ b/R/locate_area.R
@@ -3,11 +3,18 @@
 #' @description Interactively identify areas and extract
 #' @param file A character string specifying the path to a PDF file. This can also be a URL, in which case the file will be downloaded to the R temporary directory using \code{download.file}.
 #' @param pages An optional integer vector specifying pages to extract from. To extract multiple tables from a given page, repeat the page number (e.g., \code{c(1,2,2,3)}).
-#' @param silent A logical indicating whether to silence the \code{\link[graphics]{locator}} function.
+#' @param resolution An integer specifying the resolution of the PNG images conversions. A low resolution is used by default to speed image loading.
 #' @param guess See \code{\link{extract_tables}} (note the different default value).
 #' @param \dots Other arguments passed to \code{\link{extract_tables}}.
-#' @details \code{extract_areas} is an interactive mode for \code{\link{extract_tables}} allowing the user to specify areas of each PDF page in a file that they would like extracted. In interactive mode, each page is rendered to a PNG file and displayed in an R graphics window sequentially, pausing on each page to call \code{\link[graphics]{locator}} so the user can specify two points (e.g., upper-left and lower-right) to define bounds of page area. \code{extract_areas} then passes these user-defined areas to \code{\link{extract_tables}}. \code{locate_areas} implements the interactive component only, without actually extracting; this might be useful for interactive work that needs some modification before executing \code{extract_tables} computationally.
-#' @note Currently, attempting to resize the graphics window at any point during this process will cause problems.
+#' @details \code{extract_areas} is an interactive mode for \code{\link{extract_tables}} allowing the user to specify areas of each PDF page in a file that they would like extracted. When used, each page is rendered to a PNG file and displayed in an R graphics window sequentially, pausing on each page to call \code{\link[graphics]{locator}} so the user can click and highlight an area to extract.
+#'
+#' The exact behaviour is a somewhat platform-dependent. If graphics events are supported, then it is possibly to interactively highlight a page region, make changes to that region, and navigate through the pages of the document while retaining the area highlighted on each page. If graphics events are not supported (e.g., in RStudio), then some of this functionality is not available (see below).
+#'
+#' In \emph{full functionality mode}, the first mouse click initializes a highlighting rectangle; the second click confirms it. If unsatisfied with the selection, the process can be repeated. The window also responds to keystrokes. \kbd{PgDn}, \kbd{Right}, and \kbd{Down} advance to the next page image, while \kbd{PgUp}, \kbd{Left}, and \kbd{Up} return to the previous page image. \kbd{Home} returns to the first page image and \kbd{End} advances to the final page image. \kbd{Q} quits the interactive mode and proceeds with extraction. When navigating between pages, any selected areas will be displayed and can be edited. \kbd{Delete} removes a highlighted area from a page (and then displays it again).
+#'
+#' In \emph{reduced functionality mode}, the interface requires users to indicate the upper-left and lower-right (or upper-right and lower-left) corners of an area on each page, this area will be briefly confirmed with a highlighted rectangle and the next page will be displayed. Dynamic page navigation and area editing are not possible.
+#'
+#' In either mode, after the areas are selected, \code{extract_areas} passes these user-defined areas to \code{\link{extract_tables}}. \code{locate_areas} implements the interactive component only, without actually extracting; this might be useful for interactive work that needs some modification before executing \code{extract_tables} computationally.
 #' @return For \code{extract_areas}, see \code{\link{extract_tables}}. For \code{locate_areas}, a list of four-element numeric vectors (top,left,bottom,right), one per page of the file.
 #' @author Thomas J. Leeper
 #' @examples
@@ -28,7 +35,7 @@
 #' @importFrom grDevices dev.capabilities dev.off
 #' @importFrom graphics par rasterImage locator plot
 #' @export
-locate_areas <- function(file, pages = NULL, silent = TRUE) {
+locate_areas <- function(file, pages = NULL, resolution = 60L) {
     if (!interactive()) {
         stop("locate_areas() is only available in an interactive session")
     } else {
@@ -39,15 +46,38 @@ locate_areas <- function(file, pages = NULL, silent = TRUE) {
     file <- localize_file(file, copy = TRUE)
     on.exit(unlink(file), add = TRUE)
     dims <- get_page_dims(file, pages = pages)
-    paths <- make_thumbnails(file, outdir = tempdir(), pages = pages, format = "png")
+    paths <- make_thumbnails(file, outdir = tempdir(), pages = pages, format = "png", resolution = resolution)
     on.exit(unlink(paths), add = TRUE)
 
-    areas <- list()
-    for (i in seq_along(paths)) {
+    areas <- rep(list(NULL), length(paths))
+    i <- 1
+    warnThisTime <- TRUE
+    while (TRUE) {
         if (!is.na(paths[i])) {
-            areas[[i]] <- try_area(file = paths[[i]], dims = dims[[i]])
-        } else {
-            areas[[i]] <- NA_real_
+            a <- try_area(file = paths[i], dims = dims[[i]], area = areas[[i]], warn = warnThisTime)
+            warnThisTime <- FALSE
+            if (!is.null(a[["area"]])) {
+                areas[[i]] <- a[["area"]]
+            }
+            if (tolower(a[["key"]]) %in% c("del", "delete", "ctrl-h")) {
+                areas[i] <- list(NULL)
+                next
+            } else if (tolower(a[["key"]]) %in% c("home")) {
+                i <- 1
+                next
+            } else if (tolower(a[["key"]]) %in% c("end")) {
+                i <- length(paths)
+                next
+            } else if (tolower(a[["key"]]) %in% c("pgup", "page_up", "up", "left")) {
+                i <- if (i == 1) 1 else i - 1
+                next
+            } else if (tolower(a[["key"]]) %in% c("q")) {
+                break
+            }
+        }
+        i <- i + 1
+        if (i > length(paths)) {
+            break
         }
     }
     return(areas)
@@ -60,43 +90,78 @@ extract_areas <- function(file, pages = NULL, guess = FALSE, ...) {
     extract_tables(file = file, pages = pages, area = areas, guess = guess, ...)
 }
 
-try_area <- function(file, dims) {
-    deviceCoord <- "nfc"
-    cairoDevice::Cairo(width = dims[1], height = dims[2], pointsize = 12, surface = "screen")
-    if (dev.capabilities()[["rasterImage"]] != "yes") {
-        stop("Graphics device does not support rasterImage plotting")
+try_area <- function(file, dims, area = NULL, warn = FALSE) {
+    deviceUnits <- "nfc"
+    if (Sys.info()["sysname"] == "Darwin") {
+        grDevices::X11(type = "xlib")
+    }
+    if (grDevices::dev.capabilities()[["rasterImage"]] != "yes") {
+        stop("Graphics device does not support rasterImage() plotting")
     }
     thispng <- readPNG(file, native = TRUE)
     drawPage <- function() {
         graphics::plot(c(0, dims[1]), c(0, dims[2]), type = "n", xlab = "", ylab = "", asp = 1)
         graphics::rasterImage(thispng, 0, 0, dims[1], dims[2])
     }
-
-    pre_par <- par(mar=c(0,0,0,0), xaxs = "i", yaxs = "i", bty = "n")
-    on.exit(par(pre_par), add = TRUE)
+    drawRectangle <- function() {
+        if (!is.null(endx)) {
+            graphics::rect(startx, starty, endx, endy, col = grDevices::rgb(1,0,0,.2) )
+        }
+    }
+
+    pre_par <- graphics::par(mar=c(0,0,0,0), xaxs = "i", yaxs = "i", bty = "n")
+    on.exit(graphics::par(pre_par), add = TRUE)
     drawPage()
-    on.exit(dev.off(), add = TRUE)
-
+    on.exit(grDevices::dev.off(), add = TRUE)
+
+    if (!length(grDevices::dev.capabilities()[["events"]])) {
+        if (warn) {
+            message("Graphics device does not support event handling...\n",
+                    "Entering reduced functionality mode.\n",
+                    "Click upper-left and then lower-right corners of area.")
+        }
+        tmp <- locator(2)
+        graphics::rect(tmp$x[1], tmp$y[1], tmp$x[2], tmp$y[2], col = grDevices::rgb(1,0,0,.5))
+        Sys.sleep(2)
+
+        # convert to: top,left,bottom,right
+        area <- c(dims[2] - max(tmp$y), min(tmp$x), dims[2] - min(tmp$y), max(tmp$x))
+        return(list(key = "right", area = area))
+    }
+
     clicked <- FALSE
-    startx <- 0
-    starty <- 0
-    endx <- 0
-    endy <- 0
+    lastkey <- NA_character_
+    if (!length(area)) {
+        startx <- NULL
+        starty <- NULL
+        endx <- NULL
+        endy <- NULL
+    } else {
+        showArea <- function() {
+            # convert from: top,left,bottom,right
+            startx <<- area[2]
+            starty <<- dims[2] - area[1]
+            endx <<- area[4]
+            endy <<- dims[2] - area[3]
+            drawRectangle()
+        }
+        showArea()
+    }
 
     devset <- function() {
-        if (dev.cur() != eventEnv$which) dev.set(eventEnv$which)
+        if (grDevices::dev.cur() != eventEnv$which) grDevices::dev.set(eventEnv$which)
     }
 
     mousedown <- function(buttons, x, y) {
         devset()
         if (clicked) {
-            endx <<- graphics::grconvertX(x, deviceCoord, "user")
-            endy <<- graphics::grconvertY(y, deviceCoord, "user")
+            endx <<- graphics::grconvertX(x, deviceUnits, "user")
+            endy <<- graphics::grconvertY(y, deviceUnits, "user")
             clicked <<- FALSE
             eventEnv$onMouseMove <- NULL
         } else {
-            startx <<- graphics::grconvertX(x, deviceCoord, "user")
-            starty <<- graphics::grconvertY(y, deviceCoord, "user")
+            startx <<- graphics::grconvertX(x, deviceUnits, "user")
+            starty <<- graphics::grconvertY(y, deviceUnits, "user")
             clicked <<- TRUE
             eventEnv$onMouseMove <- dragmousemove
         }
@@ -106,10 +171,10 @@ try_area <- function(file, dims) {
     dragmousemove <- function(buttons, x, y) {
         devset()
         if (clicked) {
-            endx <<- graphics::grconvertX(x, "nfc", "user")
-            endy <<- graphics::grconvertY(y, "nfc", "user")
+            endx <<- graphics::grconvertX(x, deviceUnits, "user")
+            endy <<- graphics::grconvertY(y, deviceUnits, "user")
             drawPage()
-            graphics::rect(startx, starty, endx, endy, col = rgb(1,0,0,.2) )
+            drawRectangle()
         }
         NULL
     }
@@ -117,31 +182,27 @@ try_area <- function(file, dims) {
     keydown <- function(key) {
         devset()
         eventEnv$onMouseMove <- NULL
+        lastkey <<- key
         TRUE
     }
 
-    p <- "Click and drag to select a table area, press any key to confirm"
-    grDevices::setGraphicsEventHandlers(
-        prompt = p,
-        onMouseDown = mousedown,
-        onKeybd = keydown)
+    p <- "Click and drag to select a table area. Press for next page or to quit."
+    grDevices::setGraphicsEventHandlers(prompt = p,
+                                        onMouseDown = mousedown,
+                                        onKeybd = keydown)
     eventEnv <- grDevices::getGraphicsEventEnv()
     grDevices::getGraphicsEvent()
 
     backToPageSize <- function() {
-        width <- dims[1]
-        height <- dims[2]
-        x1 <- graphics::grconvertX(startx, "user", "nfc")
-        y1 <- graphics::grconvertY(starty, "user", "nfc")
-        x2 <- graphics::grconvertX(endx, "user", "nfc")
-        y2 <- graphics::grconvertY(endy, "user", "nfc")
-        # convert to: top,left,bottom,right
-        c(top = height - (max(c(y1, y2)) * height),
-          left = min(c(x1,x2)) * width,
-          bottom = height - (min(c(y1,y2)) * height),
-          right = max(c(x1,x2)) * width
-        )
+        if (!is.null(startx)) {
+            c(top = dims[2] - max(c(starty, endy)),
+              left = min(c(startx,endx)),
+              bottom = dims[2] - (min(c(starty,endy))),
+              right = max(c(startx,endx)) )
+        } else {
+            NULL
+        }
     }
 
-    return(backToPageSize())
+    return(list(key = lastkey, area = backToPageSize()))
 }
diff --git a/R/utils.R b/R/utils.R
index 4bb41fd..7a084f3 100644
--- a/R/utils.R
+++ b/R/utils.R
@@ -60,7 +60,11 @@ make_area <- function(area = NULL, pages = NULL, npages = NULL) {
             }
         }
         area <- lapply(area, function(x) {
-            new(J("technology.tabula.Rectangle"), .jfloat(x[1]), .jfloat(x[2]), .jfloat(x[4]-x[2]), .jfloat(x[3]-x[1]))
+            if (!is.null(x)) {
+                new(J("technology.tabula.Rectangle"), .jfloat(x[1]), .jfloat(x[2]), .jfloat(x[4]-x[2]), .jfloat(x[3]-x[1]))
+            } else {
+                NULL
+            }
         })
     }
     area
diff --git a/man/extract_areas.Rd b/man/extract_areas.Rd
index 5a36662..18b22de 100644
--- a/man/extract_areas.Rd
+++ b/man/extract_areas.Rd
@@ -5,7 +5,7 @@
 \alias{locate_areas}
 \title{extract_areas}
 \usage{
-locate_areas(file, pages = NULL, silent = TRUE)
+locate_areas(file, pages = NULL, resolution = 60L)
 
 extract_areas(file, pages = NULL, guess = FALSE, ...)
 }
@@ -14,7 +14,7 @@ extract_areas(file, pages = NULL, guess = FALSE, ...)
 
 \item{pages}{An optional integer vector specifying pages to extract from. To extract multiple tables from a given page, repeat the page number (e.g., \code{c(1,2,2,3)}).}
 
-\item{silent}{A logical indicating whether to silence the \code{\link[graphics]{locator}} function.}
+\item{resolution}{An integer specifying the resolution of the PNG images conversions. A low resolution is used by default to speed image loading.}
 
 \item{guess}{See \code{\link{extract_tables}} (note the different default value).}
 
@@ -27,10 +27,15 @@ For \code{extract_areas}, see \code{\link{extract_tables}}. For \code{locate_are
 Interactively identify areas and extract
 }
 \details{
-\code{extract_areas} is an interactive mode for \code{\link{extract_tables}} allowing the user to specify areas of each PDF page in a file that they would like extracted. In interactive mode, each page is rendered to a PNG file and displayed in an R graphics window sequentially, pausing on each page to call \code{\link[graphics]{locator}} so the user can specify two points (e.g., upper-left and lower-right) to define bounds of page area. \code{extract_areas} then passes these user-defined areas to \code{\link{extract_tables}}. \code{locate_areas} implements the interactive component only, without actually extracting; this might be useful for interactive work that needs some modification before executing \code{extract_tables} computationally.
-}
-\note{
-Currently, attempting to resize the graphics window at any point during this process will cause problems.
+\code{extract_areas} is an interactive mode for \code{\link{extract_tables}} allowing the user to specify areas of each PDF page in a file that they would like extracted. When used, each page is rendered to a PNG file and displayed in an R graphics window sequentially, pausing on each page to call \code{\link[graphics]{locator}} so the user can click and highlight an area to extract.
+
+The exact behaviour is a somewhat platform-dependent. If graphics events are supported, then it is possibly to interactively highlight a page region, make changes to that region, and navigate through the pages of the document while retaining the area highlighted on each page. If graphics events are not supported (e.g., in RStudio), then some of this functionality is not available (see below).
+
+In \emph{full functionality mode}, the first mouse click initializes a highlighting rectangle; the second click confirms it. If unsatisfied with the selection, the process can be repeated. The window also responds to keystrokes. \kbd{PgDn}, \kbd{Right}, and \kbd{Down} advance to the next page image, while \kbd{PgUp}, \kbd{Left}, and \kbd{Up} return to the previous page image. \kbd{Home} returns to the first page image and \kbd{End} advances to the final page image. \kbd{Q} quits the interactive mode and proceeds with extraction. When navigating between pages, any selected areas will be displayed and can be edited. \kbd{Delete} removes a highlighted area from a page (and then displays it again).
+
+In \emph{reduced functionality mode}, the interface requires users to indicate the upper-left and lower-right (or upper-right and lower-left) corners of an area on each page, this area will be briefly confirmed with a highlighted rectangle and the next page will be displayed. Dynamic page navigation and area editing are not possible.
+
+In either mode, after the areas are selected, \code{extract_areas} passes these user-defined areas to \code{\link{extract_tables}}. \code{locate_areas} implements the interactive component only, without actually extracting; this might be useful for interactive work that needs some modification before executing \code{extract_tables} computationally.
 }
 \examples{
 \dontrun{
diff --git a/man/extract_tables.Rd b/man/extract_tables.Rd
index a5b0f4f..7d055b5 100644
--- a/man/extract_tables.Rd
+++ b/man/extract_tables.Rd
@@ -36,7 +36,13 @@ By default, a list of character matrices. This can be changed by specifying an a
 Extract tables from a file
 }
 \details{
-This function mimics the behavior of the Tabula command line utility. It returns a list of R character matrices containing tables extracted from a file by default. This response behavior can be changed by using the following options. \code{method = "character"} returns a list of single-element character vectors, where each vector is a tab-delimited, line-separate string of concatenated table cells. \code{method = "data.frame"} attempts to coerce the structure returned by \code{method = "character"} into a list of data.frames and returns character strings where this fails. \code{method = "csv"} writes the tables to comma-separated (CSV) files using Tabula's CSVWriter method in the same directory as the original PDF. \code{method = "tsv"} does the same but with tab-separated (TSV) files using Tabula's TSVWriter and \code{method = "json"} does the same using Tabula's JSONWriter method. The previous three methods all return the path to the directory containing the extract table files. \code{method = "asis"} returns the Java object reference, which can be useful for debugging or for writing a custom parser.
+This function mimics the behavior of the Tabula command line utility. It returns a list of R character matrices containing tables extracted from a file by default. This response behavior can be changed by using the following options.
+\itemize{
+  \item \code{method = "character"} returns a list of single-element character vectors, where each vector is a tab-delimited, line-separate string of concatenated table cells.
+  \item \code{method = "data.frame"} attempts to coerce the structure returned by \code{method = "character"} into a list of data.frames and returns character strings where this fails.
+  \item \code{method = "csv"} writes the tables to comma-separated (CSV) files using Tabula's CSVWriter method in the same directory as the original PDF. \code{method = "tsv"} does the same but with tab-separated (TSV) files using Tabula's TSVWriter and \code{method = "json"} does the same using Tabula's JSONWriter method. Any of these three methods return the path to the directory containing the extract table files.
+  \item \code{method = "asis"} returns the Java object reference, which can be useful for debugging or for writing a custom parser.
+}
 \code{\link{extract_areas}} implements this functionality in an interactive mode allowing the user to specify extraction areas for each page.
 }
 \examples{