The Wisconsin Ads Project (now at Wesleyan) archives data on televised presidential, gubernatorial, and congressional ads collected by Kantar Media. The data include flattened storyboards of each political ad. For the years 2000 and 2002 (gubernatorial ads), these storyboards are PDFs of static images. (Since 2004, the storyboards have included an extractable text layer. The script for extracting the text layer using PyPdf can be found here.)
Below are the steps for getting text from static image storyboards using abbyyR.
To get started, load the package. The latest version of the package will always be on GitHub. Instructions for installing the package from GitHub are provided below.
library(abbyyR)
Your first task after loading the package should be to set the credentials: the application ID and password. If you haven't already, you can get this information from http://ocrsdk.com/. Once you have the application ID and password, set them via the setapp function.
# setapp(c("factbook", "7YVBc8E6xMricoTwp0mF0aH"))
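To keep credentials out of the script itself, one option is to read them from environment variables. The following is a minimal sketch under stated assumptions: the variable names `ABBYY_APP_ID` and `ABBYY_APP_PASS` and the helper `read_abbyy_credentials()` are hypothetical and not part of abbyyR. Only the `setapp(c(id, password))` call form is taken from the line above.

```r
# Hypothetical helper: read the application ID and password from
# environment variables (the names are illustrative, not defined by abbyyR).
read_abbyy_credentials <- function(id_var = "ABBYY_APP_ID",
                                   pass_var = "ABBYY_APP_PASS") {
  creds <- c(Sys.getenv(id_var), Sys.getenv(pass_var))
  # Sys.getenv() returns "" for variables that are not set
  if (any(creds == "")) {
    stop("Set ", id_var, " and ", pass_var, " before calling setapp().")
  }
  creds
}

# Usage (requires abbyyR to be loaded and both variables to be set):
# setapp(read_abbyy_credentials())
```

The variables can be set for the session with `Sys.setenv()` or stored in a user-level `.Renviron` file.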
Some of you may want to start by deleting all existing tasks in an application.
"
all_tasks <- listTasks()
for (i in seq_len(nrow(all_tasks))) deleteTask(all_tasks$id[i])
"
# Set path to directory with all the images
path_to_img_dir <- paste0(path.package("abbyyR"),"/inst/extdata/wisc_ads/")
total_files <- length(dir(path_to_img_dir))
# Iterate through the files and submit all the images
# Monitor progress via progress bar package
library(progress)
pb <- progress_bar$new(format = " submitting [:bar] :percent\n",
                       total = total_files,
                       clear = FALSE, width = 60)
# Abbyy Fine API doesn't keep the file name so we have to keep track of it locally
tracker <- data.frame(filename=NA, taskid=NA)
# Loop
j <- 1
for (i in dir(path_to_img_dir)) {
  # Assuming only 1 dot in the file name
  tracker[j, ] <- c(unlist(strsplit(basename(i), "[.]"))[1],
                    submitImage(file_path = paste0(path_to_img_dir, i))$id)
  j <- j + 1
  # Advance the progress bar
  pb$tick()
  Sys.sleep(1/100)
}
# Start OCR processing for every submitted image
for (i in seq_len(nrow(tracker))) processDocument(tracker$taskid[i])
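The loop above assumes each file name contains exactly one dot. As a hedged illustration of what happens otherwise, the sketch below compares that approach with `tools::file_path_sans_ext()`, which drops only the final extension. The file names are made up for the example.

```r
# Hypothetical file names, including one with more than one dot.
files <- c("ad_001.pdf", "gov.2002.ad_17.pdf", "senate-ad.PDF")

# Approach used in the loop above: keep everything before the first dot.
first_chunk <- vapply(strsplit(basename(files), "[.]"),
                      function(parts) parts[1], character(1))

# Alternative: drop only the final extension.
stems <- tools::file_path_sans_ext(basename(files))

print(data.frame(file = files, first_chunk = first_chunk, stem = stems))
```

With multi-dot names, the first approach can map several files to the same tracker entry, so the second form may be the safer choice when naming is not under your control.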
You can either wait and check manually, or poll every few seconds to check the status like so:
"
i <- 0
while (i < total_files) {
  i <- nrow(listFinishedTasks())
  if (i == total_files) {
    print('All Done!')
    break
  }
  Sys.sleep(5)
}
"
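A polling loop like the one above runs forever if a task fails and never reaches the finished list. One way to guard against that is to add a timeout. The sketch below is an illustration only: `wait_for_tasks()` is a hypothetical helper, not an abbyyR function, and it takes the counting step as an argument so it can be tried without calling the API.

```r
# Hypothetical helper: poll until `expected` tasks are finished or
# `timeout` seconds have passed. `count_finished` is any zero-argument
# function that returns the number of finished tasks.
wait_for_tasks <- function(count_finished, expected,
                           interval = 5, timeout = 600) {
  deadline <- Sys.time() + timeout
  repeat {
    done <- count_finished()
    if (done >= expected) return(TRUE)          # everything finished
    if (Sys.time() >= deadline) return(FALSE)   # gave up waiting
    Sys.sleep(interval)
  }
}

# Usage with abbyyR (assumes credentials are set and tasks were submitted):
# all_done <- wait_for_tasks(function() nrow(listFinishedTasks()), total_files)
```

The function returns `FALSE` rather than raising an error on timeout, so the caller can decide whether to retry or download whatever has completed.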
Next, set up an output folder and download all the completed files into it.
setwd(paste0(path.package("abbyyR"),"/inst/extdata/wisc_out/"))
finishedlist <- listFinishedTasks()
results <- merge(tracker, finishedlist, by.x="taskid", by.y="id")
library(curl)
# Download each result, naming the file after the original image
for (i in seq_len(nrow(results))) {
  curl_download(results$resultUrl[i], destfile = results$filename[i])
}
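The `merge()` step is what reconnects the locally tracked file names to the task IDs returned by the service. The sketch below shows that join on toy data. All values are made up, and the `.txt` extension is an assumption about the export format, not something stated above.

```r
# Toy stand-ins for the two tables; all values are made up.
tracker_demo <- data.frame(filename = c("ad_001", "ad_002", "ad_003"),
                           taskid   = c("t-1", "t-2", "t-3"),
                           stringsAsFactors = FALSE)
finished_demo <- data.frame(id        = c("t-3", "t-1"),
                            resultUrl = c("https://example.com/3",
                                          "https://example.com/1"),
                            stringsAsFactors = FALSE)

# Inner join on task ID: only finished tasks survive the merge.
results_demo <- merge(tracker_demo, finished_demo,
                      by.x = "taskid", by.y = "id")

# Tasks that were submitted but have no finished counterpart yet.
pending <- setdiff(tracker_demo$taskid, finished_demo$id)

# Output file names; the ".txt" extension is an assumption.
results_demo$destfile <- paste0(results_demo$filename, ".txt")

print(results_demo)
print(pending)
```

Because the merge is an inner join, any task that has not finished is silently dropped from `results`, so checking `pending` before downloading can help catch missing files.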