Using R to accept cookies to download a PDF file
I'm getting stuck on cookies when trying to download a PDF.
For example, I have a DOI for a PDF document on the Archaeology Data Service. The DOI resolves to a landing page with an embedded link to the PDF, which in turn redirects to another link.
library(httr) can handle resolving the DOI, and I can extract the PDF URL from the landing page using library(XML), but I'm stuck at getting the PDF itself.
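For reference, that first part looks roughly like this (the DOI URL below is a placeholder for the real one, and the XPath is only illustrative; the actual attributes of the download link on the landing page may differ):

    library(httr)
    library(XML)

    # placeholder: substitute the actual DOI URL here
    doi_url <- "http://dx.doi.org/10.xxxx/xxxxx"

    # httr follows the redirects from the DOI to the landing page
    landing <- GET(doi_url)

    # parse the landing page and pull out the href of the PDF link
    # (illustrative XPath; adjust to the page's actual markup)
    doc <- htmlParse(content(landing, "text"), asText = TRUE)
    pdf_url <- xpathSApply(doc, "//a[contains(@href, '.pdf')]/@href")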
If I do this:

    download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf",
                  destfile = "tmp.pdf")

then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
Adapting the answer to "How to use R to download a zipped file from an SSL page that requires cookies" leads me to this:
    library(httr)

    terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
    download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
    values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

    # accept the terms on the form,
    # generating the appropriate cookies
    POST(terms, body = values)
    GET(download, query = values)

    # actually download the file (this takes a while)
    resp <- GET(download, query = values)

    # write the content of the download to a binary file
    writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET functions I just get the HTML of the same cookie page that I got with download.file:
    > GET(download, query = values)
    Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
      Date: 2016-01-06 00:35
      Status: 200
      Content-Type: text/html;charset=UTF-8
      Size: 21 kB
    <?xml version='1.0' encoding='UTF-8' ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; c...
        <title>Archaeology Data Service: myADS</title>
        <link href="http://archaeologydataservice.ac.uk/css/u...
    ...

Looking at http://archaeologydataservice.ac.uk/about/cookies, it seems the cookie situation at this site is complicated. This kind of cookie complexity does not seem to be unusual for UK data providers, see for example: Automating the login to the UK Data Service website in R with RCurl or httr.
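For completeness, the explicit-handle variant suggested by those questions, sharing one handle so any cookies set by the POST are sent with the GET, would be a sketch like this (nothing here is confirmed to satisfy this particular site; terms, download and values are as defined above):

    library(httr)

    # one handle for the host, so cookies persist across requests
    h <- handle("http://archaeologydataservice.ac.uk")

    POST(terms, body = values, handle = h)
    resp <- GET(download, query = values, handle = h)

    # inspect which cookies were actually set
    cookies(resp)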
How can I use R to get past the cookies on this website?
Your plea on rOpenSci has been heard!

There's lots of JavaScript between those pages, which makes it annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2 with R 3.2.3 and Firefox loaded.
    library(RSelenium)

    # check if a Selenium server is present; if not, get one
    checkForServer()

    # get the server going
    startServer()

    dir.create("~/justcreateddir")
    setwd("~/justcreateddir")

    # we need PDFs to download instead of displaying in-browser
    prefs <- makeFirefoxProfile(list(
      `browser.download.folderList` = as.integer(2),
      `browser.download.dir` = getwd(),
      `pdfjs.disabled` = TRUE,
      `plugin.scan.plid.all` = FALSE,
      `plugin.scan.Acrobat` = "99.0",
      `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
    ))

    # get the browser going
    dr <- remoteDriver$new(extraCapabilities = prefs)
    dr$open()

    # go to the page with the PDF
    dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

    # find the PDF link and "hit Enter"
    pdf_elem <- dr$findElement(using = "css selector", "a.dlb3")
    pdf_elem$sendKeysToElement(list("\uE007"))

    # find the accept button and "hit Enter"
    # this saves the PDF to the default downloads directory
    accept_elem <- dr$findElement(using = "css selector", "a[id$='agreeButton']")
    accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while the file downloads, so it is easy to close the session accidentally before the download has completed.
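One simple guard against that is to poll the download directory for the file before closing. This is only a sketch: the file name ("GL44004.pdf") and the 60-second timeout are assumptions.

    # wait (up to ~60 s) until the PDF appears and is non-empty before closing
    pdf_path <- file.path(getwd(), "GL44004.pdf")
    for (i in seq_len(60)) {
      if (file.exists(pdf_path) && file.size(pdf_path) > 0) break
      Sys.sleep(1)
    }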
    # close the session
    dr$close()