Using R to accept cookies to download a PDF file
I'm getting stuck on cookies when trying to download a PDF.
For example, I have a DOI for a PDF document on the Archaeology Data Service. The DOI resolves to a landing page with an embedded link to the PDF, which in turn redirects to another link.
library(httr) can handle resolving the DOI, and I can extract the PDF URL from the landing page using library(XML), but I'm stuck at getting the PDF itself.
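For reference, that first part looks roughly like this (the DOI URL below is a placeholder for the real one, and the XPath is only illustrative; the actual attributes of the download link on the landing page may differ):

    library(httr)
    library(XML)

    # placeholder: substitute the actual DOI URL here
    doi_url <- "http://dx.doi.org/10.xxxx/xxxxx"

    # httr follows the redirects from the DOI to the landing page
    landing <- GET(doi_url)

    # parse the landing page and pull out the href of the PDF link
    # (illustrative XPath; adjust to the page's actual markup)
    doc <- htmlParse(content(landing, "text"), asText = TRUE)
    pdf_url <- xpathSApply(doc, "//a[contains(@href, '.pdf')]/@href")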
If I do this:

    download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf",
                  destfile = "tmp.pdf")

then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
Adapting the answer to "How to use R to download a zipped file from an SSL page that requires cookies" leads me to this:
    library(httr)

    terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
    download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
    values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

    # accept the terms on the form,
    # generating the appropriate cookies
    POST(terms, body = values)
    GET(download, query = values)

    # actually download the file (this takes a while)
    resp <- GET(download, query = values)

    # write the content of the download to a binary file
    writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET functions I just get the HTML of the same cookie page that I got with download.file:
    > GET(download, query = values)
    Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
      Date: 2016-01-06 00:35
      Status: 200
      Content-Type: text/html;charset=UTF-8
      Size: 21 kB
    <?xml version='1.0' encoding='UTF-8' ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; c...
        <title>Archaeology Data Service: myADS</title>
        <link href="http://archaeologydataservice.ac.uk/css/u...
    ...

Looking at http://archaeologydataservice.ac.uk/about/cookies, it seems the cookie situation at this site is complicated. This kind of cookie complexity does not seem to be unusual for UK data providers, see for example: Automating the login to the UK Data Service website in R with RCurl or httr.
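For completeness, the explicit-handle variant suggested by those questions, sharing one handle so any cookies set by the POST are sent with the GET, would be a sketch like this (nothing here is confirmed to satisfy this particular site; terms, download and values are as defined above):

    library(httr)

    # one handle for the host, so cookies persist across requests
    h <- handle("http://archaeologydataservice.ac.uk")

    POST(terms, body = values, handle = h)
    resp <- GET(download, query = values, handle = h)

    # inspect which cookies were actually set
    cookies(resp)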
How can I use R to get past the cookies on this website?
Your plea on rOpenSci has been heard!

There's lots of JavaScript between those pages, which makes it annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2 with R 3.2.3 and Firefox loaded.
    library(RSelenium)

    # check if a Selenium server is present; if not, get one
    checkForServer()

    # get the server going
    startServer()

    dir.create("~/justcreateddir")
    setwd("~/justcreateddir")

    # we need PDFs to download instead of displaying in-browser
    prefs <- makeFirefoxProfile(list(
      `browser.download.folderList` = as.integer(2),
      `browser.download.dir` = getwd(),
      `pdfjs.disabled` = TRUE,
      `plugin.scan.plid.all` = FALSE,
      `plugin.scan.Acrobat` = "99.0",
      `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
    ))

    # get the browser going
    dr <- remoteDriver$new(extraCapabilities = prefs)
    dr$open()

    # go to the page with the PDF
    dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

    # find the PDF link and "hit Enter"
    pdf_elem <- dr$findElement(using = "css selector", "a.dlb3")
    pdf_elem$sendKeysToElement(list("\uE007"))

    # find the accept button and "hit Enter"
    # this saves the PDF to the default downloads directory
    accept_elem <- dr$findElement(using = "css selector", "a[id$='agreeButton']")
    accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while the file downloads, so it is easy to close the session accidentally before the download has completed.
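One simple guard against that is to poll the download directory for the file before closing. This is only a sketch: the file name ("GL44004.pdf") and the 60-second timeout are assumptions.

    # wait (up to ~60 s) until the PDF appears and is non-empty before closing
    pdf_path <- file.path(getwd(), "GL44004.pdf")
    for (i in seq_len(60)) {
      if (file.exists(pdf_path) && file.size(pdf_path) > 0) break
      Sys.sleep(1)
    }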
    # close the session
    dr$close()