r - How to scrape multiple tables indexing both year+page?
Thanks to this post: https://stackoverflow.com/a/7775721/7140722
But how do I scrape multiple years?
This is the structure of the query:
http://aviation-safety.net/database/dblist.php?year=1994&lang=&page=1
http://aviation-safety.net/database/dblist.php?year=1994&lang=&page=2
I want to scrape more years. My code:

    year <- 1990:1994
    url1 = 'http://aviation-safety.net/database/dblist.php?year='
    url3 = '&lang=&page='
    getpage <- function(page){
      require(XML)
      url = paste(url1, year, url3, page, sep = "")
      tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
      return(tab)
    }
    pages = llply(1:3, getpage, .progress = 'text')
    crash_all_years = do.call('rbind', pages)

But it doesn't work. Any suggestions?
I think it is better to construct a list of URLs first, and then loop over that list with lapply (or ldply from the plyr package) to scrape the pages.
You can improve your code as follows:
    # load the 'XML' package
    library(XML)

    # set the variables needed to construct the urls
    years <- 1990:1994
    url_1 <- 'http://aviation-safety.net/database/dblist.php?year='
    url_3 <- '&lang=&page='
    pages <- 1:2

    # construct a list of the pages to scrape
    yp <- expand.grid(pages, years)
    urls <- sprintf('%s%s%s%s', url_1, yp[[2]], url_3, yp[[1]])

    # simplified scrape function
    getpage <- function(u){
      readHTMLTable(u, stringsAsFactors = FALSE)[[1]]
    }

    # loop over the list of urls and scrape each one
    plst <- lapply(urls, getpage)

    # bind the resulting list of dataframes together into one dataframe
    pages.df <- do.call(rbind, plst)
which gives a dataframe of the airplane crashes on the first two pages of each year from 1990 to 1994:
    > head(pages.df)
         date        type                             registration operator                                fat. location             pic cat
    1    02-jan-1990 casa/nurtanio nc-212 aviocar 200 pk-pcm       pelita air service                      9    banten bay, ...          a1
    2    03-jan-1990 bn-2a trislander mk.iii          yj-rv3       vanair                                  0    near port vila-ba...     a1
    3    04-jan-1990 swearingen sa227-ac metro iii    n31138       chautauqua airlines, opf. usair express 0    hagerstown, md           o1
    4    05-jan-1990 lockheed l-100-30 hercules       d2-thb       angola air charter                      0    menongue air...          c1
    5    05-jan-1990 fokker f-28 fellowship 4000      lv-mzd       aerolineas argentinas                   0    villa gesell...          a1
    6    06-jan-1990 lockheed l-1329 jetstar 731      n96gs        grecoair                                1    miami intern...          a1

    > tail(pages.df)
         date        type                             registration operator                                fat. location             pic cat
    995  06-nov-1994 antonov 26                       ra-88286     kit space & transport air               0    omulyovka river          a1
    996  09-nov-1994 learjet 55                       pt-lig       líder táxi aéreo                        0    rio de janei...          a1
    997  12-nov-1994 beechcraft 200 super king air    d2-eoj       endiama                                 0    huambo-alban...          a1
    998  13-nov-1994 fokker f-27 friendship 400m      7t-vrk       air algérie                             0    palma de mal...          h2
    999  16-nov-1994 beechcraft c99 commuter          n63995       ameriflight                             1    avenal, ca               a1
    1000 18-nov-1994 tupolev 134a-3                   ha-lbk       malev                                   0    budapest-fer...          o1
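The URL construction step can be checked on its own, without any network access. The following is a minimal sketch that reuses the variable names from the code in this answer and only inspects the vector that expand.grid and sprintf produce; nothing is downloaded.

```r
# Same variables as in the answer's code
years <- 1990:1994
pages <- 1:2
url_1 <- 'http://aviation-safety.net/database/dblist.php?year='
url_3 <- '&lang=&page='

# expand.grid() returns every page/year combination; the first
# column (pages) varies fastest
yp <- expand.grid(pages, years)

# sprintf() is vectorised, so one call builds one URL per row of yp
urls <- sprintf('%s%s%s%s', url_1, yp[[2]], url_3, yp[[1]])

length(urls)   # 10 URLs: 5 years x 2 pages
head(urls, 3)  # 1990 page 1, 1990 page 2, 1991 page 1
```

Because the pages column varies fastest, the URLs come out grouped by year, which is why the scraped rows end up in chronological order.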
With ldply you can integrate the last two steps into one:

    library(plyr)
    pages.df <- ldply(urls, getpage)
Notes:
- When you want all the pages for each year, create a longer pages vector, for example pages <- 1:6. URLs that don't exist are not scraped and are not included in the final dataframe. Using this, you get a dataframe with 1238 rows, the number of accidents in 1990 - 1994.
- In the sprintf code, each %s stands for a string that needs to be pasted together. See ?sprintf.
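If a request for one of the generated URLs raises an error instead of being silently skipped, one option is to wrap the scrape function so that a failure returns NULL, and to drop the NULL entries before binding. This is only a sketch under assumptions: safe_scrape and fake_scraper are hypothetical names that do not appear in the original answer, the example.invalid URLs are placeholders, and fake_scraper stands in for getpage so the example runs without network access.

```r
# Hypothetical helper: run any scraper on a URL and return NULL on error
safe_scrape <- function(u, scraper) {
  tryCatch(scraper(u), error = function(e) NULL)
}

# Stand-in scraper for illustration only; it fails for "page=3"
fake_scraper <- function(u) {
  if (grepl("page=3", u, fixed = TRUE)) stop("no table on this page")
  data.frame(url = u, stringsAsFactors = FALSE)
}

# Placeholder URLs, not real pages
demo_urls <- sprintf("http://example.invalid/list?page=%s", 1:3)

# Scrape each URL, then drop the failed (NULL) results before binding
results <- lapply(demo_urls, safe_scrape, scraper = fake_scraper)
results <- Filter(Negate(is.null), results)
demo_df <- do.call(rbind, results)

nrow(demo_df)  # 2: the failing third URL is left out
```

In real use, getpage would be passed as the scraper argument in place of fake_scraper.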