r - How to scrape multiple tables indexing both year and page?


Thanks to this post: https://stackoverflow.com/a/7775721/7140722

But how do I scrape multiple years?

This is the structure of the query:

http://aviation-safety.net/database/dblist.php?year=1994&lang=&page=1
http://aviation-safety.net/database/dblist.php?year=1994&lang=&page=2

I want to scrape more years. My code:

    library(plyr)  # llply comes from plyr

    year <- 1990:1994

    url1 = 'http://aviation-safety.net/database/dblist.php?year='
    url3 = '&lang=&page='

    getpage <- function(page){
      require(XML)
      # 'year' is a vector here, so paste() builds one URL per year
      url = paste(url1, year, url3, page, sep = "")
      tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
      return(tab)
    }

    pages = llply(1:3, getpage, .progress = 'text')

    crash_all_years = do.call('rbind', pages)

But it doesn't work. Any suggestions?

I think it is better to construct a list of URLs first, and then loop over that list with lapply (or ldply from the plyr package) to scrape the pages.

You can improve your code as follows:

    # load the 'XML' package
    library(XML)

    # set the variables needed to construct the URLs
    years <- 1990:1994
    url_1 <- 'http://aviation-safety.net/database/dblist.php?year='
    url_3 <- '&lang=&page='
    pages <- 1:2

    # construct the list of pages to scrape
    yp <- expand.grid(pages, years)
    urls <- sprintf('%s%s%s%s', url_1, yp[[2]], url_3, yp[[1]])

    # simplified scrape function
    getpage <- function(u){
      readHTMLTable(u, stringsAsFactors = FALSE)[[1]]
    }

    # loop over the list of URLs and scrape each one
    plst <- lapply(urls, getpage)

    # bind the resulting list of dataframes into one dataframe
    pages.df <- do.call(rbind, plst)

This gives a dataframe of the airplane crashes on the first 2 pages of each year from 1990 to 1994:

    > head(pages.df)
             date                             type registration                                operator fat.             location    pic cat
    1 02-jan-1990 casa/nurtanio nc-212 aviocar 200       pk-pcm                      pelita air service    9      banten bay, ...         a1
    2 03-jan-1990          bn-2a trislander mk.iii       yj-rv3                                  vanair    0 near port vila-ba...         a1
    3 04-jan-1990    swearingen sa227-ac metro iii       n31138 chautauqua airlines, opf. usair express    0       hagerstown, md         o1
    4 05-jan-1990       lockheed l-100-30 hercules       d2-thb                      angola air charter    0      menongue air...         c1
    5 05-jan-1990      fokker f-28 fellowship 4000       lv-mzd                   aerolineas argentinas    0      villa gesell...         a1
    6 06-jan-1990      lockheed l-1329 jetstar 731        n96gs                                grecoair    1      miami intern...         a1
    > tail(pages.df)
                date                          type registration                  operator fat.        location    pic cat
    995  06-nov-1994                    antonov 26     ra-88286 kit space & transport air    0 omulyovka river         a1
    996  09-nov-1994                    learjet 55       pt-lig          líder táxi aéreo    0 rio de janei...         a1
    997  12-nov-1994 beechcraft 200 super king air       d2-eoj                   endiama    0 huambo-alban...         a1
    998  13-nov-1994   fokker f-27 friendship 400m       7t-vrk               air algérie    0 palma de mal...         h2
    999  16-nov-1994       beechcraft c99 commuter       n63995               ameriflight    1      avenal, ca         a1
    1000 18-nov-1994                tupolev 134a-3       ha-lbk                     malev    0 budapest-fer...         o1

With ldply you can integrate the last 2 steps into one:

    library(plyr)
    pages.df <- ldply(urls, getpage)
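Both lapply and ldply stop as soon as getpage raises an error for one URL. If you would rather skip such a page and keep going, a minimal sketch could look like the one below. The names safe_scrape and fake_scraper are hypothetical and not part of any package; fake_scraper only stands in for getpage so that the example runs without network access.

```r
# Hypothetical wrapper: run a scraping function and return NULL on error
safe_scrape <- function(u, scraper) {
  tryCatch(scraper(u), error = function(e) NULL)
}

# Stand-in scraper for illustration only (no network access):
# it fails for any "URL" that contains "page=3"
fake_scraper <- function(u) {
  if (grepl('page=3', u, fixed = TRUE)) stop('no table found')
  data.frame(url = u, stringsAsFactors = FALSE)
}

demo_urls <- c('x?page=1', 'x?page=2', 'x?page=3')
demo_lst <- lapply(demo_urls, safe_scrape, scraper = fake_scraper)

# drop the pages that failed, then bind the rest into one dataframe
demo_lst <- Filter(Negate(is.null), demo_lst)
demo.df <- do.call(rbind, demo_lst)
print(demo.df)
```

With the real scraper this would be called as lapply(urls, safe_scrape, scraper = getpage).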

Notes:

  • When you want all the pages for each year, create a longer pages vector, for example pages <- 1:6. The URLs that don't exist are not scraped and are thus not included in the final dataframe. Using this, you get a dataframe of 1238 rows, which is the number of accidents in 1990 - 1994.
  • In the sprintf code, each %s stands for a string that needs to be pasted together. See ?sprintf.
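To see what the URL construction produces without touching the website, here is a small sketch that uses the same url_1 and url_3 strings on a reduced grid of 2 years and 2 pages (the reduced ranges and the name per_year are only for illustration). It also shows what paste does when it gets the whole vector of years together with a single page number: it returns one URL per year instead of a single address.

```r
url_1 <- 'http://aviation-safety.net/database/dblist.php?year='
url_3 <- '&lang=&page='
years <- 1990:1991
pages <- 1:2

# paste() recycles the single page number over the vector of years,
# so the result holds one URL per year
per_year <- paste(url_1, years, url_3, 1, sep = "")
length(per_year)

# expand.grid() builds every page/year combination;
# sprintf() is vectorised and returns one URL per row of yp
yp <- expand.grid(pages, years)
urls <- sprintf('%s%s%s%s', url_1, yp[[2]], url_3, yp[[1]])
print(urls)
```

Because expand.grid varies its first argument fastest, the URLs come out grouped by year, with the page number running within each year.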
