python - How to scrape more than 100 Google results in one pass
I am using the requests library in Python to get data from Google results. https://www.google.com.pk/#q=pizza&num=10 returns the first 10 results, as specified by num=10. https://www.google.com.pk/#q=pizza&num=100 returns 100 results.
But if I set num to more than 100, say https://www.google.com.pk/#q=pizza&num=200, Google still returns only the first 100 results.
How can I get more than 100 in one pass?
Code:

    import requests

    url = 'http://www.google.com/search'
    my_headers = {'user-agent': 'mozilla/11.0'}
    # the query must be a string literal
    payload = {'q': 'pizza', 'start': '0', 'num': 200}
    r = requests.get(url, params=payload, headers=my_headers)

In "r" I am getting the URLs of Google's first 100 results, not 200.
You can use a more programmatic API to get Google results rather than trying to screen-scrape the human search interface. The code below has no error checking and makes no assertion that it complies with Google's T&Cs, so I suggest you look into the details of using this URL:
    import requests

    def search(query, pages=4, rsz=8):
        url = 'https://ajax.googleapis.com/ajax/services/search/web'
        params = {
            'v': 1.0,     # version
            'q': query,   # query string
            'rsz': rsz,   # result set size - max 8
        }
        # start offsets run 0, rsz, ..., pages*rsz (inclusive)
        for s in range(0, pages*rsz+1, rsz):
            params['start'] = s
            r = requests.get(url, params=params)
            for result in r.json()['responseData']['results']:
                yield result
E.g. getting 200 results for 'google':

    >>> list(search('google', pages=24, rsz=8))
    [{'GsearchResultClass': 'GwebSearch',
      'cacheUrl': 'http://www.google.com/search?q=cache:y14fcuqogl4j:www.google.com',
      'content': "Search the world's information, including webpages, images, videos and more. \n<b>Google</b> has many special features to help you find exactly what you're looking\xa0...",
      'title': '<b>Google</b>',
      'titleNoFormatting': 'Google',
      'unescapedUrl': 'https://www.google.com/',
      'url': 'https://www.google.com/',
      'visibleUrl': 'www.google.com'},
     ...
    ]
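As a quick check on the arithmetic, the sketch below reproduces only the start offsets that the loop in search() would request, without calling any API. The helper name start_offsets is hypothetical and not part of the original answer; it assumes the same range(0, pages*rsz+1, rsz) bounds.

```python
def start_offsets(pages, rsz=8):
    """Return the 'start' values requested by the loop in search()."""
    # The upper bound pages*rsz+1 makes the range include pages*rsz itself,
    # so the loop issues pages + 1 requests rather than pages.
    return list(range(0, pages * rsz + 1, rsz))


offsets = start_offsets(pages=24, rsz=8)
print(len(offsets))      # number of requests: 25
print(len(offsets) * 8)  # upper bound on results if every page is full: 200
```

So pages=24 with rsz=8 yields 200 results only because the range is inclusive at the top end; whether every page is actually full depends on the service.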
To use Google's Custom Search API you need to sign up as a developer. You get 100 free queries per day (I'm not sure whether that counts API calls, or whether paginating the same query counts as 1 query):
- Sign up at https://console.developers.google.com
- Create a project
- Create a key
- Enable the Custom Search API
- Create a custom search engine at https://cse.google.com
- Use a dummy site to initialise the CSE
- Edit the CSE to search the entire web
- Delete the dummy site
- Get the CSE reference (look at the public URL for cx=<cse reference>)
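The last step can also be done in code instead of by eye. This is a minimal sketch using only the standard library; the function name cse_reference and the example URL (including its cx value) are made up for illustration, so substitute the public URL of your own search engine.

```python
from urllib.parse import urlparse, parse_qs


def cse_reference(public_url):
    """Return the value of the cx query parameter, or None if it is absent."""
    query = parse_qs(urlparse(public_url).query)
    values = query.get("cx")
    return values[0] if values else None


# Hypothetical public URL; the cx value is invented for this example.
example = "https://cse.google.com/cse/publicurl?cx=0123456789:abcdefghijk"
print(cse_reference(example))
```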
Then you can use requests to make the query:

    import requests

    url = 'https://www.googleapis.com/customsearch/v1'
    params = {
        'key': '<key>',
        'cx': '<cse reference>',
        'q': '<search>',
        'num': 10,
        'start': 1,
    }
    resp = requests.get(url, params=params)
    results = resp.json()['items']
With start you can do similar pagination to the above.
There are lots of other parameters available; you can look at the REST documentation for CSE: https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request
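A minimal sketch of that pagination, kept separate from the network call so it can be tried offline. The names paginate and fetch_page are hypothetical: fetch_page stands for any function you supply that takes a 1-based start and a num and returns a list of items, for example a wrapper around the requests.get call with those two parameters filled in. The page size of 10 reflects the documented maximum for num.

```python
def paginate(fetch_page, total, page_size=10):
    """Yield up to `total` items by calling fetch_page(start, num) repeatedly."""
    start = 1  # the Custom Search API uses a 1-based start index
    remaining = total
    while remaining > 0:
        num = min(page_size, remaining)
        items = fetch_page(start, num)
        if not items:
            break  # nothing returned: stop paging
        yield from items[:num]
        if len(items) < num:
            break  # short page: no more results available
        start += num
        remaining -= num


# Stand-in for a real API call, serving 25 fake results.
def fake_fetch(start, num):
    data = [f"result-{i}" for i in range(1, 26)]
    return data[start - 1:start - 1 + num]


results = list(paginate(fake_fetch, total=23))
print(len(results), results[0], results[-1])  # 23 result-1 result-23
```

Each call to fetch_page would be one API request, so asking for 23 results here costs three requests.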
Google also has a client API library (pip install google-api-python-client) that you can use:

    from googleapiclient import discovery

    # the keyword argument is developerKey (capital K)
    service = discovery.build('customsearch', 'v1', developerKey='<key>')
    params = {
        'q': '<query>',
        'cx': '<cse reference>',
        'num': 10,
        'start': 1,
    }
    query = service.cse().list(**params)
    results = query.execute()['items']
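One detail worth guarding against: a response with no matches may not contain an 'items' key at all, in which case indexing with ['items'] raises KeyError. A minimal sketch, assuming the response is a dict of the shape returned above and that each item carries a 'link' field; the helper name result_links is hypothetical.

```python
def result_links(response):
    """Return the 'link' of every item, or [] when the response has no items."""
    return [item["link"] for item in response.get("items", []) if "link" in item]


# Hand-written sample responses, not real API output.
sample = {"items": [{"title": "Example", "link": "https://example.com/"}]}
print(result_links(sample))
print(result_links({}))
```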