Scrapy simply visit a webpage using a sitemap

June 15, 2013

I am trying to make a crawler that visits the links on a sitemap. My Scrapy project seems to be doing that, but it moves slowly, and I'm not quite sure whether it is attempting other things while it is running. The robots.txt does set a crawl delay of 10. One thing I noticed is that the 1.xml file has a link in its list that I'm wondering might be causing the issue. I am getting a crawl speed of 4-5 pages per minute. The code I am using is:

    from scrapy.spiders import Spider
    from localspider.items import LocalspiderItem  # class name casing assumed; match items.py
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spiders import SitemapSpider
    import scrapy
    import re


    class MySpider(SitemapSpider):
        name = "localspider"
        allowed_domains = ['localhost']
        download_delay = 10
        concurrent_requests = 3
        download_timeout = 120
        redirect_max_times = 5
        robotstxt_obey = True
        sitemap_urls = ['http://localhost/sites/default/files/xmlsitemap/nxhscre0440pfpi5dsznevgmaul25kojd7u4e9azwom/1.xml']

        def parse(self, response):
            # Only log each visited URL; nothing is extracted from the page.
            self.logger.info('parse function called on %s', response.url)
            pass

I also ran the script without:

    download_delay = 10
    concurrent_requests = 3
    download_timeout = 120
    redirect_max_times = 5
    robotstxt_obey = True

That moved faster, but I'm afraid it may get flagged as spam. It also caused my laptop to use a lot of memory: the LAMP stack stopped working and pages returned error 500.
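For reference, Scrapy's documented setting names are upper-case, and `custom_settings` is the documented class attribute for per-spider overrides. A minimal sketch of the same five values in that form follows. The dict name `CUSTOM_SETTINGS` is hypothetical, and nothing here has been checked against this project, so whether it changes the crawl speed or the memory use is an open question.

```python
# Hypothetical per-spider settings that mirror the values from the question.
# The keys are Scrapy setting names; the variable name is only illustrative.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 10,        # seconds between requests to the same site
    "CONCURRENT_REQUESTS": 3,    # cap on requests in flight at once
    "DOWNLOAD_TIMEOUT": 120,     # seconds before a download is abandoned
    "REDIRECT_MAX_TIMES": 5,     # redirects followed per request
    "ROBOTSTXT_OBEY": True,      # respect robots.txt allow/disallow rules
}

# Usage inside the spider class (not executed here, since it needs Scrapy):
#     class MySpider(SitemapSpider):
#         custom_settings = CUSTOM_SETTINGS

for key, value in CUSTOM_SETTINGS.items():
    print(f"{key} = {value!r}")
```

Keeping the values in one dict makes it easy to try a smaller delay on the local server without deleting the other limits.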

How can I better optimize my Scrapy script? There are 4,000 links.
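To put the numbers above in perspective, here is a rough back-of-the-envelope estimate. It assumes a single host and that requests are spaced, on average, by the 10 second delay, which would cap the rate at about 6 pages per minute. The 4 and 5 pages per minute rates and the 4,000 links come from the question; the helper name `crawl_hours` is hypothetical.

```python
def crawl_hours(total_links: int, pages_per_minute: float) -> float:
    """Approximate total crawl time in hours at a steady page rate."""
    return total_links / pages_per_minute / 60.0


TOTAL_LINKS = 4000      # from the question
DOWNLOAD_DELAY = 10     # seconds, from the question

# Assumption: one request per delay interval is the best case for one host.
ceiling = 60.0 / DOWNLOAD_DELAY

for rate in (4.0, 5.0, ceiling):
    print(f"{rate:.0f} pages/min -> {crawl_hours(TOTAL_LINKS, rate):.1f} hours")
# prints:
# 4 pages/min -> 16.7 hours
# 5 pages/min -> 13.3 hours
# 6 pages/min -> 11.1 hours
```

Under these assumptions the observed 4-5 pages per minute is close to what a 10 second delay allows, so the delay alone may account for most of the slowness.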
