Scrapy: simply visit webpages using a sitemap
I am trying to make a crawler that visits the links on a sitemap. The Scrapy project seems to be doing that, but it moves slowly, and I'm not quite sure whether it is attempting other things while running. The robots.txt sets a crawl delay of 10. One thing I noticed is that the 1.xml file has a link in its list, and I'm wondering if that might be causing the issue. I am getting a crawl speed of 4-5 pages per minute. The code I am using is:
    from scrapy.spiders import Spider
    from localspider.items import LocalspiderItem
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spiders import SitemapSpider
    import scrapy
    import re

    class MySpider(SitemapSpider):
        name = "localspider"
        allowed_domains = ['localhost']
        download_delay = 10
        concurrent_requests = 3
        download_timeout = 120
        redirect_max_times = 5
        robotstxt_obey = True
        sitemap_urls = ['http://localhost/sites/default/files/xmlsitemap/nxhscre0440pfpi5dsznevgmaul25kojd7u4e9azwom/1.xml']

        def parse(self, response):
            self.logger.info('parse function called on %s', response.url)
            pass

I also ran the script without:

    download_delay = 10
    concurrent_requests = 3
    download_timeout = 120
    redirect_max_times = 5
    robotstxt_obey = True

That moved much faster, but I'm afraid it may be flagged as spam. It also caused my laptop to use a lot of memory. The LAMP stack stopped working, and pages returned error 500.
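One possible middle ground between the two runs above, sketched as a settings fragment: keep concurrency low and let Scrapy's AutoThrottle extension adapt the delay to how quickly the server responds, instead of using either a fixed 10-second delay or no limits at all. The setting names are Scrapy's documented ones, but the values are assumptions picked for a small local LAMP server and would need tuning. As far as I know, concurrent_requests, redirect_max_times and robotstxt_obey are only read as uppercase settings, so writing them as lowercase class attributes may have no effect.

```python
# settings.py (sketch): all values are assumptions, tune them for your server.

ROBOTSTXT_OBEY = True  # respect robots.txt allow/disallow rules

# Keep the number of simultaneous requests small so the local server
# is not overwhelmed.
CONCURRENT_REQUESTS = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 3

# Minimum delay between requests to the same site, in seconds.
# AutoThrottle does not go below this value.
DOWNLOAD_DELAY = 1
DOWNLOAD_TIMEOUT = 120
REDIRECT_MAX_TIMES = 5

# AutoThrottle adjusts the delay based on observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay, seconds
AUTOTHROTTLE_MAX_DELAY = 10            # ceiling when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests to aim for
```

The same keys could instead go in a custom_settings dict on the spider class if they should apply to this spider only.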
How can I better optimize my Scrapy script? There are 4,000 links.
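For a rough sense of scale, here is a back-of-the-envelope calculation using the numbers from the question. It assumes the delay is applied between consecutive requests to the same host, so a fixed 10-second delay on its own caps throughput at about 6 pages per minute. The function names are made up for illustration.

```python
def max_pages_per_minute(download_delay_s: float) -> float:
    """Upper bound on throughput when one request is sent per delay interval."""
    return 60.0 / download_delay_s


def crawl_hours(n_links: int, pages_per_minute: float) -> float:
    """Hours needed to visit n_links at a steady crawl rate."""
    return n_links / pages_per_minute / 60.0


if __name__ == "__main__":
    # Ceiling imposed by a 10-second delay.
    print("max pages/min:", max_pages_per_minute(10))
    # Observed range (4-5 pages/min) plus the ceiling (6 pages/min).
    for rate in (4, 5, 6):
        print(rate, "pages/min ->", round(crawl_hours(4000, rate), 1), "hours")
```

At the observed 4-5 pages per minute, 4,000 links take roughly 13 to 17 hours, so the measured speed is close to what a 10-second delay allows in the first place.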