Crawling multiple sites with Python Scrapy with limited depth per site
I am new to Scrapy, and I am trying to crawl multiple sites listed in a text file with a CrawlSpider. I would like to limit the depth of scraping per site, and also the total number of crawled pages, again per website. Unfortunately, when the start_urls and allowed_domains attributes are set in __init__, response.meta['depth'] always seems to be 0 (this doesn't happen when I try to scrape individual sites). Setting DEPTH_LIMIT in the settings file doesn't seem to do anything at all. When I remove the __init__ definition and simply set start_urls and allowed_domains directly, things seem to work fine. Here is the code (sorry for the indentation; it is not the issue):
    # Imports for the older, Python 2 era Scrapy API used in this snippet
    from urlparse import urlparse
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        rules = (
            Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
        )

        def __init__(self, urls_file, n=10):
            # Read the first n URLs from the file, one URL per line
            data = open(urls_file, 'r').readlines()[:n]
            self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
            self.start_urls = ['http://' + domain for domain in self.allowed_domains]

        def parse_start_url(self, response):
            return self.parse_item(response)

        def parse_item(self, response):
            print response.url
            print response.meta['depth']

This results in response.meta['depth'] always being equal to 0, and the crawler only crawls the very first page of each element of start_urls (i.e. it doesn't follow any links). So I have two questions:

1) How do I limit the crawl to a certain depth for each site in start_urls?
2) How do I limit the total number of pages crawled per site, irrespective of the depth?
Thanks!
Don't forget to call the base class constructor (for example, with super):
    def __init__(self, urls_file, n=10, *a, **kw):
        data = open(urls_file, 'r').readlines()[:n]
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        # Run CrawlSpider's own setup in addition to the code above
        super(DownloadSpider, self).__init__(*a, **kw)

Update:
When you override a method in Python, the base class method is no longer called; the new method is called instead. This means that if you want your new logic to run in addition to the old logic (i.e. not instead of it), you need to call the old logic manually.
Here is the logic you were missing by not calling CrawlSpider.__init__() (via super(DownloadSpider, self).__init__()):

    self._compile_rules()
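To make the effect concrete, here is a minimal sketch that reproduces the pattern without Scrapy. The names BaseCrawler, BrokenSpider, FixedSpider and links_to_follow are hypothetical stand-ins invented for this illustration; only the idea of a base __init__ that calls self._compile_rules() comes from the answer above, and the real CrawlSpider does more than this.

```python
class BaseCrawler(object):
    """Hypothetical stand-in for a framework base class (not Scrapy itself)."""
    rules = ()

    def __init__(self, *args, **kwargs):
        # Setup step that a subclass loses if it never calls this __init__.
        self._compile_rules()

    def _compile_rules(self):
        # Copy the class-level rules into per-instance state.
        self._rules = list(self.rules)

    def links_to_follow(self):
        # With no compiled rules there is nothing to follow.
        return list(getattr(self, "_rules", []))


class BrokenSpider(BaseCrawler):
    rules = ("follow-all",)

    def __init__(self, start_urls):
        # Overrides __init__ and never calls the base version.
        self.start_urls = list(start_urls)


class FixedSpider(BaseCrawler):
    rules = ("follow-all",)

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = list(start_urls)
        # Run the base class setup in addition to our own.
        super(FixedSpider, self).__init__(*args, **kwargs)


broken = BrokenSpider(["http://example.com"])
fixed = FixedSpider(["http://example.com"])

print(hasattr(broken, "_rules"))   # False
print(broken.links_to_follow())    # []
print(fixed.links_to_follow())     # ['follow-all']
```

In this toy version, the spider that skips the base constructor ends up with no compiled rules and therefore nothing to follow, which is analogous to the crawler in the question stopping at the first page of each start URL.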