Crawling multiple sites with Python Scrapy with limited depth per site

I am new to Scrapy and I am trying to crawl multiple sites listed in a text file with CrawlSpider. I would like to limit the depth of scraping per site, and also the total number of crawled pages, again per web site. Unfortunately, when the start_urls and allowed_domains attributes are set, response.meta['depth'] always seems to be 0 (this doesn't happen when I try to scrape individual sites). Setting DEPTH_LIMIT in the settings file doesn't seem to do anything at all. When I remove the __init__ definition and simply set start_urls and allowed_domains, things seem to work fine. Here is the code (sorry for the indentation, it is not the issue):

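# Python 2 with the legacy scrapy.contrib modules
from urlparse import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, urls_file, n=10):
        # Read the first n URLs from the file
        data = open(urls_file, 'r').readlines()[:n]
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        print response.url
        print response.meta['depth']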

This results in response.meta['depth'] always being equal to 0, and the crawler only crawls the first page of each element of start_urls (i.e. it doesn't follow any links). So I have two questions:

1) How can I limit the crawl depth for each site in start_urls?
2) How can I limit the total number of crawled pages per site, irrespective of the depth?

Thanks!

Don't forget to call the base class constructor (for example with super):

def __init__(self, urls_file, n=10, *a, **kw):
    data = open(urls_file, 'r').readlines()[:n]
    self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
    self.start_urls = ['http://' + domain for domain in self.allowed_domains]
    # Run CrawlSpider.__init__ as well, passing through any extra arguments
    super(DownloadSpider, self).__init__(*a, **kw)

Update:

When you override a method in Python, the base class method is no longer called; the new method is called instead. This means that if you want your new logic to run in addition to the old logic (i.e. not instead of it), you need to call the old logic manually.
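The effect can be reproduced without Scrapy. The following is a minimal sketch in which ToyBase, Broken and Fixed are hypothetical stand-ins (they are not Scrapy classes), and the base constructor just copies rules into _rules to mimic the kind of setup a base class might do:

```python
class ToyBase(object):
    """Hypothetical stand-in for a base class such as CrawlSpider."""

    rules = ("follow-links",)

    def __init__(self):
        # Setup work that the rest of the class relies on.
        self._rules = list(self.rules)


class Broken(ToyBase):
    def __init__(self, urls):
        # Overrides __init__ without calling the base version,
        # so self._rules is never created.
        self.start_urls = list(urls)


class Fixed(ToyBase):
    def __init__(self, urls, *a, **kw):
        self.start_urls = list(urls)
        # Run the base class logic in addition to the new logic.
        super(Fixed, self).__init__(*a, **kw)


broken = Broken(["http://example.com"])
fixed = Fixed(["http://example.com"])

print(hasattr(broken, "_rules"))  # False
print(hasattr(fixed, "_rules"))   # True
```

Both subclasses set start_urls, but only Fixed ends up with the attribute that the base constructor creates.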

Here is the logic you were missing by not calling CrawlSpider.__init__() (via super(DownloadSpider, self).__init__()):

self._compile_rules()
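Once the rules are compiled and links are actually followed, the DEPTH_LIMIT setting mentioned in the question should start to have an effect. Depth is counted from each start URL, so this would act as a per-site limit as long as there is one start URL per site. A minimal settings fragment might look like this (the value 2 is an arbitrary example, not something taken from the question):

```python
# settings.py (fragment)
# Maximum number of link hops to follow from each start URL.
# The value 2 is only an illustrative assumption.
DEPTH_LIMIT = 2
```
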
