python - How to crawl only two pre-defined pages, but they scrape different items?
I'm trying to scrape a site driven by user input. For example, the user gives me the pid of a product and a name, and a separate program launches the spider, gathers the data, and returns it to the user.
However, the information I want about the product and the person is found in two XML links. If I know these two links and their pattern, how can I build a callback to parse the different items?
For example, if I have these two items defined:
from scrapy.item import Item, Field

class PersonItem(Item):
    name = Field()
    ...

class ProductItem(Item):
    pid = Field()
    ...
and I know the links follow these patterns:
www.example.com/person/*<name_of_person>*/person.xml
www.example.com/*<product_pid>*/product.xml
then my spider would look something like this:
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"

    # simulated input given by the user
    pid = "4545-fw"
    person = "bob"

    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/person/%s/person.xml' % person,
                  'http://www.example.com/%s/product.xml' % pid]

    def parse(self, response):
        # not sure here whether I'm scraping the person or the product
        pass
I know I can define rules using Rule(SgmlLinkExtractor()) and give the person and the product each their own parse callback. However, I'm not sure how to apply that here, since I think rules are meant for crawling deeper, whereas I only need to scrape at the surface level.
If you want to be retroactive, you can put the logic in parse():
def parse(self, response):
    if 'person.xml' in response.url:
        item = PersonItem()
    elif 'product.xml' in response.url:
        item = ProductItem()
    else:
        raise Exception('Could not determine item type')
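For completeness, here is a minimal sketch of how that parse() might go on to fill in the items once the type is known. The XPath expressions and element names (person/name, product/pid) are assumptions about the XML layout, not something given in the question:

from scrapy.selector import XmlXPathSelector

def parse(self, response):
    xxs = XmlXPathSelector(response)
    if 'person.xml' in response.url:
        item = PersonItem()
        # hypothetical element; adjust to the real XML structure
        item['name'] = xxs.select('//person/name/text()').extract()
    elif 'product.xml' in response.url:
        item = ProductItem()
        # hypothetical element; adjust to the real XML structure
        item['pid'] = xxs.select('//product/pid/text()').extract()
    else:
        raise Exception('Could not determine item type')
    return item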
Update:

If you want to be proactive, you can override start_requests():
from scrapy.http import Request
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    pid = "4545-fw"
    person = "bob"

    def start_requests(self):
        start_urls = (
            ('http://www.example.com/person/%s/person.xml' % self.person, PersonItem),
            ('http://www.example.com/%s/product.xml' % self.pid, ProductItem),
        )
        for url, cls in start_urls:
            yield Request(url, meta=dict(cls=cls))

    def parse(self, response):
        item = response.meta['cls']()
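As a variation on the same idea (my own sketch, not part of the original answer): since you have exactly two known URLs, you can also give each request its own dedicated callback via Request's standard callback argument, which avoids the type dispatch entirely:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    pid = "4545-fw"
    person = "bob"

    def start_requests(self):
        # one dedicated callback per known URL pattern
        yield Request('http://www.example.com/person/%s/person.xml' % self.person,
                      callback=self.parse_person)
        yield Request('http://www.example.com/%s/product.xml' % self.pid,
                      callback=self.parse_product)

    def parse_person(self, response):
        item = PersonItem()
        # populate person fields here
        return item

    def parse_product(self, response):
        item = ProductItem()
        # populate product fields here
        return item

This keeps each parser trivially simple; the meta approach above scales better if many URL patterns share one generic parsing routine.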