python - How to crawl and download files from a dynamic URL?
I have my own Python crawler (based on CS101 from udacity.com), and I'm trying to download files (installers) from download.cnet.com. While the crawler is crawling, I want it to work like this:
To tell whether a link is a download link:
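import urllib2  # Python 2 standard library
response = urllib2.urlopen('http://example.com/')
# The Content-Type header says what kind of resource the URL returns.
content_type = response.info().get('content-type')
print content_type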
If the crawler gets:
application/octet-stream, the crawler downloads the installer from that link.
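The check described above can be sketched as follows. This is a minimal sketch in Python 3, where `urllib.request` takes the place of `urllib2`. The names `DOWNLOAD_TYPES`, `is_download_type`, and `fetch_if_download` are hypothetical, and treating only `application/octet-stream` as a download is an assumption taken from the question.

```python
import urllib.request

# Content types treated as "this is a file to download".
# Assumption: only the type named in the question; extend as needed.
DOWNLOAD_TYPES = {"application/octet-stream"}


def is_download_type(content_type):
    """Return True if a Content-Type header value names a binary download."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in DOWNLOAD_TYPES


def fetch_if_download(url, dest_path):
    """Open url and save the body to dest_path only if it looks like a download."""
    with urllib.request.urlopen(url) as response:
        if not is_download_type(response.headers.get("Content-Type")):
            return False
        with open(dest_path, "wb") as out:
            # Copy in chunks so a large installer is not held in memory.
            while True:
                chunk = response.read(64 * 1024)
                if not chunk:
                    break
                out.write(chunk)
    return True
```

Keeping the header test in its own function means the crawler can apply it to any response it already has, without opening a second request.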
The problem is that download.com doesn't seem to provide a real download link, so the crawler can't find the download link behind the dynamic links. For example, when I tried to download Opera from download.com, I got a message like this: "Your download will begin in a moment. If it doesn't, restart the download." When I checked the "restart download" link, I was expecting a real download link (e.g. download.com/blah/opera.exe), but instead I got a weird address that the crawler couldn't understand.
So I have confirmed (see http://googlewebmastercentral.blogspot.no/2008/09/dynamic-urls-vs-static-urls.html) that download.com is using dynamic links. What should I do so that the crawler can find a link from which it can download the installer from download.com?
As you've said, you're getting JavaScript or AJAX in the page that activates the download in a "real" browser while stymying efforts to automate it.
Here's a discussion of the same issue: Stack Overflow: Mechanize and JavaScript. As noted there, one option is to use an alternative to Python such as PhantomJS, or a browser automation framework (with optional "remote control") such as Selenium.
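As a rough illustration of the Selenium route, the sketch below loads the page in a real browser and collects the links that exist once it has loaded. It assumes the `selenium` package (version 4 API), Firefox, and its driver are installed. The names `INSTALLER_EXTENSIONS`, `pick_installer_links`, and `rendered_links` are hypothetical, and filtering by file extension is an assumption. A dynamic link may not end in `.exe` at all, in which case the Content-Type check from the question is the more reliable test.

```python
from urllib.parse import urlparse

# File extensions treated as installers; this list is an assumption.
INSTALLER_EXTENSIONS = (".exe", ".msi", ".dmg")


def pick_installer_links(hrefs):
    """Keep only hrefs whose URL path ends in an installer extension."""
    picked = []
    for href in hrefs:
        if not href:
            continue
        # Look at the path only, so a query string does not hide the extension.
        path = urlparse(href).path.lower()
        if path.endswith(INSTALLER_EXTENSIONS):
            picked.append(href)
    return picked


def rendered_links(page_url):
    """Load page_url in a real browser and return the hrefs found after loading."""
    # Imported here so pick_installer_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumes Firefox and its driver are available
    try:
        driver.get(page_url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in anchors]
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

A crawler could call `pick_installer_links(rendered_links(url))` for pages where the plain HTML contains no usable link. If the page adds its link some time after loading, an explicit wait before collecting the anchors would likely be needed.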