Python - urllib & requests sometimes fail to get the final URL
To give an overview of the problem: I have a big list of Twitter users' screen names and want to verify whether each account is suspended or not. I don't want to use the Twitter search API, to avoid rate-limit problems (the list is quite big). Therefore, I am trying to use a cluster of computers to label the dataset (whether each account in the database is suspended or not).
If an account is suspended and you try to access it through the link https://twitter.com/screen_name, you are redirected to https://twitter.com/account/suspended.
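Assuming the redirect fires reliably (which, as described below, it does not always), checking the final URL would be enough. A minimal sketch of that idea; the helper name is my own:

```python
SUSPENDED_URL = "https://twitter.com/account/suspended"

def is_suspended_url(final_url):
    # final_url would come from e.g. requests.get(account_url).url,
    # i.e. the URL you end up on after any redirects.
    return final_url == SUSPENDED_URL

print(is_suspended_url("https://twitter.com/account/suspended"))  # True for a suspended account
print(is_suspended_url("https://twitter.com/iamaguygetit"))       # False when no redirect happened
```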
I tried to capture this behaviour using Python 2.7's urllib, using the geturl() method. It works, but not reliably: I don't get the same results on the same link. I tested it on the same account, and sometimes it returns https://twitter.com/account/suspended, while other times it returns https://twitter.com/screen_name.
The same problem occurs with requests.
My code:

import urllib
import requests
from lxml import html

screen_name = 'iamaguygetit'
account_url = "https://twitter.com/" + screen_name
url = requests.get(account_url)
print url.url
req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True
The Twitter server only serves the 302 redirect once; after that it assumes your browser has cached the redirect.

The body of the page does contain a pointer though, so even if you are not redirected you can see there is still a link in there:
>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/iamaguygetit'
>>> r.text
u'<html><body>You are being <a href="https://twitter.com/account/suspended">redirected</a>.</body></html>'
Look for that exact text.
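Putting the two signals together, a sketch of a check that treats an account as suspended if either the final URL is the suspension page or the body still links to it (function and variable names are my own):

```python
SUSPENDED_URL = "https://twitter.com/account/suspended"

def is_suspended(final_url, body):
    # Suspended if we either landed on the suspension page directly
    # or the un-redirected body still links to it.
    return final_url == SUSPENDED_URL or SUSPENDED_URL in body

# Example using the body quoted above, where no redirect was followed:
body = u'<html><body>You are being <a href="https://twitter.com/account/suspended">redirected</a>.</body></html>'
print(is_suspended("https://twitter.com/iamaguygetit", body))  # True
```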