python - urllib & requests fail "sometimes" to get the final URL -


To give an overview of the problem: I have a list of Twitter users' screen names, and I want to verify whether each account is suspended or not. I don't want to use the Twitter Search API, to avoid rate-limit problems (the list is quite big). Instead, I'm trying to use a cluster of computers to label the dataset (whether each account in the database is suspended or not).

If an account is suspended on Twitter and you try to access it through the link http://www.twitter.com/screen_name, you are redirected to https://twitter.com/account/suspended.

I tried to capture this behaviour with Python 2.7's urllib, using the geturl() method. It works, but it is not reliable: I don't get the same results on the same link. Testing on the same account, it sometimes returns https://twitter.com/account/suspended, and other times returns http://www.twitter.com/screen_name.

The same problem occurs with requests.

My code:

import requests
import urllib
from lxml import html

screen_name = 'iamaguygetit'
account_url = "https://twitter.com/" + screen_name
url = requests.get(account_url)
print url.url
req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True

The Twitter server only serves the 302 redirect once; after that, it assumes your browser has cached the redirect.
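Because the redirect may not be replayed on later requests, one workaround (a sketch under assumptions, not the answer's own code) is to fetch with `allow_redirects=False` and read the `Location` header yourself. The decision logic can live in a small, testable function; the helper name `final_location` is hypothetical:

```python
def final_location(status_code, headers, requested_url):
    """Return the redirect target if the server sent a redirect
    status, otherwise the URL that was actually requested.

    `headers` is a plain dict of response headers, e.g. from
    requests.get(url, allow_redirects=False).headers.
    """
    if status_code in (301, 302, 303, 307, 308):
        # A redirect response carries its target in the Location header.
        return headers.get("Location", requested_url)
    return requested_url

# Illustrative use with requests (network call not executed here):
# r = requests.get("https://twitter.com/" + screen_name,
#                  allow_redirects=False)
# print final_location(r.status_code, dict(r.headers), r.url)
```

This sidesteps any caching assumptions on the server side: you see exactly what the server answered for this one request.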

The body of the page contains a pointer, though; even if you are not redirected, you can see that there is still a link there:

>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/iamaguygetit'
>>> r.text
u'<html><body>You are being <a href="https://twitter.com/account/suspended">redirected</a>.</body></html>'

Look for that exact text.
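The suggestion above can be sketched as a small predicate that searches the response body for the suspended-account link. This is a minimal sketch, assuming the one-line body shown above; `is_suspended` is a hypothetical helper name:

```python
# The one-shot redirect page links to the suspended-account URL
# in its body, so searching for that link is enough.
SUSPENDED_LINK = 'href="https://twitter.com/account/suspended"'

def is_suspended(body_text):
    """Return True if the page body points at the suspension page."""
    return SUSPENDED_LINK in body_text

# Example with the body shown above:
body = ('<html><body>You are being <a href='
        '"https://twitter.com/account/suspended">redirected</a>.'
        '</body></html>')
# is_suspended(body) -> True
```

A plain substring match is deliberate here: it works on the raw `r.text` without any HTML parsing, so it is robust to the fact that the page is not a full document.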
