python - Download webpage source up to a keyword

I'm looking to download the source code of a website up to a particular keyword. (The websites are all from a forum, so I'm only interested in the source code for the first post's user details.) So I only need to download the source code until I find "" for the first time in the source code.

How to get a webpage title without downloading all the page source

This question, although in a different language, is quite similar to what I'm looking to do, but I'm not that experienced with Python, so I can't figure out how to recode that answer in Python.

First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost: you can't reuse connections if you close them early; you have to recv smaller pieces at a time if you want to be able to cancel early; etc.

If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
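To make the range-request idea concrete, here is a minimal sketch using only the Python 3 standard library (the rest of this answer was written against Python 2, so treat that as an assumption). The helper names build_range_request and fetch_prefix, and the 4096-byte guess used below, are made up for illustration. A server that honors the header should answer 206 Partial Content; one that ignores it will most likely answer 200 with the whole page, which is why the read is capped as well.

```python
import urllib.request


def build_range_request(url, max_bytes):
    """Build a GET request that asks only for the first max_bytes bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    # HTTP byte ranges are inclusive, so bytes=0-4095 covers 4096 bytes.
    headers = {"Range": "bytes=0-%d" % (max_bytes - 1)}
    return urllib.request.Request(url, headers=headers)


def fetch_prefix(url, max_bytes, timeout=10):
    """Return (status, body) for the start of the page.

    A 206 status means the range was honored; a 200 means the server
    ignored the Range header and started sending the full page.
    """
    request = build_range_request(url, max_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Cap the read in case the server ignores Range.
        return response.status, response.read(max_bytes)
```

You would call something like fetch_prefix(page_url, 4096), search the body for your keyword, and fall back to a bigger range (or a full download) if it isn't there.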

But, if you want to know how to close the socket early:

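urllib2.urlopen, requests, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in, to give you a high-level file-like interface. On top of that, their API is blocking. Neither of those is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to happen as soon after the recv as possible.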

So, you may want to consider using one of the Python wrappers around libcurl, which gives you a pretty good balance between power/flexibility and ease of use. For example, with pycurl:

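    import pycurl

    # Marker to stop at; bytes, because pycurl hands the callback raw bytes.
    needle = b'<div style="float: right; margin-left: 8px;">'
    buf = b''

    def callback(newbuf):
        global buf
        buf += newbuf
        if needle in buf:
            # Returning anything other than len(newbuf) aborts the transfer.
            return 0
        return len(newbuf)

    c = pycurl.Curl()
    c.setopt(c.URL, 'http://curl.haxx.se/dev/')
    c.setopt(c.WRITEFUNCTION, callback)
    try:
        c.perform()
    except Exception as e:
        # An early abort surfaces here as a pycurl.error (write error).
        print(e)
    c.close()

    print(len(buf))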

As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change the string to something that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.

Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.

If the pages are really big, you may want to change the code to only check buf[-len(needle):] + newbuf instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
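Here is one way that tail-only check could look, as a rough sketch rather than a drop-in replacement. The KeywordScanner class and its method names are invented for this example; its feed method is meant to be called from a write callback like the one above, once per incoming chunk. It keeps only the last len(needle) bytes between calls, so each search touches one chunk plus a short tail instead of the whole buffer.

```python
class KeywordScanner:
    """Collect chunks and look for needle without rescanning old data."""

    def __init__(self, needle):
        if not needle:
            raise ValueError("needle must not be empty")
        self.needle = needle
        self.chunks = []   # everything received so far, in order
        self.tail = b""    # last len(needle) bytes seen so far
        self.found = False

    def feed(self, chunk):
        """Add a chunk; return True once needle has been seen."""
        self.chunks.append(chunk)
        # A match can only be new if it ends inside this chunk, so the
        # old tail plus the new chunk is all that needs searching.
        window = self.tail + chunk
        if self.needle in window:
            self.found = True
        self.tail = window[-len(self.needle):]
        return self.found

    def data(self):
        """Return all bytes received so far."""
        return b"".join(self.chunks)
```

In a pycurl write callback you would return 0 when feed(newbuf) returns True and len(newbuf) otherwise. Appending chunks to a list and joining once at the end also avoids repeatedly copying a growing buffer.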

One last thing: if you're reading from, say, 500 pages, doing them concurrently (say, 8 at a time) is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this; it's just a suggestion to do that as well. (See the retriever-multi.py sample if you want to let curl handle the concurrency for you… or just use multiprocessing or concurrent.futures to use a pool of child processes.)
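As a sketch of the "8 at a time" idea with concurrent.futures: note that this uses a thread pool rather than the pool of child processes mentioned above, which is a swap on my part. Threads are usually enough for network-bound work and avoid pickling the fetch function; ProcessPoolExecutor has the same submit interface if you prefer processes, as long as fetch_one is a picklable top-level function. The names fetch_all and fetch_one are hypothetical; fetch_one stands for whatever single-page download you already have.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, max_workers=8):
    """Run fetch_one(url) for every URL, at most max_workers at a time.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one bad page doesn't abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Each call to fetch_one should create its own pycurl.Curl handle (or its own request) rather than sharing one across threads.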
