Finding and downloading images within the Wikipedia Dump
I'm trying to find a comprehensive list of images on Wikipedia, so that I can filter it down to the public domain ones. I've downloaded the SQL dumps from here:
http://dumps.wikimedia.org/enwiki/latest/
and studied the DB schema.
I think I understand it, but when I pick a sample image from a Wikipedia page, I can't find it anywhere in the dumps. Example:
http://en.wikipedia.org/wiki/File:Carrizo_2a.jpg
I've done greps on the 'image', 'imagelinks', and 'page' dumps looking for 'Carrizo_2a.jpg', and it's not found.
Are these dumps not complete? Am I misunderstanding the structure? Is there a better way to do this?
Also, to jump ahead one step: after I have filtered the list down and want to download a bulk set of images (thousands), I saw some mentions that I need a mirror of the site to prevent overloading Wikipedia/Wikimedia. If anyone has guidance on that too, it would be helpful.
MediaWiki stores file data in two or three places, depending on how you count:

The actual metadata for current file versions is stored in the image table. This is presumably what you want; you'll find the latest en.wikipedia dump of it here.

Data for old, superseded file revisions is moved to the oldimage table, which has essentially the same structure as the image table. That table is also dumped, with the latest one here.

Finally, each file (normally) also corresponds to a pretty ordinary wiki page in namespace 6 (File:). You'll find the text of these in the XML dumps, the same as for other pages.
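If it helps, here is a minimal sketch (assuming Python 3 and that you've downloaded the image table dump to a local file; the path and function name are just for illustration) of scanning such a dump for a given file name. Note that the img_name column stores titles with underscores instead of spaces and without the File: prefix:

```python
import gzip

# Hypothetical local path to the image table dump, e.g. fetched from
# http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-image.sql.gz
DUMP_PATH = "enwiki-latest-image.sql.gz"

def filename_in_image_table(filename, dump_path=DUMP_PATH):
    """Return True if the given file name appears in the image table dump."""
    # Normalize the title the way the database stores it: underscores, no prefix.
    needle = filename.replace(" ", "_").encode("utf-8")
    with gzip.open(dump_path, "rb") as dump:
        # The dump is a series of very long INSERT statements; a simple
        # substring scan is enough to check for a particular file name.
        for line in dump:
            if needle in line:
                return True
    return False

if __name__ == "__main__":
    print(filename_in_image_table("Carrizo 2a.jpg"))
```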
Oh, and the reason you're not finding that file in the English Wikipedia dumps is that it's hosted on the shared repository at Wikimedia Commons. You'll find it in the Commons data dumps instead.
As for downloading the actual files, here's the (apparently) official documentation. As far as I can tell, all they mean by "bulk download is (as of September 2012) available from mirrors but not offered directly from Wikimedia servers" is that if you want all the images in a tarball, you'll have to use a mirror. If you're only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.
Just remember to exercise basic courtesy: send a user-agent string identifying yourself, and don't hit the servers too hard. In particular, I'd recommend running the downloads sequentially, so that you only start downloading the next file after you've finished the previous one. Not only is that easier to implement than parallel downloading anyway, but it ensures that you don't hog more than your share of the bandwidth and allows the download speed to more or less automatically adapt to server load.
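For example, a sequential downloader along those lines might look like the sketch below (Python 3 standard library only; the user-agent string and the one-second delay are placeholders you should adjust to your own situation):

```python
import time
import urllib.request

# Identify yourself to the server operators; replace with your own details.
USER_AGENT = "MyImageFetcher/0.1 (contact: you@example.com)"

def download_sequentially(urls, delay=1.0):
    """Download each URL in turn, pausing briefly between requests."""
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        # Finish each file completely before starting the next one.
        with urllib.request.urlopen(request) as response, open(filename, "wb") as out:
            out.write(response.read())
        time.sleep(delay)  # be gentle; don't hammer the servers
```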
PS. Whether you download the files from a mirror or directly from the Wikimedia servers, you're going to need to figure out which directory they're in. Typical Wikipedia file URLs look like this:
http://upload.wikimedia.org/wikipedia/en/a/ab/file_name.jpg
where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed under "wikipedia/commons"), and the "a/ab" part is given by the first two hex digits of the MD5 hash of the filename in UTF-8 (as they're encoded in the database dumps).