python - Does this benchmark seem relevant? -
i trying benchmark few method of itertools against generators , list comprehensions. idea want build iterator filtering entries base list.
here code came with(edited after accepted answer):
itertools import ifilter import collections import random import os timeit import timer os.system('cls') # define large arrays listarrays = [xrange(100), xrange(1000), xrange(10000), xrange(100000)] #number of element filtered out nb_elem = 100 # number of times run test nb_rep = 1000 def discard(it): collections.deque(it, maxlen=0) def testgenerator(arr, sample): discard(x x in sample if x in arr) def testiterator(arr, sample): discard(ifilter(sample.__contains__, arr)) def testlist(arr, sample): discard([x x in sample if x in arr]) if __name__ == '__main__': arr in listarrays: print 'size of array: %s ' % len(arr) print 'number of iterations %s' % nb_rep sample = random.sample(arr, nb_elem) t1 = timer('testiterator(arr, sample)', 'from __main__ import testiterator, arr, sample') tt1 = t1.timeit(number=nb_rep) t2 = timer('testlist(arr, sample)', 'from __main__ import testlist, arr, sample') tt2 = t2.timeit(number=nb_rep) t3 = timer('testgenerator(arr, sample)', 'from __main__ import testgenerator, arr, sample') tt3 = t3.timeit(number=nb_rep) norm = min(tt1, tt2, tt3) print 'maximum runtime %.6f' % max(tt1, tt2, tt3) print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % \ (tt1/norm, tt2/norm, tt3/norm) print '=========================================== ==========='
and results please note edited version not run on same machine ( useful have normalized results) , ran 32bits interpreter python 2.7.3 :
size of array: 100 number of iterations 1000 maximum runtime 0.125595 normalized times: iterator: 1.000000 list: 1.260302 generator: 1.276030 ====================================================== size of array: 1000 number of iterations 1000 maximum runtime 1.740341 normalized times: iterator: 1.466031 list: 1.010701 generator: 1.000000 ====================================================== size of array: 10000 number of iterations 1000 maximum runtime 17.033630 normalized times: iterator: 1.441600 list: 1.000000 generator: 1.010979 ====================================================== size of array: 100000 number of iterations 1000 maximum runtime 169.677963 normalized times: iterator: 1.455594 list: 1.000000 generator: 1.008846 ====================================================== could give suggestions on improvement , comment on whether or not benchmark can give accurate results?
i know condition in decorator might bias results. hoping suggestions regarding that.
thanks.
first, instead of trying duplicate timeit does, use it. time function may not have enough accuracy useful, , writing dozens of lines of scaffolding code (especially if has hacky things switching on func.__name__) don't need inviting bugs no reason.
assuming there no bugs, won't affect results significantly. you're doing tiny bit of work , charging testiterator, that's once per outer loop. still, there's no benefit doing it, let's not.
def testgenerator(arr,sample): in (x x in sample if x in arr): k = random.random() def testiterator(arr,sample): in ifilter(lambda x: x in sample, arr): k = random.random() def testlist(arr,sample): in [x x in sample if x in arr]: k = random.random() tests = testiterator, testgenerator, testlist arr in listarrays: print 'size of array: %s ' % len(arr) print 'number of iterations %s' % nb_rep sample = random.sample(arr, nb_elem) funcs = [partial(test, arr, sample) test in tests] times = [timeit.timeit(func, number=nb_rep) func in funcs] norm = min(*times) print 'maximum runtime %.6f' % max(*times) print 'normalized times:\n iterator: %.6f \n list: %.6f \n generator: %.6f' % (times[0]/norm,times[1]/norm,times[2]/norm) print '======================================================' next, why doing k = random.random() in there? quick test, executing line n times without complex loop 0.19x long whole thing. so, you're adding 20% each of numbers, dilutes difference between them no reason.
once rid of that, for loop serving no purpose except consume iterator, , that's adding overhead well. of 2.7.3 , 3.3.0, fastest way consume iterator without custom c code deque(it, maxlen=0), so, let's try this:
def discard(it): collections.deque(it, maxlen=0) def testgenerator(arr,sample): discard(x x in sample if x in arr) def testiterator(arr,sample): discard(ifilter(sample.__contains__, arr)) def testlist(arr,sample): discard([x x in sample if x in arr]) or, alternatively, have functions return generator/ifilter/list , make scaffolding call discard on result (it shouldn't matter either way).
meanwhile, testiterator case, trying test cost of lambda vs. inline expression, or cost of ifilter vs. generator? if want test former, correct; if latter, want optimize that. example, passing sample.__contains__ instead of lambda x: x in sample seems 20% faster in 64-bit python 3.3.0 , 30% faster in 32-bit 2.7.2 (although reason not faster @ in 64-bit 2.7.2).
finally, unless you're testing 1 implementation/platform/version, make sure run on many can. example, 64-bit cpython 2.7.2, list , generator neck-and-neck while iterator gradually climbs 1.0x 1.4x lists grow, in pypy 1.9.0, iterator fastest, generator , list starting off 2.1x , 1.9x slower closing 1.2x lists grow.
so, if decided against iterator because "it's slow", might trading large slowdown on pypy smaller speedup on cpython.
of course might acceptable, e.g., because slowest pypy run blazingly fast, or because none of users use pypy, or whatever. it's part of answer "is benchmark relevant?"
Comments
Post a Comment