Python efficiency and large objects in memory
I have multiple processes, each dealing with lists of 40,000 tuples. This just about maxes out the memory available on the machine. If I do this:
while len(collection) > 0:
    row = collection.pop(0)
    row_count = row_count + 1
    new_row = []
    for value in row:
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ""
        # escape naughty characters
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    new_row = "\t".join(new_row)
    rows += "\n" + new_row
    if row_count % 5000 == 0:
        gc.collect()
does this free up any more memory?
Since collection is shrinking at the same rate that rows is growing, your memory usage will remain stable. The gc.collect() call is not going to make much of a difference.
Memory management in CPython is subtle. Just because you remove references and run a collection cycle doesn't mean that the memory is returned to the OS. See this answer for details.
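As a rough illustration (a minimal sketch, assuming the third-party psutil package, which the original code does not use): even after deleting a large object and collecting, the resident set size the OS reports for the process frequently does not drop back to where it started.

import gc
import psutil

proc = psutil.Process()

big = [str(i) * 10 for i in range(1000000)]   # roughly tens of MB of small strings
print(proc.memory_info().rss)                 # RSS with the list alive

del big
gc.collect()
print(proc.memory_info().rss)                 # often still well above the starting size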
To really save on memory, you should structure this code around generators and iterators instead of large lists of items. I'm surprised you're having connection timeouts, because fetching all the rows should not take much more time than fetching a row at a time and performing the simple processing you are doing. Perhaps we should have a look at your DB-fetching code?
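A minimal sketch (not from the original post) of the difference: a list comprehension holds all 40,000 items at once, while a generator expression keeps only one item in flight at a time.

import sys

eager = [(i, str(i)) for i in range(40000)]    # every tuple built and stored up front
lazy = ((i, str(i)) for i in range(40000))     # tuples are built only as they are consumed

print(sys.getsizeof(eager))   # a few hundred kilobytes for the list object alone
print(sys.getsizeof(lazy))    # a few hundred bytes at most, regardless of item count

for item in lazy:             # items are produced one at a time and can be
    pass                      # garbage-collected as soon as they are consumed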
If row-at-a-time processing is not a possibility, then at least keep your data as an immutable deque and perform all processing on it with generators and iterators.
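As a rough illustration of why (a small timing sketch, not part of the original code): list.pop(0) has to shift every remaining element, so draining a list from the front is O(n) per pop, while deque.popleft() is O(1).

import timeit

setup = "from collections import deque; data = list(range(40000))"
print(timeit.timeit("c = list(data)\nwhile c: c.pop(0)", setup=setup, number=1))     # noticeably slow
print(timeit.timeit("c = deque(data)\nwhile c: c.popleft()", setup=setup, number=1)) # milliseconds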
I'll outline these different approaches below.
First of all, some common functions:
# If you don't need random access to elements in a sequence,
# a deque uses less memory and has faster appends and deletes
# from both the front and the back.
from collections import deque
from itertools import izip, repeat, islice, chain
import re

re_redshift_chars = re.compile(r'[abcdefg]')

def istrjoin(sep, seq):
    """Return a generator that acts like sep.join(seq), but lazily.

    The separator will be yielded separately.
    """
    return islice(chain.from_iterable(izip(repeat(sep), seq)), 1, None)

def escape_redshift(s):
    return re_redshift_chars.sub(r'\\\g<0>', s)

def tabulate(row):
    return "\t".join(escape_redshift(str(v)) if v is not None else '' for v in row)
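A quick sanity check of those helpers, with hypothetical sample values that are not from the original post:

row = ("abc", None, 123)
print(tabulate(row))                              # -> '\a\b\c' + tab + '' + tab + '123'
print("".join(istrjoin(", ", ["x", "y", "z"])))   # -> 'x, y, z'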
Now for the ideal of row-at-a-time processing, it would be something like this:
cursor = db.cursor()
cursor.execute("""select * from bigtable""")
rowstrings = (tabulate(row) for row in cursor.fetchall())
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()
This will take the least possible amount of memory -- only a row at a time.
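One caveat: cursor.fetchall() still builds the complete list of rows inside the driver before the generator ever sees them. A hedged variant, assuming your DB-API driver supports iterating the cursor directly (optional in the DB-API spec, but provided by most drivers), streams rows only as they are needed:

cursor = db.cursor()
cursor.execute("select * from bigtable")
rowstrings = (tabulate(row) for row in cursor)   # rows are fetched lazily by the driver
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()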
If you really need to store the entire resultset, you can modify the code slightly:
cursor = db.cursor()
cursor.execute("select * from bigtable")
collection = deque(cursor.fetchall())
cursor.close()
rowstrings = (tabulate(row) for row in collection)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
Now we gather all the results into collection first, which remains entirely in memory for the entire program run.
However, we can duplicate your approach of deleting collection items as they are used. We can keep the same "code shape" by creating a generator that empties its source collection as it works. It would look something like this:
def drain(coll):
    """Return an iterable that deletes items from coll as it yields them.

    coll must support `coll.popleft()`, `coll.pop(0)` or `del coll[0]`.
    A deque is recommended!
    """
    if hasattr(coll, 'popleft'):
        # deque: popping from the left is O(1)
        def pop(coll):
            try:
                return coll.popleft()
            except IndexError:
                raise StopIteration
    elif hasattr(coll, 'pop'):
        # list: pop(0) works, but shifts the remaining elements each time
        def pop(coll):
            try:
                return coll.pop(0)
            except IndexError:
                raise StopIteration
    else:
        def pop(coll):
            try:
                item = coll[0]
            except IndexError:
                raise StopIteration
            del coll[0]
            return item
    while True:
        yield pop(coll)
Now you can substitute drain(collection) for collection anywhere you want to free up memory as you go. After drain(collection) is exhausted, the collection object will be empty.
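Putting it together, the deque version above becomes something like this (a sketch reusing the db and file_like_obj names assumed in the earlier snippets):

cursor = db.cursor()
cursor.execute("select * from bigtable")
collection = deque(cursor.fetchall())
cursor.close()

rowstrings = (tabulate(row) for row in drain(collection))
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
# collection is now empty; the memory its rows occupied can be reclaimed
# incrementally as they are written, instead of all at once at the end.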