Python efficiency and large objects in memory
I have multiple processes, each dealing with lists of 40,000 tuples. This just about maxes out the memory available on the machine. If I do this:
while len(collection) > 0:
    row = collection.pop(0)
    row_count = row_count + 1
    new_row = []
    for value in row:
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ""
        # escape naughty characters
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    new_row = "\t".join(new_row)
    rows += "\n" + new_row
    if row_count % 5000 == 0:
        gc.collect()
does this free up any more memory?
Since collection is shrinking at the same rate that rows is growing, your memory usage will remain stable. The gc.collect() call is not going to make much of a difference.
Memory management in CPython is subtle. Just because you remove references and run a collection cycle doesn't mean that the memory is returned to the OS. See this answer for details.
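As a rough illustration (a minimal sketch, assuming the third-party psutil package, which the original code does not use): even after deleting a large object and collecting, the resident set size the OS reports for the process frequently does not drop back to where it started.

import gc
import psutil

proc = psutil.Process()

big = [str(i) * 10 for i in range(1000000)]   # roughly tens of MB of small strings
print(proc.memory_info().rss)                 # RSS with the list alive

del big
gc.collect()
print(proc.memory_info().rss)                 # often still well above the starting size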
To really save on memory, you should structure this code around generators and iterators instead of large lists of items. I'm surprised you're having connection timeouts, because fetching all the rows should not take much more time than fetching a row at a time and performing the simple processing you are doing. Perhaps we should have a look at your DB-fetching code?
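A minimal sketch (not from the original post) of the difference: a list comprehension holds all 40,000 items at once, while a generator expression keeps only one item in flight at a time.

import sys

eager = [(i, str(i)) for i in range(40000)]    # every tuple built and stored up front
lazy = ((i, str(i)) for i in range(40000))     # tuples are built only as they are consumed

print(sys.getsizeof(eager))   # a few hundred kilobytes for the list object alone
print(sys.getsizeof(lazy))    # a few hundred bytes at most, regardless of item count

for item in lazy:             # items are produced one at a time and can be
    pass                      # garbage-collected as soon as they are consumed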
If row-at-a-time processing is not a possibility, then at least keep your data as an immutable deque and perform all processing on it with generators and iterators.
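As a rough illustration of why (a small timing sketch, not part of the original code): list.pop(0) has to shift every remaining element, so draining a list from the front is O(n) per pop, while deque.popleft() is O(1).

import timeit

setup = "from collections import deque; data = list(range(40000))"
print(timeit.timeit("c = list(data)\nwhile c: c.pop(0)", setup=setup, number=1))     # noticeably slow
print(timeit.timeit("c = deque(data)\nwhile c: c.popleft()", setup=setup, number=1)) # milliseconds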
I'll outline these different approaches below.
First of all, some common functions:
# If you don't need random access to elements in a sequence,
# a deque uses less memory and has faster appends and deletes
# from both the front and the back.
from collections import deque
from itertools import izip, repeat, islice, chain
import re

re_redshift_chars = re.compile(r'[abcdefg]')

def istrjoin(sep, seq):
    """Return a generator that acts like sep.join(seq), but lazily.

    The separator will be yielded separately.
    """
    return islice(chain.from_iterable(izip(repeat(sep), seq)), 1, None)

def escape_redshift(s):
    return re_redshift_chars.sub(r'\\\g<0>', s)

def tabulate(row):
    return "\t".join(escape_redshift(str(v)) if v is not None else '' for v in row)
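A quick sanity check of those helpers, with hypothetical sample values that are not from the original post:

row = ("abc", None, 123)
print(tabulate(row))                              # -> '\a\b\c' + tab + '' + tab + '123'
print("".join(istrjoin(", ", ["x", "y", "z"])))   # -> 'x, y, z'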
Now for the ideal of row-at-a-time processing, it would be something like this:
cursor = db.cursor()
cursor.execute("""select * from bigtable""")
rowstrings = (tabulate(row) for row in cursor.fetchall())
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()
This will take the least possible amount of memory -- only a row at a time.
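One caveat: cursor.fetchall() still builds the complete list of rows inside the driver before the generator ever sees them. A hedged variant, assuming your DB-API driver supports iterating the cursor directly (optional in the DB-API spec, but provided by most drivers), streams rows only as they are needed:

cursor = db.cursor()
cursor.execute("select * from bigtable")
rowstrings = (tabulate(row) for row in cursor)   # rows are fetched lazily by the driver
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()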
If you really need to store the entire resultset, you can modify the code slightly:
cursor = db.cursor()
cursor.execute("select * from bigtable")
collection = deque(cursor.fetchall())
cursor.close()
rowstrings = (tabulate(row) for row in collection)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
Now we gather all the results into collection first, which remains entirely in memory for the entire program run.
However, we can duplicate your approach of deleting collection items as they are used. We can keep the same "code shape" by creating a generator that empties its source collection as it works. It would look something like this:
def drain(coll):
    """Return an iterable that deletes items from coll as it yields them.

    coll must support `coll.popleft()`, `coll.pop(0)` or `del coll[0]`.
    A deque is recommended!
    """
    if hasattr(coll, 'popleft'):
        # deque: popping from the left is O(1)
        def pop(coll):
            try:
                return coll.popleft()
            except IndexError:
                raise StopIteration
    elif hasattr(coll, 'pop'):
        # list: pop(0) works, but shifts the remaining elements each time
        def pop(coll):
            try:
                return coll.pop(0)
            except IndexError:
                raise StopIteration
    else:
        def pop(coll):
            try:
                item = coll[0]
            except IndexError:
                raise StopIteration
            del coll[0]
            return item
    while True:
        yield pop(coll)
Now you can substitute drain(collection) for collection anywhere you want to free up memory as you go. After drain(collection) is exhausted, the collection object will be empty.
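Putting it together, the deque version above becomes something like this (a sketch reusing the db and file_like_obj names assumed in the earlier snippets):

cursor = db.cursor()
cursor.execute("select * from bigtable")
collection = deque(cursor.fetchall())
cursor.close()

rowstrings = (tabulate(row) for row in drain(collection))
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
# collection is now empty; the memory its rows occupied can be reclaimed
# incrementally as they are written, instead of all at once at the end.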