hadoop - How to optimize this MapReduce function, Python, mrjob


I'm new to map/reduce principles and the Python mrjob framework. I wrote this sample code and it works fine, but I would like to know what I can change in it to make it "perfect" / more efficient.

    from mrjob.job import MRJob
    import operator
    import re

    # append the result from each reducer
    output_words = []

    class MRSudo(MRJob):

        def init_mapper(self):
            # move the list of tuples across the mapper
            self.words = []

        def mapper(self, _, line):
            command = line.split()[-1]
            self.words.append((command, 1))

        def final_mapper(self):
            for word_pair in self.words:
                yield word_pair

        def reducer(self, command, count):
            # append tuples to the list
            output_words.append((command, sum(count)))

        def final_reducer(self):
            # sort the tuples in the list by occurrence
            map(operator.itemgetter(1), output_words)
            sorted_words = sorted(output_words, key=operator.itemgetter(1), reverse=True)
            for result in sorted_words:
                yield result

        def steps(self):
            # self.mr() is the older mrjob step API; later releases use mrjob.step.MRStep
            return [self.mr(mapper_init=self.init_mapper,
                            mapper=self.mapper,
                            mapper_final=self.final_mapper,
                            reducer=self.reducer,
                            reducer_final=self.final_reducer)]

    if __name__ == '__main__':
        MRSudo.run()

There are two ways you can follow.

1. Improve your process

You are doing a distributed word count. This operation is algebraic, but you are not taking advantage of that property.

For every word of your input you are sending a record to the reducers. These bytes have to be partitioned, sent over the network and then sorted by the reducer. This is neither efficient nor scalable: the amount of data sent from the mappers to the reducers is usually the bottleneck.

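You should add a combiner to your job. It does the same thing as your current reducer: it sums the counts for each key. The combiner runs after the mapper, in the same address space. This means the amount of data you send over the network is no longer linear in the number of words of your input, but bounded by the number of unique words each mapper sees. That is usually several orders of magnitude lower.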
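To make the effect concrete, here is a minimal sketch in plain Python, not a real Hadoop or mrjob run. The functions use the same (key, values) signatures as mrjob's mapper, combiner and reducer methods, so in an actual job they would become methods of the MRJob subclass. The `group` and `run` helpers and the two sample splits are made up for the illustration; they only stand in for the framework's shuffle and for two mappers' inputs.

```python
from collections import defaultdict

def mapper(_, line):
    # Emit (command, 1) for the last token of each line, as in the question.
    tokens = line.split()
    if tokens:
        yield tokens[-1], 1

def combiner(command, counts):
    # Runs on one mapper's local output: collapse its records per key.
    yield command, sum(counts)

def reducer(command, counts):
    # Same logic as the combiner, applied to the merged partial sums.
    yield command, sum(counts)

def group(pairs):
    # Hypothetical helper: group values by key, standing in for the shuffle.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run(splits, use_combiner):
    # Hypothetical driver: each split plays the role of one mapper's input.
    shuffled = []
    for lines in splits:
        local = [pair for line in lines for pair in mapper(None, line)]
        if use_combiner:
            local = [out for key, values in group(local).items()
                     for out in combiner(key, values)]
        shuffled.extend(local)  # records that would cross the network
    totals = {}
    for key, values in group(shuffled).items():
        for k, total in reducer(key, values):
            totals[k] = total
    return totals, len(shuffled)

# Made-up sample input: two splits of log lines.
splits = [
    ["sudo ls", "sudo ls", "sudo cat", "sudo ls"],
    ["sudo cat", "sudo ls", "sudo cat"],
]
plain_totals, plain_records = run(splits, use_combiner=False)
combined_totals, combined_records = run(splits, use_combiner=True)
print(plain_totals == combined_totals, plain_records, combined_records)
# prints: True 7 4
```

Both runs give the same totals, but without the combiner one record per input word reaches the shuffle, while with it each mapper sends at most one record per distinct command.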

Since the distributed word count example is overused, you will easily find more information by searching for "distributed word count combiner". All algebraic operations should have a combiner.
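As a rough illustration of what "algebraic" means here: the result can be rebuilt from small partial results computed on each chunk of the input. The sketch below is plain Python with made-up numbers and hypothetical helper names (`partial_mean`, `merge`). A count or sum is combinable as-is, while a mean is only combinable if each chunk carries a (sum, count) pair.

```python
def partial_mean(values):
    # Combiner-style step: reduce one chunk to a (sum, count) pair.
    return sum(values), len(values)

def merge(partials):
    # Reducer-style step: merge the pairs, then finish the computation.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Made-up data: two chunks, as if handled by two mappers.
chunks = [[1, 2, 3], [10]]
flat = [v for chunk in chunks for v in chunk]

# A sum of partial sums equals the full sum, so counting needs no extra state.
sum_of_partials = sum(sum(chunk) for chunk in chunks)

# A mean of per-chunk means is wrong in general ...
naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)
# ... but merging (sum, count) pairs gives the exact mean.
exact = merge([partial_mean(c) for c in chunks])

print(sum_of_partials == sum(flat), naive, exact)
# prints: True 6.0 4.0
```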

2. Use more efficient tools

mrjob is a great tool to quickly write map reduce jobs. It is usually faster to write a Python job than a Java one. However, it has a runtime cost:

  1. Python is usually slower than Java
  2. mrjob is slower than most of the Python frameworks because it does not, yet, use typedbytes

You have to decide if it is worth rewriting some of your jobs in Java using the regular API. If you are writing long-lived batch jobs, it could make sense to invest some development time to decrease the runtime costs.

In the long term, writing a Java job is usually not much longer than writing it in Python. But you have to make some upfront investments: create a project with a build system, package it, deploy it, etc. With mrjob you just have to execute your Python text file.

Cloudera did a benchmark of the Hadoop Python frameworks a few months ago. mrjob was way slower than their Java jobs (5 to 7 times). mrjob performance should improve when typedbytes becomes available, but Java jobs will still be 2 to 3 times faster.

