python - Stratified sampling in NumPy


In NumPy I have a dataset like this. The first two columns are indices. I can divide the dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2, and so on and so forth. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset along these blocks 80%-20% randomly, such that after the split each block has at least one element in both datasets. How can I do that?

    indices | real data
            |
    0   0   | 43.25 665.32 ...  } 1st block
    0   0   | 11.234            }
    0   1     ...               } 2nd block
    0   1                       }
    0   2                       } 3rd block
    0   2                       }
    1   0                       } 4th block
    1   0                       }
    1   0                       }
    1   1                       ...
    1   1
    1   2
    1   2
    2   0
    2   0
    2   1
    2   1
    2   1
    ...

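See how this performs for you. To introduce randomness, I am shuffling the entire dataset; it is the only way I have figured out to do the splitting in a vectorized way. Maybe you could shuffle an indexing array instead, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset: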

    from __future__ import division
    import numpy as np

    # Create a sample data set with c1 * c2 blocks
    c1, c2 = 10, 5
    idx1, idx2 = np.arange(c1), np.arange(c2)
    idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

    items = 1000
    # Random block assignment for the rows beyond the first 2 per block
    i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
    d = np.random.rand(items+5)

    dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                    ('data', float)])
    # The first 2*c1*c2 rows guarantee at least 2 elements per block
    dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
    dataset['idx1'][2*c1*c2:-5] = idx1[i]
    dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
    dataset['idx2'][2*c1*c2:-5] = idx2[i]
    dataset['data'] = d
    # Add blocks with only 2 and 3 elements to test the corner case
    dataset['idx1'][-5:] = -1
    dataset['idx2'][-5:] = [0] * 2 + [1] * 3

And now the stratified sampling:

    # For randomness, shuffle the entire array
    np.random.shuffle(dataset)

    # inverse[k] is the block number of row k; where groups rows by block
    blocks, inverse = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
    block_count = np.bincount(inverse)
    where = np.argsort(inverse)
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    # If we have n elements in a block, and we assign 1 to each array, we
    # are left with only n-2. If we randomly assign a fraction x of these
    # to the first array, the expected ratio of items will be
    # (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
    # Setting the ratio equal to 4 (80/20) and solving for x, we get
    # x = 4/5 + 3/5/(n-2)

    with np.errstate(divide='ignore'):  # n == 2 divides by zero and gives inf
        x = 4/5 + 3/5/(block_count - 2)
    x = np.clip(x, 0, 1)  # x exceeds 1 for blocks with fewer than 5 items
    threshold = np.repeat(x, block_count)
    threshold[block_start] = 1      # first item of each block goes to a
    threshold[block_start + 1] = 0  # second item of each block goes to b

    a_idx = threshold > np.random.rand(len(dataset))

    a = dataset[where[a_idx]]
    b = dataset[where[~a_idx]]
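The formula in the comments can be checked numerically. The following is only a sketch and not part of the code above: `fraction_to_a` and `expected_ratio` are hypothetical helper names, and the `ratio` parameter is an assumption added so that the 80/20 case (ratio 4) is not hard-coded. Solving `x*(n-2) + 1 = r*((1-x)*(n-2) + 1)` for `x` gives `x = r/(r+1) + (r-1)/((r+1)*(n-2))`, which reduces to `4/5 + 3/5/(n-2)` for `r = 4`.

```python
import numpy as np

def fraction_to_a(n, ratio=4.0):
    """Fraction x of the n-2 unassigned items that should go to the first array."""
    n = np.asarray(n, dtype=float)
    with np.errstate(divide="ignore"):  # n == 2 divides by zero and yields inf
        x = ratio / (ratio + 1) + (ratio - 1) / (ratio + 1) / (n - 2)
    return np.clip(x, 0.0, 1.0)

def expected_ratio(n, ratio=4.0):
    """Expected size ratio first:second for a block of n items."""
    n = np.asarray(n, dtype=float)
    x = fraction_to_a(n, ratio)
    free = n - 2  # items left after one is forced into each array
    return (x * free + 1) / ((1 - x) * free + 1)

sizes = np.array([2, 3, 4, 5, 12, 102])
print(expected_ratio(sizes))
```

For blocks of 2, 3 and 4 items the clipped fraction caps the expected ratio at 1, 2 and 3 respectively; from 5 items upwards the expected ratio is 4, up to rounding. Very small blocks therefore pull the overall split slightly below 80/20.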

Running the sampling code on the sample dataset gives a split of roughly 80/20, with all blocks represented in both arrays:

    >>> len(a)
    815
    >>> len(b)
    190
    >>> np.all(np.unique(a[['idx1', 'idx2']]) == np.unique(b[['idx1', 'idx2']]))
    True
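The alternative mentioned earlier, shuffling an index array instead of the data itself, might look roughly like the sketch below. It is an untested-in-production variant rather than the code above: the function name `stratified_split_indices`, its `ratio` and `seed` parameters, and the use of a plain 2-D integer key array in place of the structured array are all assumptions made for illustration.

```python
import numpy as np

def stratified_split_indices(keys, ratio=4.0, seed=None):
    """Return (a_rows, b_rows), row indices into the original, unshuffled data.

    keys is an (n_rows, 2) integer array holding the two index columns.
    """
    rng = np.random.default_rng(seed)
    keys = np.asarray(keys)
    perm = rng.permutation(len(keys))  # random row order; the data stays put

    _, inverse, counts = np.unique(keys[perm], axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)  # keep it 1-D regardless of NumPy version
    if counts.min() < 2:
        raise ValueError("every block needs at least 2 rows")

    order = np.argsort(inverse, kind="stable")  # shuffled positions, grouped by block
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))

    with np.errstate(divide="ignore"):  # blocks of exactly 2 rows give inf
        x = ratio / (ratio + 1) + (ratio - 1) / (ratio + 1) / (counts - 2)
    threshold = np.repeat(np.clip(x, 0.0, 1.0), counts)
    threshold[starts] = 1.0      # first row of each block always goes to a
    threshold[starts + 1] = 0.0  # second row of each block always goes to b

    to_a = threshold > rng.random(len(keys))
    rows = perm[order]  # map grouped positions back to original row numbers
    return rows[to_a], rows[~to_a]

# Small example: four blocks, two of them tiny
keys = np.array([[0, 0]] * 2 + [[0, 1]] * 3 + [[1, 0]] * 50 + [[1, 1]] * 45)
a_rows, b_rows = stratified_split_indices(keys, seed=0)
print(len(a_rows), len(b_rows))
```

Because the function returns row indices, the original array keeps its order and the two parts are obtained with `data[a_rows]` and `data[b_rows]`.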
