Python - Stratified sampling in NumPy
In NumPy I have a dataset like the one shown below. The first two columns are indices. I can divide the dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2, and so on and so forth. Each block has at least 2 elements. The numbers in the index columns can vary.
I need to split the dataset along these blocks 80%-20% randomly, such that after the split each block has at least 1 element in both datasets. How can I do that?
    indices | real data
            |
    0   0   | 43.25 665.32 ...  } 1st block
    0   0   | 11.234            }
    0   1   ...                 } 2nd block
    0   1                       }
    0   2                       } 3rd block
    0   2                       }
    1   0                       } 4th block
    1   0                       }
    1   0                       }
    1   1   ...
    1   1
    1   2
    1   2
    2   0
    2   0
    2   1
    2   1
    2   1
    ...
See how this does. To introduce randomness, I am shuffling the entire dataset. It is the only way I have figured out to do the splitting in a vectorized way. Maybe you could simply shuffle an indexing array instead, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset:
    from __future__ import division  # only needed on Python 2
    import numpy as np

    # Create a sample data set with c1*c2 blocks
    c1, c2 = 10, 5
    idx1, idx2 = np.arange(c1), np.arange(c2)
    idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

    items = 1000
    # Random block assignment for the rows beyond the guaranteed two per block
    i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
    d = np.random.rand(items+5)

    # np.int and np.float were removed from NumPy; use the builtin types
    dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                    ('data', float)])
    # The first 2*c1*c2 rows guarantee at least 2 elements per block
    dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
    dataset['idx1'][2*c1*c2:-5] = idx1[i]
    dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
    dataset['idx2'][2*c1*c2:-5] = idx2[i]
    dataset['data'] = d
    # Add blocks with only 2 and only 3 elements to test the corner case
    dataset['idx1'][-5:] = -1
    dataset['idx2'][-5:] = [0] * 2 + [1]*3
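As a quick sanity check on data of this shape, the size of every block can be counted with `np.unique`. This is only a minimal sketch on a small hypothetical array (`toy`, `pairs`, `toy_blocks` and `toy_counts` are illustrative names, not part of the dataset above); it stacks the two index fields into a plain 2-D array first, which is an assumption made to keep the example simple.

```python
import numpy as np

# Hypothetical toy data: block (0, 0) has 2 rows, (0, 1) has 3, (1, 0) has 2
toy = np.array(
    [(0, 0, 0.1), (0, 1, 0.2), (0, 0, 0.3), (1, 0, 0.4),
     (0, 1, 0.5), (1, 0, 0.6), (0, 1, 0.7)],
    dtype=[('idx1', int), ('idx2', int), ('data', float)],
)

# Stack the two index fields into an (n_rows, 2) integer array
pairs = np.stack([toy['idx1'], toy['idx2']], axis=1)

# Each unique row of `pairs` is one block; the counts are the block sizes
toy_blocks, toy_counts = np.unique(pairs, axis=0, return_counts=True)

# The split needs at least 2 elements in every block
print(toy_blocks)
print(toy_counts)
print("all blocks have >= 2 elements:", bool(np.all(toy_counts >= 2)))
```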
And now the stratified sampling:
    # For randomness, shuffle the entire array
    np.random.shuffle(dataset)

    # block_id[k] is the index into `blocks` of the block that row k belongs to
    blocks, block_id = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
    block_count = np.bincount(block_id)
    where = np.argsort(block_id)  # row indices grouped by block
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    # If we have n elements in a block, and we assign 1 to each array, we
    # are left with only n-2. If we randomly assign a fraction x of these
    # to the first array, the expected ratio of items will be
    # (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
    # Setting the ratio equal to 4 (80/20) and solving for x, we get
    # x = 4/5 + 3/5/(n-2)

    with np.errstate(divide='ignore'):  # n == 2 divides by zero and gives inf
        x = 4/5 + 3/5/(block_count - 2)
    x = np.clip(x, 0, 1)  # for small blocks (n < 5) x is larger than 1
    threshold = np.repeat(x, block_count)
    threshold[block_start] = 1      # first item goes to a
    threshold[block_start + 1] = 0  # second item goes to b

    a_idx = threshold > np.random.rand(len(dataset))

    a = dataset[where[a_idx]]
    b = dataset[where[~a_idx]]
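The formula for x in the comments above can be checked numerically. The sketch below generalizes it to an arbitrary target ratio; `fraction_to_a` is a hypothetical helper written for this check, not something the sampling code uses, and it is only meaningful for n > 2.

```python
def fraction_to_a(n, ratio=4.0):
    """Fraction x of the n-2 free items to send to the first array so that
    the expected split is ratio:1. Hypothetical helper, valid for n > 2."""
    # Solving (x*(n-2) + 1) / ((1-x)*(n-2) + 1) = ratio for x
    return (ratio * (n - 1) - 1) / ((ratio + 1) * (n - 2))


for n in (5, 10, 50):
    x = fraction_to_a(n)
    expected_a = x * (n - 2) + 1        # one guaranteed item plus the random share
    expected_b = (1 - x) * (n - 2) + 1
    print(n, x, expected_a, expected_b, expected_a / expected_b)
```

For ratio = 4 the expression reduces to 4/5 + 3/5/(n-2). With n = 10 it gives x = 0.875, i.e. 8 expected items in the first array and 2 in the second. For n = 3 and n = 4 it returns a value above 1, which is why the sampling code clips x to the range [0, 1].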
After running the stratified sampling code, the split is roughly 80/20, and all blocks are represented in both arrays:
    >>> len(a)
    815
    >>> len(b)
    190
    >>> np.all(np.unique(a[['idx1', 'idx2']]) == np.unique(b[['idx1', 'idx2']]))
    True
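On the idea of shuffling an indexing array rather than the data itself: here is a rough sketch of that variant. Note that it is a different, simpler technique than the vectorized threshold approach above: it loops over the blocks and takes a rounded 80% of each one, so the dataset order is never touched. `stratified_split_indices` and its parameters are hypothetical names, and it assumes you already have one integer block id per row (for example the inverse array returned by `np.unique`).

```python
import numpy as np


def stratified_split_indices(block_id, frac_a=0.8, rng=None):
    """Return (a_idx, b_idx), row indices for an approximate
    frac_a : (1 - frac_a) split in which every block contributes at
    least one row to each side. Hypothetical helper, loops over blocks."""
    rng = np.random.default_rng() if rng is None else rng
    block_id = np.asarray(block_id)
    a_parts, b_parts = [], []
    for blk in np.unique(block_id):
        # Shuffle the row indices of this block, not the data
        rows = rng.permutation(np.flatnonzero(block_id == blk))
        if len(rows) < 2:
            raise ValueError("every block needs at least 2 rows")
        # Round the target size, then force 1 <= n_a <= len(rows) - 1
        n_a = int(round(frac_a * len(rows)))
        n_a = min(max(n_a, 1), len(rows) - 1)
        a_parts.append(rows[:n_a])
        b_parts.append(rows[n_a:])
    return np.concatenate(a_parts), np.concatenate(b_parts)


# Four hypothetical blocks with 2, 3, 10 and 35 rows
block_id = np.repeat(np.arange(4), [2, 3, 10, 35])
a_idx, b_idx = stratified_split_indices(block_id, rng=np.random.default_rng(0))
print(len(a_idx), len(b_idx))
print(np.unique(block_id[a_idx]), np.unique(block_id[b_idx]))
```

The two index arrays can then be used as `dataset[a_idx]` and `dataset[b_idx]`. The per-block loop is slower than the vectorized version when there are many blocks, but the per-block sizes are deterministic instead of only correct in expectation.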