sorting - keep all lines with >2 duplicates, based on one column
I have a file with millions of lines in the following format:
sn608 vb050 1 1113 1699.50 2339.90 0 1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacccgtcaattttttaaaaaaaacccccggggggtagtttgttaaaa a[_ceeeegggfgiiigecccccccccccccccccccccbcccccccccccttehgghhgjhgjsrgeggjy]]]tx[[[xeeox[eeeggjggggjs] 1

I have sorted it on column 9 using sort -k9. Column 9 is a 100-letter string, though entries may contain periods. I want to remove the lines whose 100-letter string in column 9 occurs only 1 or 2 times (fewer than 3 times), and save the remaining lines to a file.
I have played around with uniq (-d -f9 -w100) and sort, and I suspect awk would be helpful, but I'm too much of a novice to figure it out.
This stores the first and second occurrences of each column-9 string in outfile, and every occurrence after the second in dups (a[$9] counts how many times each string has been seen so far; the trailing 1 prints any line not redirected to dups):

awk '++a[$9]>2{print $0>"dups";next}1' file > outfile
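Note that the one-liner above splits lines by occurrence index, not by total count: the first two occurrences of a string that ultimately appears three or more times still land in outfile. If the goal is to keep exactly the lines whose column-9 string occurs at least 3 times in the whole file, a two-pass approach works; below is a minimal sketch, assuming the input fits on disk twice-readable and reusing the filenames file and outfile from the question:

```shell
# Pass 1 (NR==FNR is true only while reading the first copy of the file):
# count how often each column-9 string occurs.
# Pass 2: print only lines whose string occurred more than twice.
awk 'NR==FNR { count[$9]++; next } count[$9] > 2' file file > outfile
```

This reads the file twice but needs no prior sorting, and it keeps all occurrences (including the first and second) of any string that appears 3+ times.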