sorting - keep all lines with >2 duplicates, based on one column


I have a file with millions of lines in the following format:

sn608   vb050   1       1113    1699.50 2339.90 0       1       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacccgtcaattttttaaaaaaaacccccggggggtagtttgttaaaa   a[_ceeeegggfgiiigecccccccccccccccccccccbcccccccccccttehgghhgjhgjsrgeggjy]]]tx[[[xeeox[eeeggjggggjs]  1 

I have sorted on column 9 using sort -k9. Column 9 is a 100-letter string, though entries may contain periods. I want to remove the lines whose 100-letter string in column 9 occurs only 1 or 2 times (i.e. <3 times), and save the remaining lines to a file.

I have played around with uniq (-d -f9 -w100) and sort, and I suspect awk would be helpful, but I am too much of a novice to figure it out.

This stores the first and second occurrences of each column-9 string in outfile and all further occurrences in dups:

awk '++a[$9]>2{print $0>"dups";next}1' file > outfile 
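Note that the one-liner above splits occurrences of the same string between the two files. If the goal is to keep every line whose column-9 string appears at least 3 times, a two-pass awk is a minimal sketch (assuming the file can be read twice; the filenames `file` and `outfile` are placeholders):

```shell
# First pass (NR==FNR): count occurrences of each column-9 string.
# Second pass: print only lines whose string appeared 3 or more times,
# so all occurrences of frequent strings stay together in outfile.
awk 'NR==FNR { count[$9]++; next } count[$9] >= 3' file file > outfile
```

The same file is passed twice on purpose: `NR==FNR` is true only while awk reads the first copy, which builds the counts before any line is printed.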
