sorting - keep all lines with >2 duplicates, based on one column
I have a file with millions of lines in the following format:
sn608 vb050 1 1113 1699.50 2339.90 0 1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacccgtcaattttttaaaaaaaacccccggggggtagtttgttaaaa a[_ceeeegggfgiiigecccccccccccccccccccccbcccccccccccttehgghhgjhgjsrgeggjy]]]tx[[[xeeox[eeeggjggggjs] 1

I have sorted it on column 9 using sort -k9. Column 9 is a 100-letter string, though entries may contain periods. I want to remove the lines whose 100-letter string in column 9 occurs only 1 or 2 times (fewer than 3 times), and save the remaining lines to a file.
I have played around with uniq (-d -f9 -w100) and sort, and I suspect awk would be helpful, but I'm too much of a novice to figure it out.
This stores the first and second occurrences of each column-9 string in outfile, and every occurrence after the second in dups (a[$9] counts how many times each string has been seen so far; the trailing 1 prints any line not redirected to dups):

awk '++a[$9]>2{print $0>"dups";next}1' file > outfile
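Note that the one-liner above splits lines by occurrence index, not by total count: the first two occurrences of a string that ultimately appears three or more times still land in outfile. If the goal is to keep exactly the lines whose column-9 string occurs at least 3 times in the whole file, a two-pass approach works; below is a minimal sketch, assuming the input fits on disk twice-readable and reusing the filenames file and outfile from the question:

```shell
# Pass 1 (NR==FNR is true only while reading the first copy of the file):
# count how often each column-9 string occurs.
# Pass 2: print only lines whose string occurred more than twice.
awk 'NR==FNR { count[$9]++; next } count[$9] > 2' file file > outfile
```

This reads the file twice but needs no prior sorting, and it keeps all occurrences (including the first and second) of any string that appears 3+ times.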