sorting - keep all lines with >2 duplicates, based on one column

I have a file with millions of lines in the following format:
sn608 vb050 1 1113 1699.50 2339.90 0 1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacccgtcaattttttaaaaaaaacccccggggggtagtttgttaaaa a[_ceeeegggfgiiigecccccccccccccccccccccbcccccccccccttehgghhgjhgjsrgeggjy]]]tx[[[xeeox[eeeggjggggjs] 1
I have sorted the file on column 9 using sort -k9. Column 9 is a 100-letter string, though entries may also contain periods. I want to remove the lines whose 100-letter string in column 9 occurs fewer than 3 times (i.e. only once or twice), and save the remaining lines to a file.
I have played around with uniq (-d -f9 -w100) and with sort, and I suspect awk would be helpful, but I'm too much of a novice to figure it out.
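One note on the uniq attempt: uniq -f N skips the first N fields before comparing, so comparing on column 9 would need -f8 rather than -f9. As a sketch of how to inspect the data (assuming the placeholder filename file from the question), the per-string counts can be obtained with a small pipeline:

```shell
# Print each column-9 string that occurs more than twice.
# "file" is the placeholder input name from the question.
awk '{ print $9 }' file | sort | uniq -c | awk '$1 > 2 { print $2 }'
```

This only lists the qualifying strings; keeping the full lines needs a second step, e.g. grep -F -f on that list or the awk approach below in the answer.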
This stores the first and second match of each column-9 string in outfile, and all further matches in dups:
awk '++a[$9]>2{print $0>"dups";next}1' file > outfile
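If the goal is instead to keep every line whose column-9 string occurs three or more times, a two-pass awk (reading the file twice; file and outfile are the placeholder names from the question) is a minimal sketch, and it does not even require the input to be sorted:

```shell
# Pass 1 (NR==FNR, i.e. still reading the first copy of the file):
# count occurrences of each column-9 string.
# Pass 2: print only lines whose string was seen more than twice.
awk 'NR==FNR { count[$9]++; next } count[$9] > 2' file file > outfile
```

The trade-off against the single-pass version is that the file is read twice, but the output preserves the original line order and no dups file is needed.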