python - Removing duplicated lines from a txt file
I am processing large text files (~20 MB) containing data delimited one entry per line. Some entries are duplicated, and I want to remove the duplications, keeping one copy of each.

Also, to make the problem more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.
e.g. I need to go from this:

    bob 123 1db
    jim 456 3db ax
    dave 789 1db
    bob 123 1db
    jim 456 3db ax
    dave 789 1db
    bob 123 1db extra bits

to this:

    jim 456 3db ax
    dave 789 1db
    bob 123 1db extra bits

NB: the final order doesn't matter.
What is an efficient way to do this?

I can use awk, Python, or standard Linux command-line tools.

Thanks.
How about the following (in Python):
    prev = None
    for line in sorted(open('file')):
        line = line.strip()
        # After sorting, a duplicate or an extended version of an entry
        # lands immediately after it, so only print an entry once the
        # next line no longer starts with it.
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
    # Don't forget the last entry.
    if prev is not None:
        print(prev)
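For example, with the sample data above saved as file and the script saved as dedupe.py (a name assumed here for illustration), this prints each entry once, preferring the version with the appended info:

    $ python3 dedupe.py
    bob 123 1db extra bits
    dave 789 1db
    jim 456 3db ax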
If you find memory usage is an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
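A minimal sketch of that variant, assuming the streaming version is saved as dedupe_stream.py (a made-up name): sort handles the disk-based sorting, and the script only ever holds the current and previous lines in memory.

    import sys

    # Expects input that is already sorted; reads it line by line from
    # stdin so memory use stays constant regardless of file size.
    prev = None
    for line in sys.stdin:
        line = line.strip()
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
    if prev is not None:
        print(prev)

Run it as:

    $ LC_ALL=C sort file | python3 dedupe_stream.py

LC_ALL=C makes sort use plain byte ordering, so a line and any longer lines extending it stay adjacent, matching the startswith comparison.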