python - Removing duplicated lines from a txt file -


I'm processing large text files (~20 MB), each containing line-delimited data entries. Some entries are duplicated and I want to remove the duplications, keeping only one copy of each.

Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.

e.g. I need to go from this:

 bob 123 1db
 jim 456 3db ax
 dave 789 1db
 bob 123 1db
 jim 456 3db ax
 dave 789 1db
 bob 123 1db bits
to this:
 jim 456 3db ax
 dave 789 1db
 bob 123 1db bits
NB: the final order doesn't matter.

What is an efficient way to do this?

I can use awk, Python, or a standard Linux command-line tool.

Thanks.

How about the following (in Python):

 # Sort the lines so duplicates (and longer versions of an entry) are
 # adjacent, then keep only the last line of each startswith-run.
 prev = None
 for line in sorted(open('file')):
     line = line.strip()
     if prev is not None and not line.startswith(prev):
         print(prev)
     prev = line
 if prev is not None:
     print(prev)

If you find memory usage is an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
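For the externally-sorted variant, the same startswith check works as a stream, holding only one line in memory at a time. A minimal sketch, assuming the input is already sorted so that an entry with extra info appended sorts immediately after its shorter duplicate (the helper name `dedupe_sorted` and the filename `sorted.txt` are just placeholders):

```python
def dedupe_sorted(lines):
    """Yield each entry once, keeping the longest version.

    Assumes `lines` is an iterable of already-sorted strings, e.g.
    'bob 123 1db' followed by 'bob 123 1db bits'.
    """
    prev = None
    for line in lines:
        line = line.strip()
        # A new entry starts when the current line no longer extends
        # the previous one, so emit the previous (longest) version.
        if prev is not None and not line.startswith(prev):
            yield prev
        prev = line
    if prev is not None:
        yield prev

# Usage: stream a file pre-sorted with Unix sort, never loading it whole.
# with open('sorted.txt') as f:   # placeholder filename
#     for entry in dedupe_sorted(f):
#         print(entry)
```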

