python - Removing duplicated lines from a txt file
I am processing large text files (~20 MB) containing data delimited one entry per line. Some entries are duplicated, and I want to remove the duplications, keeping one copy of each.

Also, to make the problem more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.
e.g. I need to go from this:

    bob 123 1db
    jim 456 3db ax
    dave 789 1db
    bob 123 1db
    jim 456 3db ax
    dave 789 1db
    bob 123 1db extra bits

to this:

    jim 456 3db ax
    dave 789 1db
    bob 123 1db extra bits

NB: the final order doesn't matter.
What is an efficient way to do this?

I can use awk, Python, or standard Linux command-line tools.

Thanks.
How about the following (in Python):
    prev = None
    for line in sorted(open('file')):
        line = line.strip()
        # After sorting, a duplicate or an extended version of an entry
        # lands immediately after it, so only print an entry once the
        # next line no longer starts with it.
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
    # Don't forget the last entry.
    if prev is not None:
        print(prev)
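For example, with the sample data above saved as file and the script saved as dedupe.py (a name assumed here for illustration), this prints each entry once, preferring the version with the appended info:

    $ python3 dedupe.py
    bob 123 1db extra bits
    dave 789 1db
    jim 456 3db ax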
If you find memory usage is an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
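A minimal sketch of that variant, assuming the streaming version is saved as dedupe_stream.py (a made-up name): sort handles the disk-based sorting, and the script only ever holds the current and previous lines in memory.

    import sys

    # Expects input that is already sorted; reads it line by line from
    # stdin so memory use stays constant regardless of file size.
    prev = None
    for line in sys.stdin:
        line = line.strip()
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
    if prev is not None:
        print(prev)

Run it as:

    $ LC_ALL=C sort file | python3 dedupe_stream.py

LC_ALL=C makes sort use plain byte ordering, so a line and any longer lines extending it stay adjacent, matching the startswith comparison.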