Objects from memory as input for Hadoop/MapReduce?


I am working on parallelizing an algorithm that does the following:

  1. Read several text documents with a total of 10k words.
  2. Create an object for every word in the text corpus.
  3. Create a pair between every two word-objects (yes, O(n²)), and return the most frequent pairs.

I would parallelize step 3 by creating the pairs between the first 1000 word-objects and the rest on the first machine, the pairs between the second 1000 word-objects and the rest on the second machine, and so on.
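For reference, step 3 in its plain sequential form might look like the following minimal sketch (the class and method names are my own, not from the question):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PairCounter {
        // Pair every word with every later word and count each pairing.
        // The nested loop is what makes this step O(n^2) and worth
        // distributing across machines.
        public static Map<String, Integer> countPairs(List<String> words) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    counts.merge(words.get(i) + "," + words.get(j), 1, Integer::sum);
                }
            }
            return counts;
        }
    }

Splitting on the outer index i (the first 1000 words on one machine, the next 1000 on the next, and so on) gives exactly the partitioning described above.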

My question is: how can I pass the objects created in step 2 to the mapper? As far as I am aware, this would require input files, and hence I would need to serialize the objects (though I haven't worked with serialization before). Is there a direct way to pass the objects to the mapper?

Thanks in advance for your help.

Evgeni

Update: Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading the data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/
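In case it helps other readers, here is a minimal sketch of the java.io.Serializable route; the WordObject class and its fields are made up for illustration (in practice, Hadoop's own Writable interface is the more idiomatic way to move objects through MapReduce):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Hypothetical word object from step 2; the fields are illustrative.
    class WordObject implements Serializable {
        private static final long serialVersionUID = 1L;
        final String word;
        final int position;

        WordObject(String word, int position) {
            this.word = word;
            this.position = position;
        }
    }

    public class SerializeWords {
        public static void main(String[] args) throws IOException {
            // Write the word objects to a local file; the file can then be
            // copied to HDFS and deserialized again inside the mapper.
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("words.ser"))) {
                out.writeObject(new WordObject("hadoop", 0));
                out.writeObject(new WordObject("mapreduce", 1));
            }
        }
    }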

How about parallelizing all of the steps? Use your #1 text documents as input to your mapper. Create an object for every word in the mapper. In the mapper, the key-value pair would be the word-object pair (or object-word, depending on what you are doing). The reducer can then count the unique pairs.

Hadoop will take care of bringing all identical keys together into the same reducer.
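A minimal sketch of that pipeline with the org.apache.hadoop.mapreduce API; treating words that share an input line as a pair is my assumption, and the class names are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: reads raw text lines and emits (pair, 1) for every word pair.
    public class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().trim().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                for (int j = i + 1; j < words.length; j++) {
                    pair.set(words[i] + "," + words[j]);
                    context.write(pair, ONE);
                }
            }
        }
    }

    // Reducer: Hadoop groups identical pair keys, so the frequency of a
    // pair is just the sum of its ones.
    class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

With this layout there is no need to ship the step-2 objects to the mappers at all: each mapper builds what it needs from the raw text it is given.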

