Objects from memory as input for Hadoop/MapReduce?
I am working on parallelizing an algorithm that roughly does the following:
- Read several text documents with a total of 10k words.
- Create an object for every word in the text corpus.
- Create a pair between all word-objects (yes, O(n²)) and return the most frequent pairs.
I would like to parallelize the third step by creating the pairs between the first 1000 word-objects and the rest on the first machine, between the second 1000 word-objects and the rest on the next machine, and so on.
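The block-wise split described above can be sketched in plain Java without any Hadoop API. The class name `PairBlocks` and the method `pairsForBlock` are hypothetical, and the sketch assumes that "the rest" means every word-object with a higher index, so that each unordered pair is created exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists the index pairs one machine would build.
class PairBlocks {
    /**
     * Returns all index pairs (i, j) with blockStart <= i < blockEnd and i < j < n.
     * Assumption: pairing each object only with higher indices avoids duplicate pairs.
     */
    static List<int[]> pairsForBlock(int blockStart, int blockEnd, int n) {
        List<int[]> pairs = new ArrayList<>();
        int end = Math.min(blockEnd, n); // the last block may be shorter
        for (int i = blockStart; i < end; i++) {
            for (int j = i + 1; j < n; j++) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int n = 10;        // stands in for the 10k word-objects
        int blockSize = 4; // stands in for the 1000 objects per machine
        long total = 0;
        for (int start = 0; start < n; start += blockSize) {
            int count = pairsForBlock(start, start + blockSize, n).size();
            System.out.println("block starting at " + start + ": " + count + " pairs");
            total += count;
        }
        System.out.println("total: " + total);
    }
}
```

With these toy numbers the three blocks produce 30, 14, and 1 pairs, 45 in total, which is 10 * 9 / 2. Under this assumption the earlier blocks do more work than the later ones, so equally sized blocks would not balance the load across machines.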
My question is how to pass the objects created in the second step to the mapper. As far as I am aware, I would require input files for this and hence would need to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the mapper?
Thanks in advance for the help,
Evgeni
Update: Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/
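For readers who have not used java.io.Serializable before, a minimal round trip might look like the sketch below. The `WordObject` class and its fields are hypothetical stand-ins for the word-objects in the question, and the bytes are kept in memory here rather than written to an input file. Hadoop also has its own `Writable` interface for keys and values, which this sketch does not use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical word object; the real class in the question is not shown.
class WordObject implements Serializable {
    private static final long serialVersionUID = 1L;
    final String text;
    final int documentId;

    WordObject(String text, int documentId) {
        this.text = text;
        this.documentId = documentId;
    }
}

class WordObjectSerialization {
    /** Serializes one object to a byte array using standard Java serialization. */
    static byte[] toBytes(WordObject word) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(word);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds the object from bytes produced by toBytes. */
    static WordObject fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (WordObject) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        WordObject copy = fromBytes(toBytes(new WordObject("hadoop", 3)));
        System.out.println(copy.text + " from document " + copy.documentId);
    }
}
```

To produce an input file instead, the same `ObjectOutputStream` could wrap a `FileOutputStream`; how that file is then split and read by the mappers depends on the input format used for the job.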
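How about parallelizing all the steps? Use the text documents from step 1 as input to your mapper. Create the object for every word in the mapper. In the mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The reducer can then count the unique pairs.
Hadoop will take care of bringing all the same keys together into the same reducer.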
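The flow of that suggestion can be imitated in plain Java, without the Hadoop API, to see what each phase contributes. This is only a single-process sketch under stated assumptions: the class `PairCountSketch` and its methods are hypothetical, a "pair" is taken to mean two words occurring in the same document, and the pair itself is used as the key so that the reduce step can count it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of map, shuffle, and reduce; not Hadoop code.
class PairCountSketch {
    /** "Map" phase: emit (pairKey, 1) for every two word positions in one document. */
    static List<Map.Entry<String, Integer>> map(String document) {
        String[] words = document.trim().toLowerCase().split("\\s+");
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                String a = words[i];
                String b = words[j];
                // Order the two words so that (a, b) and (b, a) share one key.
                String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                emitted.add(new SimpleEntry<>(key, 1));
            }
        }
        return emitted;
    }

    /** "Shuffle" phase: group all values by key, as Hadoop does between map and reduce. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> emitted) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : emitted) {
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        return grouped;
    }

    /** "Reduce" phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over all documents and returns the count per pair. */
    static Map<String, Integer> countPairs(List<String> documents) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String document : documents) {
            emitted.addAll(map(document));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }
}
```

For the two toy documents "a b c" and "b a", `countPairs` yields a count of 2 for the key `a|b` and 1 each for `a|c` and `b|c`. In a real job the grouping in `shuffle` is the part Hadoop performs itself, which is why no objects need to be passed to the mapper from outside.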