hadoop - What is the alternative to DistributedCache in a MapReduce program?


It appears that the distributed cache comes in handy when you need to make a small amount of frequently used data available to your mapper/reducer. But in some circumstances the data you want to ship to your mappers may be huge, say more than 300 MB. What would you do in such cases? What is the alternative to the distributed cache in such a scenario?

  1. The distributed cache holds several gigabytes by default, so 300 MB is not necessarily a problem. (You can adjust the size in mapred-site.xml.) Getting the 300 MB onto each node can still be worthwhile if your job runs frequently and there is other churn in the cache. See the first sketch after this list.

  2. The second option is to keep your file on HDFS and read it from there inside the job. You can use the org.apache.hadoop.fs.FileSystem API to do this; see the second sketch below.
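
A minimal sketch of option 1, using the (pre-Hadoop-2) DistributedCache API the question asks about. The class names and the HDFS path `/user/me/lookup.dat` are placeholders, not from the original post:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

        public static class CacheMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final Map<String, String> lookup = new HashMap<String, String>();

            @Override
            protected void setup(Context context) throws IOException {
                // Files registered in the driver are copied to each node;
                // read the local copy once per task.
                Path[] cached = DistributedCache.getLocalCacheFiles(
                        context.getConfiguration());
                BufferedReader reader =
                        new BufferedReader(new FileReader(cached[0].toString()));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split("\t", 2);
                        if (parts.length == 2) {
                            lookup.put(parts[0], parts[1]);
                        }
                    }
                } finally {
                    reader.close();
                }
            }
            // map() would then consult the in-memory lookup table.
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "cache example");
            job.setJarByClass(CacheExample.class);
            job.setMapperClass(CacheMapper.class);
            // Register the HDFS file so it is localized on every task node.
            DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"),
                    job.getConfiguration());
            // ... input/output paths and the rest of the job setup ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }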
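
And a minimal sketch of option 2, streaming the file straight from HDFS with the FileSystem API instead of localizing it. Again, the class name and the path `/user/me/big-side-data.dat` are hypothetical:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HdfsSideDataMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);
            // Open the side file directly on HDFS; no per-node copy is made,
            // so every task reads it over the network.
            FSDataInputStream in = fs.open(new Path("/user/me/big-side-data.dat"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // build whatever in-memory structure the mapper needs
                }
            } finally {
                reader.close();
            }
        }
    }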

Which option is better depends on how often your job runs, how much else is in the cache, your map/reduce ratio, and so forth.
