Friday, September 4, 2015

Create WordCount example using mapreduce framework in hadoop using eclipse run on windows

Create WordCount example using mapreduce framework in hadoop using eclipse run on windows
1. Open Eclipse
2. Create new java project and named it as - mapreduce_demo
3. Create Java class with name WordCount.java

How map reduce will work in hadoop
Approach:1
input.txt file having the following data..
(map)
--------------------------------------------------------------------------------------
abc def ghi jkl mno pqr stu vwx    P1- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P2- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P3- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P4- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P5- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P6- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P7- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P8- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P9- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
abc def ghi jkl mno pqr stu vwx    P10- abc 1 def 1 ghi 1 jkl 1 mno-1 pqr 1  stu 1 vwx 1
---------------------------------------------------------------------------------------
map
-----
  processes one line at a time, as provided by the specified TextInputFormat
  emits a key-value pair of < , 1>.


Reducer
----------
The Reducer implementation, via the reduce method just sums up the values, which are the occurence counts for each key  
    emit(eachWord, sum)

4. Paste the below code in the WordCount.java file
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper{

    public void map(LongWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        value.set(itr.nextToken());
        context.write(value, new IntWritable(1));
      }
    }
  }

  public static class Reduce extends Reducer {
   
   
    public void reduce(Text key, Iterable values,
                       Context context) throws IOException, InterruptedException {
     
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
       context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf,"mywordcount");
    
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    
    Path outputPath = new Path(args[1]);
    
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    outputPath.getFileSystem(conf).delete(outputPath);
    
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
5. Set up the build path to avoid errors.
Right Click on the project(mapreduce_demo)->Properties->Java Build Path-> Libraries ->Add External Jars
select jar's from
c:\hadoop\hadoop-2.4.1\share\hadoop\common
 c:\hadoop\hadoop-2.4.1\share\hadoop\common\lib
 c:\hadoop\hadoop-2.4.1\share\hadoop\mapreduce
 c:\hadoop\hadoop-2.4.1\share\hadoop\mapreduce\lib
 
6. Once you completed the above steps, now we need to create jar to run the WordCount.java in hadoop.
7. Create Jar:

Right click on the project(mapreduce_demo)->Export->Jar(Under java)->Click Next->Next->Main Class->
Select WordCount->Click Finish
8. Now we have created jar successfully.
9. Before Executing the jar, need to start the (namenode,datanode,resourcemanager and nodemanager)
10. c:\hadoop\hadoop-2.4.1\sbin>start-dfs.cmd
11. c:\hadoop\hadoop-2.4.1\sbin>start-yarn.cmd
12. Check whether any inut files available in the hdfs system already if not create the same.

Hadoop basic commands:


c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -ls input (If it is not created) then use below command to create directory
c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -mkdir input
Copy any text file into input directory of hdfs
c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -copyFromLocal input_file.txt input
Verify file has been copied to hdfs or not using below command
c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -ls input
Verift the data of the file which you copied into hdfs
c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -cat input.
 
13. Once above steps's done. Now run the mapreduce program using the following command
c:\hadoop\hadoop-2.4.1\bin>
yarn jar c:\hadoop\hadoop-2.4.1\wordcount.jar input/ output/

14. verify the result.

c:\hadoop\hadoop-2.4.1\bin>hdfs dfs -cat output
verify the status of the job details and output through web url

http://localhost:50075
http://localhost:8088/cluster


AddToAny

Contact Form

Name

Email *

Message *