  1. If you have run this example before, remove the data and output left over from the previous run first

    $ hadoop fs -rmr output
    $ hadoop fs -rm word.txt
    
  2. Download the sample programs to your Linux home directory

    $ cd ~
    $ git clone https://github.com/ogre0403/Hadoop-Streaming-101
    $ cd ~/Hadoop-Streaming-101
    
  3. Upload the test data to HDFS

    $ cd ~/Hadoop-Streaming-101/data
    $ hadoop fs -put word.txt
    
  4. This example provides three Hadoop Streaming implementations of word count: R, Python, and bash (note that the job logs below warn that -file is deprecated in favor of the generic -files option). A sketch of the mapper/reducer logic is shown after the three runs.

    • This step runs the word count example with an R script
      $ cd ~/Hadoop-Streaming-101/R_script
      $ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
      -file ./mapper.r \
      -file ./reducer.r \
      -mapper ./mapper.r \
      -reducer ./reducer.r \
      -numReduceTasks 1 \
      -input word.txt \
      -output output
      

      The execution output looks like this:

      16/12/05 13:41:57 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
      packageJobJar: [./mapper.r, ./reducer.r] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob8712971123088099109.jar tmpDir=null
      16/12/05 13:42:01 INFO mapred.FileInputFormat: Total input paths to process : 1
      ...
      ...
         File Input Format Counters
                 Bytes Read=36
         File Output Format Counters
                 Bytes Written=24
      16/12/05 13:42:19 INFO streaming.StreamJob: Output directory: output
      
    • This step runs the word count example with a Python script
      $ cd ~/Hadoop-Streaming-101/python_script
      $ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
      -file ./map.py \
      -file ./reduce.py \
      -mapper ./map.py \
      -reducer ./reduce.py \
      -numReduceTasks 1 \
      -input word.txt \
      -output output
      

      The execution output looks like this:

      16/12/05 13:47:37 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
      packageJobJar: [./map.py, ./reduce.py] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob8233069967197807642.jar tmpDir=null
      16/12/05 13:47:40 INFO mapred.FileInputFormat: Total input paths to process : 1
      ...
      ...
                 WRONG_REDUCE=0
         File Input Format Counters
                 Bytes Read=36
         File Output Format Counters
                 Bytes Written=24
      16/12/05 13:47:57 INFO streaming.StreamJob: Output directory: output
      
    • This step runs the word count example with a Bash script
      $ cd ~/Hadoop-Streaming-101/bash_script
      $ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
      -file ./map.sh \
      -file ./reduce.sh \
      -mapper ./map.sh \
      -reducer ./reduce.sh \
      -numReduceTasks 1 \
      -input word.txt \
      -output output
      

      The execution output looks like this:

      16/12/05 13:46:42 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
      packageJobJar: [./map.sh, ./reduce.sh] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob2850544900133705954.jar tmpDir=null
      16/12/05 13:46:46 INFO mapred.FileInputFormat: Total input paths to process : 1
      ...
      ...
         File Input Format Counters
                 Bytes Read=36
         File Output Format Counters
                 Bytes Written=24
      16/12/05 13:47:03 INFO streaming.StreamJob: Output directory: output
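
    The streaming contract is the same for the R, Python, and bash versions: the mapper reads text lines from standard input and writes tab-separated key/value pairs to standard output, and the reducer receives those pairs sorted by key and sums the counts for each word. The following is a minimal Python sketch of that word-count logic; it is illustrative only, and the actual map.py / reduce.py in the repository may differ in detail.

      #!/usr/bin/env python
      # map.py (sketch): emit "<word><TAB>1" for every word read from stdin
      import sys

      for line in sys.stdin:
          for word in line.strip().split():
              print("%s\t1" % word)

      #!/usr/bin/env python
      # reduce.py (sketch): sum the counts per word; Hadoop delivers the
      # mapper output to the reducer already sorted by key
      import sys

      current_word, current_count = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t", 1)
          if word == current_word:
              current_count += int(count)
          else:
              if current_word is not None:
                  print("%s\t%d" % (current_word, current_count))
              current_word, current_count = word, int(count)
      if current_word is not None:
          print("%s\t%d" % (current_word, current_count))

    Because the scripts only use stdin/stdout, they can be smoke-tested locally before submitting the job, for example with cat ../data/word.txt | ./map.py | sort | ./reduce.py. The scripts also need to be executable (chmod +x), since -mapper and -reducer invoke them directly.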
      
  5. Check the results

    $ hadoop fs -cat output/part-00000
    
    aaa     2
    bbb     1
    ccc     2
    ddd     1
    
  6. For more command-line options supported by Hadoop Streaming, see the official Hadoop Streaming documentation
