若之前執行過，請先刪除之前使用的資料與輸出結果
```
$ hadoop fs -rmr output
$ hadoop fs -rm word.txt
```

下載範例程式至Linux家目錄

$ cd ~
$ git clone https://github.com/ogre0403/Hadoop-Streaming-101
$ cd ~/Hadoop-Streaming-101

將測試資料上傳至HDFS

$ cd ~/Hadoop-Streaming-101/data
$ hadoop fs -put word.txt

本範例提供 R/Python/bash 三種不同Hadoop Streaming的範例

這步示範使用R script執行word count範例程式

$ cd ~/Hadoop-Streaming-101/R_script
$ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
-file ./mapper.r \
-file ./reducer.r \
-mapper ./mapper.r \
-reducer ./reducer.r \
-numReduceTasks 1 \
-input word.txt \
-output output

執行過程如下：

16/12/05 13:41:57 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.r, ./reducer.r] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob8712971123088099109.jar tmpDir=null
16/12/05 13:42:01 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
   File Input Format Counters
           Bytes Read=36
   File Output Format Counters
           Bytes Written=24
16/12/05 13:42:19 INFO streaming.StreamJob: Output directory: output

這步示範使用Python script執行word count範例程式

$ cd ~/Hadoop-Streaming-101/python_script
$ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
-file ./map.py \
-file ./reduce.py \
-mapper ./map.py \
-reducer ./reduce.py \
-numReduceTasks 1 \
-input word.txt \
-output output

執行過程如下：

16/12/05 13:47:37 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./map.py, ./reduce.py] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob8233069967197807642.jar tmpDir=null
16/12/05 13:47:40 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
           WRONG_REDUCE=0
   File Input Format Counters
           Bytes Read=36
   File Output Format Counters
           Bytes Written=24
16/12/05 13:47:57 INFO streaming.StreamJob: Output directory: output

這步示範使用Bash script執行word count範例程式

$ cd ~/Hadoop-Streaming-101/bash_script
$ hadoop jar ../hadoop-streaming-2.6.0-cdh5.5.1.jar \
-file ./map.sh \
-file ./reduce.sh \
-mapper ./map.sh \
-reducer ./reduce.sh \
-numReduceTasks 1 \
-input word.txt \
-output output

執行過程如下：

16/12/05 13:46:42 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./map.sh, ./reduce.sh] [/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar] /tmp/streamjob2850544900133705954.jar tmpDir=null
16/12/05 13:46:46 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
   File Input Format Counters
           Bytes Read=36
   File Output Format Counters
           Bytes Written=24
16/12/05 13:47:03 INFO streaming.StreamJob: Output directory: output

檢查執行結果

$ hadoop fs -cat output/part-00000

aaa     2
bbb     1
ccc     2
ddd     1

更多Hadoop Streaming命令列所支援參數請參考 Hadoop Streaming 官方文件

Hadoop Streaming Quick Start

results matching ""

No results matching ""