Word count with PySpark
Step 1: Upload word.txt to HDFS
$ hadoop fs -put word.txt
Step 2: Start PySpark
$ pyspark
Step 3: Run the word count in the PySpark shell
>>> wordcount = sc.textFile("word.txt").flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
>>> wordcount.collect()
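
The one-line pipeline in Step 3 chains three stages: flatMap splits each line into words, map pairs each word with a count of 1, and reduceByKey sums the counts per word. As a rough illustration only, the sketch below mimics those stages in plain Python on an in-memory list, so it can be run without Spark or HDFS. The function name word_count and the two sample lines are made up for this example and are not part of the PySpark API.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Plain-Python imitation of flatMap -> map -> reduceByKey (hypothetical helper)."""
    # flatMap stage: split every line on whitespace and flatten into one stream of words
    words = chain.from_iterable(line.split() for line in lines)
    # map stage: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey stage: add up the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Made-up sample input standing in for the contents of word.txt
sample_lines = ["spark makes word count easy", "word count with spark"]
result = word_count(sample_lines)
print(sorted(result.items()))
```

In the PySpark shell, wordcount.collect() likewise returns a list of (word, count) tuples, but the order of that list is not guaranteed, which is why the sketch sorts before printing.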