Hadoop & Spark Study Notes (4): Big Data Application Examples

Word count

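start-all.sh                                    # start Hadoop
cd ~                                            # go to the home directory
wget http://www.gutenberg.org/files/74/74.txt   # download the English text of The Adventures of Tom Sawyer
hdfs dfs -mkdir /data                           # create a directory on HDFS
hdfs dfs -put ~/74.txt /data/Tom.txt            # upload the text to HDFS
hdfs dfs -ls /data                              # list the HDFS directory

// In spark-shell: read the text and keep only lines 482 to 8859 (0-based index), dropping the unwanted sections
val rawData=sc.textFile("hdfs://master:9000/data/Tom.txt",1)
val novelText=rawData.zipWithIndex.filter(x=> x._2 >= 482 && x._2 <= 8859).map(x=>x._1)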
val remove_ph="[.,:()!?;_*$\\[\\]\n\"]"
val remove_pp=Array("'s","--"," '","' ")
// Helper that replaces punctuation and the two-character sequences above with spaces
def doRemove(s:String)={
  var rlt=s.replaceAll(remove_ph, " ")
  for(pp<-remove_pp){
    rlt=rlt.replaceAll(pp, " ")
  }
  rlt
}
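// Common stop words to exclude from the count
val tri_words=Array("the","a","an","and","but","to","of","in","at","on","for","as","up","out","by","it","or","with","not")
// Split each cleaned line into words, lowercase them, and drop empty strings and stop words
val words=novelText.flatMap(line=>doRemove(line).split("\\s+"))
val words_nt= words.map(x=>x.toLowerCase).filter(x=> x.length > 0 && !tri_words.contains(x))
// Count each word and sort by descending frequency
val result=words_nt.map(word=>(word, 1)).reduceByKey(_ + _).sortBy(x=>x._2, false)
// Print the 30 most frequent words
result.take(30).foreach(println)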
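
To check the clean-up and counting logic without a cluster, the same steps can be tried on a couple of lines with plain Scala collections. This is only a minimal sketch: the sample lines, the shortened stop-word list, and the names `clean`, `sampleLines` and `counts` are made up for illustration, and `groupBy` stands in for Spark's `reduceByKey`.

```scala
// Local sketch with plain Scala collections; no Spark or HDFS needed.
// Sample lines and names below are illustrative assumptions, not part of the original notes.
val removePh = "[.,:()!?;_*$\\[\\]\n\"]"       // same character class as remove_ph
val removePp = Array("'s", "--", " '", "' ")   // same sequences as remove_pp
val stopWords = Set("the", "a", "an", "and", "to", "of", "it")  // shortened stop-word list

// Replace punctuation first, then each two-character sequence, with a space.
def clean(s: String): String =
  removePp.foldLeft(s.replaceAll(removePh, " "))((acc, pp) => acc.replaceAll(pp, " "))

val sampleLines = Seq(
  "\"Tom!\" No answer.",
  "Tom's aunt looked--and looked again."
)

val counts = sampleLines
  .flatMap(line => clean(line).split("\\s+"))
  .map(_.toLowerCase)
  .filter(w => w.nonEmpty && !stopWords.contains(w))
  .groupBy(identity)                       // local stand-in for reduceByKey
  .map { case (w, ws) => (w, ws.size) }
  .toSeq
  .sortBy { case (w, n) => (-n, w) }       // highest count first, ties alphabetical

counts.foreach(println)
```

The empty-string filter matters because a line that starts with punctuation turns into a line that starts with a space, and `split("\\s+")` then returns an empty first element.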

Frequent items

// Build the data set of shopping baskets
val trans=sc.makeRDD(Array(
  Array("milk","banana","cola","bread"),
  Array("bread","beer","diapers"),
  Array("banana","milk","diapers","cookies"),
  Array("cola","diapers","beer"),
  Array("beer","small apple","diapers"),
  Array("diapers","kiwi","beer"),
  Array("pea crackers","beer","ice cream","pudding","diapers")))
// Generate every combination of items within each basket
val allComb=trans.map{t=>
  for(i<-1 to t.length) yield {
    val eleCom=t.combinations(i)
    val kv=eleCom.map(ele=>"("+ele.sorted.mkString(",")+")")
    kv
  }
}.flatMap(x=>x).flatMap(x=>x)
allComb.collect
// Show the top 10 combinations by count
allComb.map(x=>(x,1)).reduceByKey(_+_).sortBy(x=>x._2,false).take(10).foreach(println)
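
The combination step can also be tried locally. The sketch below uses plain Scala collections instead of an RDD; the three baskets and the names `baskets`, `itemsets` and `support` are hypothetical and chosen only to keep the output small.

```scala
// Local sketch with plain Scala collections; the baskets are hypothetical sample data.
val baskets = Seq(
  Array("bread", "beer", "diapers"),
  Array("cola", "diapers", "beer"),
  Array("milk", "bread")
)

// Every non-empty subset of a basket, with items sorted so that
// ("beer","diapers") and ("diapers","beer") produce the same key.
def itemsets(basket: Array[String]): Seq[String] =
  (1 to basket.length).flatMap { size =>
    basket.combinations(size).map(_.sorted.mkString("(", ",", ")"))
  }

// Number of baskets in which each itemset occurs.
val support = baskets
  .flatMap(b => itemsets(b))
  .groupBy(identity)                       // local stand-in for reduceByKey
  .map { case (set, occurrences) => (set, occurrences.size) }

println(support("(beer,diapers)"))
println(itemsets(Array("a", "b", "c")).size)
```

A basket of n distinct items produces 2^n - 1 itemsets, so enumerating every combination like this is only practical for short baskets.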

One-hot encoding

// Build the data: ID, name, hobby
val dataRDD=sc.makeRDD(Array(
  Array("ID01","Zengzi","basketball"),
  Array("ID02","Zilu","movies"),
  Array("ID03","Yan Hui","reading"),
  Array("ID04","Zai Yu","camping"),
  Array("ID05","Zigong","reading"),
  Array("ID06","Ziqian","camping")))
// Build an index for each distinct hobby
val habbyTypeMap=dataRDD.map(h=>h(2)).distinct.sortBy(x=>x).zipWithIndex.collectAsMap
// Build the one-hot matrix
val dataWithOneHot=dataRDD.map{col=>
  val hArray=Array.ofDim[Double](habbyTypeMap.size)
  hArray(habbyTypeMap(col(2)).toInt)=1.0
  col.slice(0,2) ++ hArray
}
dataWithOneHot.collect
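
To see what the index map and the encoded rows look like, here is a minimal local sketch of the same idea with plain Scala collections. The rows and the names `rows`, `categoryIndex`, `oneHot` and `encoded` are hypothetical and not part of the code above.

```scala
// Local sketch with plain Scala collections; the rows are hypothetical sample data.
val rows = Seq(
  Array("ID01", "Ann", "reading"),
  Array("ID02", "Bob", "camping"),
  Array("ID03", "Cid", "reading")
)

// Distinct category values, sorted, each paired with its column index.
val categoryIndex: Map[String, Int] =
  rows.map(_(2)).distinct.sorted.zipWithIndex.toMap

// A zero vector with a single 1.0 at the column of the given category.
def oneHot(category: String): Array[Double] = {
  val vec = Array.ofDim[Double](categoryIndex.size)  // initialised to 0.0
  vec(categoryIndex(category)) = 1.0
  vec
}

// Keep the ID and name, replace the category with its one-hot vector.
val encoded = rows.map(r => (r(0), r(1), oneHot(r(2)).toList))
encoded.foreach(println)
```

In the Spark version the index comes from `RDD.zipWithIndex`, which yields `Long` values; that is why the lookup there is followed by `.toInt` before it is used as an array position.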

Written by

Machine Learning / Deep Learning / Python / Flutter cakeresume.com/yanwei-liu
