This post mainly demonstrates the pyspark API (Spark 1.6.1) using the Titanic dataset, which can be found here (train.csv, test.csv). Another post analysing the same dataset in R can be found here.
Content
- Data Loading and Parsing
- Data Manipulation
- Feature Engineering
- Apply Spark ml/mllib models
1. Data Loading and Parsing
Data Loading
sc is the SparkContext that is created automatically when the pyspark shell starts. sc.textFile reads a csv file as plain text into an RDD, one string per line; the fields within each line are separated by commas and are not parsed yet.
train_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/train.csv'
test_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/test.csv'
# Load csv file as RDD
train_rdd = sc.textFile(train_path)
test_rdd = sc.textFile(test_path)
Let’s look at the first 3 rows of the RDD.
train_rdd.take(3)
[u'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
u'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
u'2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C']
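The Name field in the output above is quoted and itself contains a comma, so splitting each line on ',' would produce one field too many. The following is a minimal sketch of one way to parse such lines with Python's standard csv module; parse_line is a hypothetical helper name, not part of the pyspark API, and the sketch assumes Python 3 strings (under Python 2 the unicode lines may need encoding first).

```python
import csv

def parse_line(line):
    """Parse one csv line into a list of fields, honouring double quotes.

    Hypothetical helper for illustration; it assumes a record never
    spans more than one line.
    """
    return next(csv.reader([line]))

# Second line of train.csv, as shown in the output above
row = '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S'

naive = row.split(',')    # splits inside the quoted name as well
fields = parse_line(row)  # keeps the quoted name as a single field

print(len(naive), len(fields))
print(fields[3])

# With the RDDs loaded earlier, the same helper could be applied as:
#   header = train_rdd.first()
#   train_parsed = train_rdd.filter(lambda l: l != header).map(parse_line)
```

Every parsed field is still a string at this point, so numeric columns such as Age and Fare would need to be cast in a later step.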