It’s time to upgrade Cron to Airflow – install Airflow 1.9.0 on Ubuntu 18.04 LTS

Airflow is an open-source scheduling tool incubated at Airbnb, and it is gaining popularity as more tech companies adopt it. Compared with our company’s existing scheduler, crontab, it offers several advantages: a user-friendly web UI, multi-process/distributed execution, and retries and notifications on failure.
In this post, I record my journey of setting up Airflow.
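
To make the comparison with crontab concrete, here is a minimal sketch of a DAG against the Airflow 1.9.0 API; the DAG id, schedule, command, and email address below are placeholders for illustration, not values from my actual setup.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# Retries and failure emails are configured per DAG -- something plain crontab cannot do
default_args = {
    'owner': 'airflow',                  # placeholder owner
    'start_date': datetime(2018, 1, 1),  # placeholder start date
    'retries': 2,                        # re-run a failed task up to two times
    'retry_delay': timedelta(minutes=5),
    'email': ['alerts@example.com'],     # placeholder address
    'email_on_failure': True,
}

dag = DAG('daily_job', default_args=default_args, schedule_interval='@daily')

# The same shell command a crontab entry would run, now with retries and alerting
run_job = BashOperator(task_id='run_job', bash_command='echo hello', dag=dag)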

Continue reading “It’s time to upgrade Cron to Airflow – install Airflow 1.9.0 on Ubuntu 18.04 LTS”

Revisit Titanic Data using Apache Spark

This post mainly demonstrates the pyspark API (Spark 1.6.1) using the Titanic dataset, which can be found here (train.csv, test.csv). Another post analysing the same dataset in R can be found here.

Content

  1. Data Loading and Parsing
  2. Data Manipulation
  3. Feature Engineering
  4. Apply Spark ml/mllib models

1. Data Loading and Parsing

Data Loading

sc is the SparkContext launched together with pyspark. Using sc.textFile, we can read the csv files as text into RDDs, with each line stored as a comma-separated string.

train_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/train.csv'
test_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/test.csv'
# Load csv file as RDD
train_rdd = sc.textFile(train_path)
test_rdd = sc.textFile(test_path)

Let’s look at the first 3 rows of the RDD.

train_rdd.take(3)
[u'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
 u'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
 u'2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C']
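
Note that each row is still a plain comma-separated string, and the quoted Name field itself contains a comma, so a naive split(',') would mis-parse those rows. Here is a minimal parsing sketch using Python's csv module, assuming Python 2 (matching the u'' literals above) and the train_rdd defined earlier:

import csv
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO

def parse_line(line):
    # Python 2's csv module expects byte strings, so encode first;
    # csv.reader copes with quoted fields like "Braund, Mr. Owen Harris"
    return next(csv.reader(StringIO(line.encode('utf-8'))))

header = train_rdd.first()
# Drop the header row, then split each line into a list of field values
train_parsed = train_rdd.filter(lambda line: line != header).map(parse_line)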


Continue reading “Revisit Titanic Data using Apache Spark”