It’s time to upgrade Cron to Airflow – install airflow 1.9.0 on ubuntu 18.04 LTS

Airflow is an open-source scheduling tool, originally built at Airbnb and now incubated by the Apache Software Foundation. It is gaining popularity, and more tech companies are starting to use it.
Compared with our company’s existing scheduling tool, crontab, it provides advantageous features such as a user-friendly web UI, multi-process/distributed execution, and notification on failure with automatic retries.
In this post, I’m going to record my journey of setting up Airflow.
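
As a preview of what replaces a crontab entry, here is a minimal sketch of an Airflow 1.x DAG (the DAG id, schedule and script path are my own illustration, not from this post):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# replaces a crontab entry such as: 0 2 * * * /path/to/job.sh
dag = DAG(
    dag_id='daily_job',
    start_date=datetime(2018, 1, 1),
    schedule_interval='0 2 * * *',  # the same cron syntax crontab uses
)

run_job = BashOperator(
    task_id='run_job',
    bash_command='/path/to/job.sh ',  # trailing space stops Airflow treating .sh as a Jinja template
    dag=dag,
)

On top of this, the web UI tracks every run and handles retries, which crontab cannot do.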

Continue reading “It’s time to upgrade Cron to Airflow – install airflow 1.9.0 on ubuntu 18.04 LTS”

Shiny + shinydashboard + googleVis = Powerful Interactive Visualization

If you are a data scientist who has spent several weeks developing a fantastic model, you’d like an equally awesome way to visualize and demo your results. For R users, ggplot2 is a good option, but no longer sufficient on its own. R Shiny + shinydashboard + googleVis can be a wonderful combination for a quick demo application. Additionally, compared to Tableau, Shiny provides extra useful functions such as file upload/download, which helps end users interact with our models.

For the purpose of illustration, I downloaded a random sample of data (test.csv) from one of Kaggle’s latest competitions: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data

The R shiny app is available here: https://6chaoran.shinyapps.io/nyc-taxi/

Continue reading “Shiny + shinydashboard + googleVis = Powerful Interactive Visualization”

Digit Recognition with TensorFlow

This time I am going to continue with the Kaggle 101-level competition, Digit Recognizer, using the deep learning tool TensorFlow.

0. Previously

In the previous post, I used PCA and pooling methods to reduce the dimensionality of the dataset and trained a linear SVM. Due to the limited efficiency of the R SVM package, I only sampled 500 records and performed 10-fold cross-validation. The resulting accuracy was about 82.7%.

1. This Time

With the new tool TensorFlow, we can approach the problem differently:

  • Deep learning, especially the convolutional neural network (CNN), is well suited to image recognition problems.
  • TensorFlow is a good tool for quickly building a neural network architecture, and it can also harness the power of GPUs.
  • Convolution layers artificially create additional features by scanning boxes of pixels on the image.
  • Stochastic gradient descent (SGD) replaces gradient descent. SGD takes a sample of the data for each gradient update, making training faster and less RAM-hungry.

Fully utilizing the entire dataset (42k records) with deep learning models, the accuracy easily goes above 95%.
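
To make this concrete, here is a minimal sketch of such a CNN trained with mini-batch SGD, written against the TensorFlow 1.x API (the layer sizes, batch size and learning rate are my own illustration, not the exact model from this post):

import numpy as np
import tensorflow as tf

# stand-in random data; replace with the 42k records parsed from train.csv
train_images = np.random.rand(256, 28, 28, 1).astype('float32')
train_labels = np.random.randint(0, 10, 256)

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
y = tf.placeholder(tf.int64, [None])

# one convolution + pooling block, then a small fully-connected head
conv = tf.layers.conv2d(x, filters=32, kernel_size=5, activation=tf.nn.relu)
pool = tf.layers.max_pooling2d(conv, pool_size=2, strides=2)
flat = tf.layers.flatten(pool)
dense = tf.layers.dense(flat, 128, activation=tf.nn.relu)
logits = tf.layers.dense(dense, 10)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # SGD: each update sees a random mini-batch, not the whole dataset
        idx = np.random.choice(len(train_images), 64, replace=False)
        sess.run(train_op,
                 feed_dict={x: train_images[idx], y: train_labels[idx]})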

Continue reading “Digit Recognition with TensorFlow”

Implementation of Model Based Recommendation System in R

Overview

The most straightforward recommendation systems are either user-based CF (collaborative filtering) or item-based CF, both categorized as memory-based methods. User-based CF recommends products based on the behaviour of similar users, while item-based CF recommends products similar to the ones a user has purchased. Whichever method is used, the user-user or item-item similarity matrix, which can be sizable, has to be computed.

In contrast, a model-based approach converts the recommendation problem into a regression, classification, or learning-to-rank problem. Matrix factorization, also known as the latent factor model or SVD, is one of the most commonly used model-based methods. In this post, a variety of CF methods will be discussed (a small numeric sketch of the gradient-descent variant follows the list), including:

  • Gradient Descent CF (GD)
  • Alternating Least Square (ALS)
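
As a flavour of the gradient-descent variant, here is a minimal Python/numpy sketch of matrix factorization (the post itself works in R; the toy ratings matrix, factor count and hyper-parameters below are my own illustration):

import numpy as np

# hypothetical user-item ratings; 0 means unobserved
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 1, 5]], dtype=float)
n_users, n_items = R.shape
k, lr, reg = 2, 0.02, 0.02   # latent factors, learning rate, L2 penalty

np.random.seed(0)
P = np.random.normal(scale=0.1, size=(n_users, k))   # user factors
Q = np.random.normal(scale=0.1, size=(n_items, k))   # item factors

for _ in range(1000):
    for u, i in zip(*R.nonzero()):                   # observed ratings only
        err = R[u, i] - P[u].dot(Q[i])
        P[u] += lr * (err * Q[i] - reg * P[u])       # gradient step on user factors
        Q[i] += lr * (err * P[u] - reg * Q[i])       # gradient step on item factors

print(np.round(P.dot(Q.T), 2))  # reconstructed ratings, including the blanks

ALS fits the same model but, instead of stepping along the gradient, alternately solves for P and Q in closed form while holding the other fixed.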

Revisit Titanic Data using Apache Spark

This post mainly demonstrates the pyspark API (Spark 1.6.1) on the Titanic dataset, which can be found here (train.csv, test.csv). Another post analysing the same dataset using R can be found here.

Content

  1. Data Loading and Parsing
  2. Data Manipulation
  3. Feature Engineering
  4. Apply Spark ml/mllib models

1. Data Loading and Parsing

Data Loading

sc is the SparkContext launched together with pyspark. Using sc.textFile, we can read the csv file as text into an RDD, with the fields separated by commas.

train_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/train.csv'
test_path='/Users/chaoranliu/Desktop/github/kaggle/titanic/test.csv'
# Load csv file as RDD
train_rdd = sc.textFile(train_path)
test_rdd = sc.textFile(test_path)

Let’s look at the first 3 rows of the data in the RDD.

train_rdd.take(3)

[u'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked',
 u'1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
 u'2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C']
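
A naive line.split(',') would break on the quoted names above, so the parsing step could be sketched with Python’s csv module (my own sketch of the next stage, not necessarily the code used later in the post):

import csv
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO

def parse_line(line):
    # csv.reader understands quoted fields such as "Braund, Mr. Owen Harris"
    return next(csv.reader(StringIO(line)))

header = train_rdd.first()                 # the column-name row
train_parsed = (train_rdd
                .filter(lambda line: line != header)
                .map(parse_line))
train_parsed.take(2)                       # lists of field values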


Continue reading “Revisit Titanic Data using Apache Spark”

Tableau Vis – Intersection Filter

Intro

If you have used Tableau before, you will know that its filters are union/or selections.
Let’s take the table below as an example.
If you create a filter and select products a & b, Tableau will show clients A, B, C and E instead of A and C. This is because the filter lists the clients who purchased product a or b, not those who purchased both a and b.

[Figure: intersection filter data table]
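
The or-versus-and logic can be pinned down with a small pandas sketch (the purchases table below is hypothetical, chosen to be consistent with the description above):

import pandas as pd

# hypothetical purchases: A and C bought both products, B and E only one
df = pd.DataFrame({'client':  ['A', 'A', 'B', 'C', 'C', 'E'],
                   'product': ['a', 'b', 'b', 'a', 'b', 'b']})

# what a Tableau filter gives: clients who bought a OR b
union = sorted(df.loc[df['product'].isin(['a', 'b']), 'client'].unique())

# what we actually want: clients who bought BOTH a AND b
by_client = df.groupby('client')['product'].apply(set)
both = by_client.apply(lambda s: {'a', 'b'} <= s)
intersection = sorted(both[both].index)

print(union)         # ['A', 'B', 'C', 'E']
print(intersection)  # ['A', 'C']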

Continue reading “Tableau Vis – Intersection Filter”

Job Hunting Like A Data Analyst (part III) – A Simple Recommender

Mission

Continuing from the previous post, Explore the Job Market, this week I am going to develop a simple recommender system to find a suitable job.

Recommender

Let’s go over some background on recommendation systems. Typical examples are the products recommended in the sidebar on Amazon, or the people you may know on Facebook.
Recommenders are usually categorised into two types:

1. Content Based Recommendation:

Content-based could mean user-based or product-based; the choice depends on the mission of the business and the sparsity of information. A set of attributes is required to characterise the product or user, in order to make customised and accurate predictions.
The problem is that sometimes there is no good way to automatically derive attributes for a new product or user.
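
As a quick illustration of the content-based idea, a skill profile can be matched against job descriptions by comparing text attributes (the profile, job texts, and choice of TF-IDF below are my own illustration, not necessarily the method used later in the post):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

profile = 'python sql machine learning visualization'  # hypothetical skill profile
jobs = ['data analyst sql excel reporting',            # hypothetical job posts
        'machine learning engineer python spark',
        'marketing executive social media']

# represent the profile and jobs as TF-IDF vectors over a shared vocabulary
X = TfidfVectorizer().fit_transform([profile] + jobs)
scores = cosine_similarity(X[0], X[1:]).ravel()
print(scores.argsort()[::-1])  # job indices, best match first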

2. Collaborative Filtering:

As the name implies, collaborative filtering makes use of both user and product information to produce recommendations.

Continue reading “Job Hunting Like A Data Analyst (part III) – A Simple Recommender”

Job Hunting Like A Data Analyst (Part II)

Content

Continuing from the previous post, I’ve added some additional lines of code to fetch the job description of each job post (see the sketch below). This takes a bit longer, about 1.5 hours for me, because I set a delay of ~10 seconds between requests.
This week I will continue with an overview of the Data Analyst job market and develop a simple recommender based on skill and experience requirements.
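
The fetching loop looks roughly like this (a sketch: the URL list and the extraction step are placeholders for the actual code):

import time
import urllib2  # Python 2, matching the tools below; use urllib.request on Python 3

job_urls = ['...']  # job-post URLs collected in the previous post (placeholder)
descriptions = []
for url in job_urls:
    html = urllib2.urlopen(url).read()
    # ... extract the job-description text from html ...
    descriptions.append(html)
    time.sleep(10)  # be polite: ~10 seconds between requests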

Tools

  1. python 2.x
  2. python package: pandas
  3. python package: re

1. Job Market Overview

Data Preparation:

After some time spent web scraping, we have a fairly clean dataset consisting of big chunks of job-description text. What I got looks something like this:
[screenshot of the scraped dataset]
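
From here, skill mentions can be pulled out of the text with re and pandas (the skill list and toy rows below are my own illustration):

import re
import pandas as pd

skills = ['python', 'r', 'sql', 'excel', 'tableau', 'hadoop']

def has_skill(description, skill):
    # case-insensitive whole-word match
    return bool(re.search(r'\b%s\b' % re.escape(skill), description, re.IGNORECASE))

df = pd.DataFrame({'description': [
    'Requires Python and SQL; Tableau is a plus.',   # toy job descriptions
    'Strong Excel and SQL skills needed.']})
flags = pd.DataFrame({s: df['description'].apply(lambda d: has_skill(d, s))
                      for s in skills})
print(flags.sum())  # how many postings mention each skill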

Continue reading “Job Hunting Like A Data Analyst (Part II)”

Job Hunting Like A Data Analyst

Motivation:

I’m currently having a tough time looking for a data analyst job.
Instead of doing it the traditional way, I thought: why not do the job hunting like a data analyst, by making use of the advantages of data science?

The first step in fighting a battle is always to know your enemy. As I’m looking for a job in Singapore/China, the first thing I’d like to explore is the job market in these areas.
I’m interested to know:

  • Who is hiring data analysts
  • Which cities need data analysts most
  • What skills are expected
  • How a data analyst is described

What I will do is start with LinkedIn Job Search to find job postings for Data Analysts, using data science of course!

Continue reading “Job Hunting Like A Data Analyst”