Introduction

Welcome back to the Data Science course!

Objectives

Syllabus
Introduction to Data Science
Set up development environment (Java & Gradle)
Form a team & create a GitHub team repo

Metrics

GitHub team repo
Data question - homework 1
Pass unit tests

What is data science?

Credit: http://i.imgur.com/AfFMkHe.jpg

A data scientist's job is to make sense of data: to draw conclusions from data sets, or even to predict outcomes based on them.

Common skill sets of a data scientist:

Math and statistics
Python / R
D3 or some other visualization tool
Hadoop / Elasticsearch
Database querying knowledge (such as SQL queries or MongoDB scripts)
Machine learning

Recommended readings

What is data science? by O'Reilly
Data Scientist: The Sexiest Job of the 21st Century by HBR

Class Overview

Question
In this class, we will start by asking a question. Your job this week is to research that question thoroughly, for example by finding out which data sets you can use to support your question or statement.
Acquiring
With a good question in hand, you will likely want to start your project by acquiring data. We will be programming in our favorite language, Java. In other words, your job is to implement my data collector interface and fill in the details of how the data is collected.
Storage
After collecting a huge amount of data, it's important to decide how to store it. To simplify our lives a little, we will use MongoDB as the primary database, with Elasticsearch as the secondary database for quick searching and exploring.
Explore & Analysis
Then we will cover the most important part of the course: analytics. We will start by using Elasticsearch to explore the data quickly. In other words, you can lean on Elasticsearch for quick searches that help you make sense of this huge set of data. Once you have some basic knowledge of the data, we will learn how to process it for analysis. We will go over some basic Python and its libraries to do some basic machine learning.
Communication via visualization
Once the analysis is done, we will learn how to create visualizations based on it!
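
As a rough illustration of the Acquiring step above, a collector can be split into a step that fetches records and a step that saves them. The interface name `DataCollector`, its two methods, and the in-memory implementation are all hypothetical; the actual interface lives in the course repository and may look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface: the real course interface may differ.
interface DataCollector<T> {
  // Fetch raw records from some source (an API, a file, a crawler, ...).
  List<T> collect();

  // Persist the collected records somewhere (a database, a file, ...).
  void save(List<T> records);
}

// Toy implementation that "collects" fixed strings and keeps them in memory.
class InMemoryCollector implements DataCollector<String> {
  private final List<String> storage = new ArrayList<>();

  @Override
  public List<String> collect() {
    return Arrays.asList("first record", "second record");
  }

  @Override
  public void save(List<String> records) {
    storage.addAll(records);
  }

  int storedCount() {
    return storage.size();
  }
}

class CollectorDemo {
  public static void main(String[] args) {
    InMemoryCollector collector = new InMemoryCollector();
    collector.save(collector.collect());
    System.out.println("Stored records: " + collector.storedCount());
  }
}
```

Keeping the fetch and save steps separate means each one can be replaced or tested on its own, for example by swapping the in-memory list for a database later.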

What makes a good question?

For the purposes of this course, we consider three orders of knowledge:

First order: obtaining information directly from the data or metadata
Second order: comprehension of first order knowledge
Third order: deriving inferential information or predicting an outcome from the data

A good question aims to address third order knowledge! In other words, just downloading a data set and reporting its size is not considered a good question. This video dives further in depth into how to formulate a good question for data science.
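
To make the three orders concrete, here is a small sketch on made-up numbers. The class name `KnowledgeOrders`, the data, and the choice of a least-squares line for the third order are assumptions made for illustration, not part of the course material.

```java
import java.util.Arrays;

// Hypothetical example: one method per order of knowledge.
class KnowledgeOrders {
  // First order: read a fact directly off the data (its size).
  static int count(double[] values) {
    return values.length;
  }

  // Second order: summarize the data to understand it (its average).
  static double mean(double[] values) {
    return Arrays.stream(values).average().orElse(Double.NaN);
  }

  // Third order: fit y = slope * x + intercept by least squares
  // and use the fitted line to predict y for a new x.
  static double predict(double[] xs, double[] ys, double x) {
    double meanX = mean(xs);
    double meanY = mean(ys);
    double covariance = 0.0;
    double varianceX = 0.0;
    for (int i = 0; i < xs.length; i++) {
      covariance += (xs[i] - meanX) * (ys[i] - meanY);
      varianceX += (xs[i] - meanX) * (xs[i] - meanX);
    }
    double slope = covariance / varianceX;
    double intercept = meanY - slope * meanX;
    return slope * x + intercept;
  }

  public static void main(String[] args) {
    // Made-up data: site visits observed over four weeks.
    double[] week = {1, 2, 3, 4};
    double[] visits = {10, 20, 30, 40};

    System.out.println("First order, number of rows: " + count(visits));
    System.out.println("Second order, average visits: " + mean(visits));
    System.out.println("Third order, predicted visits in week 5: " + predict(week, visits, 5));
  }
}
```

Only the last method answers a question about something that is not already in the data, which is what a good question should aim for.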

Some starting points of data sets

Awesome data sets

You may use the awesome list above to find some good initial data sets as a starting point for asking good questions.

Kaggle

Kaggle is an awesome machine learning and data analytics competition site. It may be interesting to see if you can solve one of their open challenges with the techniques we learn in this class.

Google public data sets

Google also provides some data sets that you can process with BigQuery.

AWS public data sets

Amazon hosts some data sets as well!

What is considered to be big data?

Volume
Velocity
Variety

The 3 Vs above define the properties of big data. Volume refers to the size of the data (GB, TB, or even PB), velocity refers to how fast the data comes in, and variety refers to the number of types of data.

Interesting trend

Big data got its start between the 1990s and the early 2000s, when larger internet companies were forced to invent new ways to manage big volumes of data. Today, most people think of Hadoop or NoSQL databases like MongoDB when they think of big data. However, the original core components of Hadoop, HDFS (Hadoop Distributed File System, for storage), MapReduce (the compute engine), and the resource manager now called YARN (Yet Another Resource Negotiator), are rooted in the batch-mode or offline processing commonplace ten to twenty years ago, where data is captured to storage and then processed periodically with batch jobs. Most search engines worked this way in the beginning: the data gathered by web crawlers was periodically processed into updated search results.

Fast Data: Big Data Evolved By Dean Wampler, PhD

Development Environment Setup

Install Java if you have not done so

Windows users: keep in mind that you have to add Java to the PATH variable
You should be able to run java -version and see 1.8 as the version

Install Gradle

Remember to set JAVA_HOME to point to where you installed your JDK
You should be able to run gradle -v to see the Gradle version

Install Git

On Windows, you should be able to find Git Bash once Git is installed

Clone this repository

Use git clone, use the GitHub client, or download it as a zip, whichever you prefer

Run gradle test

Java

Install Oracle JDK 8 if you don't already have it.

Windows User

Click on the link above (Oracle JDK 8) to download Java 8. Once the download completes, set up the PATH variable in your advanced environment settings, which you can reach by right-clicking on your computer.

Remember to point it to your JDK bin folder

Mac User

You can install brew and run the following commands to install Java 8.

brew tap caskroom/cask
brew install brew-cask
brew cask install java

Gradle

Install Gradle as this will be our primary build tool.

Windows User

Click on the link above and install Gradle accordingly. Remember to set up the PATH variable so that your terminal can find the Gradle executable.

You will also need to set JAVA_HOME to point to your JDK folder, for example C:\jdk8\

Mac User

Install via brew install gradle assuming you have brew installed.

Linux User

CentOS users can follow the instructions found in the GitHub Gist.
Ubuntu users can take a look at the Ask Ubuntu Stack Exchange tutorial.

To check Gradle is installed

Run gradle -v from anywhere in a terminal. You should see the Gradle version as 2.12.

Wrap Up Java Review Exercise

Clone or download the course repository, then run gradle hello. You should see Hello Data Science as the console output.

Once you have the above environment set up, remove all the @Ignore annotations from src/test/edu/csula/datascience/examples/SimpleStatsTest.java and make all the tests there pass.

The goal for this class is for gradle test to pass.

Eclipse Gradle plugins

With the above done, you can start modifying your project in Eclipse. However, you still cannot run the Gradle tasks from within it. You will therefore also need this Eclipse Gradle plugin to run Gradle tasks (e.g. hello).

Instructions to install Gradle plugins in Eclipse

In Eclipse, open Help >> Install New Software
Paste a Gradle update site link -- http://dist.springsource.com/release/TOOLS/gradle -- into the "Work with" text box.
Click the Add button at the top of the screen.
Ensure that the option "Group Items by Category" is enabled.
Select the top-level node 'Extensions / Gradle Integration'.
Click "Next". This may take a while.
Review the list of software that will be installed. Click "Next" again.
Review and accept the license agreements and click "Finish".

Instructions to run Gradle tasks in Eclipse

Import this repository as a Gradle project
Right-click and run the Gradle task

That said, I suggest that all of you run tasks from the terminal/cmd.

Unit Testing

Test Driven Development or Behavior Driven Development gives you a lot more confidence when refactoring in the future. Moreover, testing is widely adopted by popular open source projects. Why? Because tests give maintainers the confidence to merge code from unknown developers.

How do we measure unit tests?

In this class, I'll set up Coveralls as the code coverage tool to measure how much of your code is exercised by the unit tests you implement. This gives me a fair measure of the amount of testing you implement for your project. An example can be seen in this repo.

So how do you test?

Dependency Injection
Avoid static state
Keep each unit small
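
As one illustration of the second point above, the sketch below contrasts a counter that keeps its value in a static field with one that keeps it in an instance field. Both classes are hypothetical and exist only for this example.

```java
// Hypothetical example classes, not taken from the course repository.

// Hard to test: every caller shares one counter, so tests affect each other.
class StaticCounter {
  private static int count = 0;

  static int increment() {
    return ++count;
  }
}

// Easier to test: each test creates its own instance with a fresh count.
class InstanceCounter {
  private int count = 0;

  int increment() {
    return ++count;
  }
}

class CounterDemo {
  public static void main(String[] args) {
    // Two "tests" that each expect to start counting from zero.
    StaticCounter.increment();
    System.out.println("static, second test sees: " + StaticCounter.increment());

    new InstanceCounter().increment();
    System.out.println("instance, second test sees: " + new InstanceCounter().increment());
  }
}
```

With the static version, the result of the second call depends on whatever ran before it, so the order in which tests run starts to matter.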

Dependency Injection

Dependency injection doesn't always need to be done by a framework like Guice. Put simply, you can declare dependencies in your constructors. If you want to get fancy, you might use the Factory pattern to keep your constructor logic from being exposed.

All in all, you want your module's dependencies to be defined clearly so that you can mock them.

For example, suppose you have a piece of code that needs to read objects from a database. Instead of:

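import java.sql.Connection;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class Test {
  // Hard to test: the method opens its own database connection.
  // getConnection() is assumed to be defined elsewhere.
  public Map<String, Integer> countNumberOfWords() throws SQLException {
    Map<String, Integer> counts = new HashMap<>();
    try (Connection c = getConnection()) {
      String sql = "SELECT * FROM test";

      // use connection and get list of object out
    }
    return counts;
  }
}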

prefer:

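import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {
  public Map<String, Integer> countNumberOfWords(List<Test> tests) {
    Map<String, Integer> counts = new HashMap<>();
    // count number of words using plain old java object
    // this way, code becomes easily testable and mockable
    return counts;
  }
}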
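
Here is a minimal, runnable sketch of constructor-based dependency injection in the same spirit. The names `WordSource` and `WordCounter` and the lambda used as a fake are hypothetical, and the sketch assumes the text arrives as lines of words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dependency: anything that can supply lines of text.
interface WordSource {
  List<String> lines();
}

class WordCounter {
  private final WordSource source;

  // The dependency is passed in, so a test can hand over a fake source.
  WordCounter(WordSource source) {
    this.source = source;
  }

  Map<String, Integer> countNumberOfWords() {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : source.lines()) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          // Count words case-insensitively.
          counts.merge(word.toLowerCase(), 1, Integer::sum);
        }
      }
    }
    return counts;
  }
}

class WordCounterDemo {
  public static void main(String[] args) {
    // In production this could be backed by a database; here it is a fake.
    WordSource fake = () -> Arrays.asList("big data", "Big questions");
    System.out.println(new WordCounter(fake).countNumberOfWords());
  }
}
```

Because `WordCounter` only knows about the `WordSource` interface, a test can exercise the counting logic without a database connection.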

Why does testing need to be considered at the design phase?

When designing your functions or methods, you have to think about how to test them: what dependencies does the object need, and so on. If you add tests afterward, mocking dependencies becomes nearly impossible because they are buried too deep in your code.
