Welcome to the Data Science course again!
- Syllabus
- Introduction to Data Science
- Set up development environment (Java & Gradle)
- Form a team & create Github team repo
- Github team repo
- Data question - homework 1
- Pass unit tests
Credit: http://i.imgur.com/AfFMkHe.jpg
A data scientist makes sense of data in order to draw conclusions, or even to predict outcomes, from data sets.
Common skill sets of a data scientist:
- Math / Statistics
- Python / R
- D3 or some other visualization tool
- Hadoop / Elasticsearch
- Database query knowledge (such as SQL queries and MongoDB scripts)
- Machine learning
Recommended readings
- Question
In this class, we will start by asking a question. Your job this week is to research that question thoroughly, for example by finding out which data sets you can use to support your question or statement.
- Acquiring
With a good question in hand, you will likely want to start your project by acquiring data. We will be programming in our favorite language, Java. In other words, your job is to implement my data collector interface and fill in the details needed to collect data.
- Storage
After collecting a huge amount of data, it is important to decide how to store it. To simplify our life a little, we will use MongoDB as the primary database, with Elasticsearch as the secondary database for quick searching and exploring.
- Explore & Analysis
Then we will cover the most important part of the course: analytics. In this part, we will start by using Elasticsearch to explore the data quickly. In other words, you can lean on Elasticsearch for quick searches that help you make sense of this huge set of data. Once we have some basic knowledge of the data, we will learn how to process it for analysis. We will go over some basic Python and its libraries to do some basic Machine Learning.
- Communication via visualization
Once you are done with the analysis, we will learn how to create visualizations based on it!
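To make the acquiring and storage steps a little more concrete, here is a minimal sketch. The `DataCollector` interface, its `clean` and `save` methods, and the in-memory store are all hypothetical stand-ins; the collector interface provided in the course repository may look different, and real storage would go to MongoDB rather than a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CollectorSketch {
    // Hypothetical interface: the real course interface may differ.
    interface DataCollector<T> {
        List<T> clean(List<T> raw);   // drop records that are unusable
        void save(List<T> cleaned);   // hand the cleaned records to storage
    }

    // Keeps data in memory as a stand-in for a real database.
    static class InMemoryCollector implements DataCollector<String> {
        final List<String> stored = new ArrayList<>();

        @Override
        public List<String> clean(List<String> raw) {
            List<String> cleaned = new ArrayList<>();
            for (String record : raw) {
                // Skip null and blank records, trim the rest.
                if (record != null && !record.trim().isEmpty()) {
                    cleaned.add(record.trim());
                }
            }
            return cleaned;
        }

        @Override
        public void save(List<String> cleaned) {
            stored.addAll(cleaned);
        }
    }

    public static void main(String[] args) {
        InMemoryCollector collector = new InMemoryCollector();
        // Made-up raw records, including one empty and one null entry.
        List<String> raw = Arrays.asList(" record one ", "", null, "record two");
        collector.save(collector.clean(raw));
        System.out.println(collector.stored); // prints [record one, record two]
    }
}
```

Splitting cleaning from saving keeps each step small enough to test on its own.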
For the purpose of this course, we consider three orders of knowledge:
- First order: obtaining information directly from the data or metadata
- Second order: comprehension of first order knowledge
- Third order: deriving inferential information or predicting an outcome from the data
A good question aims to address third order knowledge! In other words, just downloading a data set and reporting its size is not considered a good question! This video dives further in depth into how to formulate a good question for Data Science.
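One possible reading of the three orders, sketched on a tiny made-up data set: counting records stands in for first order, summarizing them with a mean for second order, and fitting a least-squares line to predict the next value for third order. The numbers and the mapping of each method to an order are illustrative assumptions, not part of the course material.

```java
import java.util.Arrays;

class KnowledgeOrders {
    // First order: a fact read directly off the data (how many records).
    static int recordCount(double[] values) {
        return values.length;
    }

    // Second order: a summary that helps us understand the records.
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Third order: fit y = slope * x + intercept by least squares and predict.
    static double predict(double[] x, double[] y, double nextX) {
        double meanX = mean(x);
        double meanY = mean(y);
        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return slope * nextX + intercept;
    }

    public static void main(String[] args) {
        double[] week = {1, 2, 3, 4};          // made-up data
        double[] signups = {10, 20, 30, 40};   // made-up data
        System.out.println(recordCount(signups));       // prints 4
        System.out.println(mean(signups));              // prints 25.0
        System.out.println(predict(week, signups, 5));  // prints 50.0
    }
}
```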
You may use the above awesome list to find some good initial data sets as a starting point for asking good questions.
Kaggle is an awesome machine learning and data analytics competition site. It may be interesting to see if you can solve one of their open challenges with the techniques we learn in this class.
Google also provides some data sets that you can process with BigQuery.
Amazon hosts some data sets as well!
- Volume
- Velocity
- Variety
The 3Vs above define the properties of big data. Volume refers to the size of the data (GB, TB, or even PB), velocity refers to how fast the data comes in, and variety refers to the number of types of data.
Interesting trend
Big data got its start between the 1990s and the early 2000s, when larger internet companies were forced to invent new ways to manage large volumes of data. Today, most people think of Hadoop or a NoSQL database like MongoDB when they think of Big Data. However, the original core components of Hadoop, namely HDFS (Hadoop Distributed File System, for storage), MapReduce (the compute engine), and the resource manager now called YARN (Yet Another Resource Negotiator), are rooted in the batch-mode or offline processing that was commonplace ten to twenty years ago, where data is captured to storage and then processed periodically with batch jobs. Most search engines worked this way in the beginning: the data gathered by web crawlers was periodically processed into updated search results.
- Install Java if you have not done so
Windows users: keep in mind that you have to add Java to your Path variable.
You should be able to run java -version and see 1.8 as the version.
- Install Gradle
Remember to set up JAVA_HOME pointing to where you installed your JDK.
You should be able to run gradle -v to see the Gradle version.
- Install Git
On Windows, you should be able to find Git Bash once Git is installed.
- Clone this repository
git clone
or use the GitHub client, or download it as a zip, whichever you prefer
- Run
gradle test
Install OracleJDK 8 if you don't already have it.
Click on the link above (OracleJDK 8) to download Java 8. Once the download completes, set up the PATH variable in your advanced environment settings (right click on your computer to find them). Remember to point it to your JDK bin folder.
You can install brew and run the following to install Java 8.
brew tap caskroom/cask
brew install brew-cask
brew cask install java
Install Gradle, as this will be our primary build tool.
Click on the link above and install Gradle accordingly. Remember to set up the PATH variable so that your terminal knows Gradle is executable.
You will also need to set up JAVA_HOME pointing to your JDK folder, for example C://jdk8/
Install via brew install gradle, assuming you have brew installed.
- CentOS users can follow the instructions found in the Github Gist.
- Ubuntu users can take a look at the Ask Ubuntu Stack Exchange tutorial.
Please run gradle -v from anywhere in your terminal. You should see the Gradle version reported as 2.12.
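If Gradle complains about Java, it can help to confirm what the JVM itself sees. The small program below is only a sketch for that purpose; the class name `EnvCheck` and its `describe` helper are made up for illustration. It prints the running Java version and whether JAVA_HOME is set.

```java
class EnvCheck {
    // Builds a short report from the Java version and the JAVA_HOME value.
    static String describe(String javaVersion, String javaHome) {
        String home = (javaHome == null || javaHome.isEmpty())
                ? "JAVA_HOME is not set"
                : "JAVA_HOME = " + javaHome;
        return "java.version = " + javaVersion + System.lineSeparator() + home;
    }

    public static void main(String[] args) {
        // java.version is a standard system property; JAVA_HOME is read
        // from the environment and may be missing.
        System.out.println(describe(
                System.getProperty("java.version"),
                System.getenv("JAVA_HOME")));
    }
}
```

With the setup described above, the first line would be expected to start with 1.8.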
Clone or download the course repository, then run gradle hello. You should see Hello Data Science as the console output.
Once you have the above environment set up, please remove every @Ignore from src/test/edu/csula/datascience/examples/SimpleStatsTest.java and make all of the tests there pass.
The goal for this class is that gradle test passes.
With the above done, you can start modifying your project in Eclipse. However, you will still not be able to run the Gradle tasks from there, so you will also need this Eclipse Gradle Plugin to run them (e.g. hello).
Instructions to install Gradle plugins in Eclipse
- In Eclipse Open Help >> Install New Software
- Paste a Gradle update site link -- http://dist.springsource.com/release/TOOLS/gradle -- into the "Work with" text box.
- Click the Add button at the top of the screen.
- Ensure that the option "Group Items by Category" is enabled.
- Select the top-level node 'Extensions / Gradle Integration'.
- Click "Next". This may take a while.
- Review the list of software that will be installed. Click "Next" again.
- Review and accept license agreements and Click "Finish".
Instructions to run Gradle tasks in Eclipse
- Import this repository as a Gradle project
- Right click and run the Gradle task
That said, I suggest that all of you run tasks from the terminal/cmd.
If you have trouble with Git/GitHub, you can look through this document as a quick tutorial.
Still having trouble? Please feel free to raise your hand; I'll be walking around to help.
Test Driven Development or Behavior Driven Development gives you a lot more confidence when refactoring in the future. Moreover, testing is widely adopted by the more popular Open Source projects. Why? Because tests give maintainers the confidence to merge code from unknown developers.
How do we measure unit tests?
In this class, I'll set up Coveralls as the code coverage tool to measure how much of your code your unit tests cover. This will give me a fair measure of the testing you implement for your project. An example can be seen in this repo.
- Dependency Injection
- Avoid static state
- Keep each unit small
Dependency Injection
Dependency injection doesn't always need to be done by a framework like Guice. Put simply, you can declare dependencies in your constructors. If you want to get fancy, you might use the Factory pattern to keep your constructor logic from being exposed.
All in all, you want your module dependencies to be defined clearly so that you can mock them.
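As a minimal sketch of constructor-based injection without any framework: the names `LineSource`, `LineCounter`, and `LineCounterFactory` below are hypothetical and exist only for illustration. The counter never creates its own data source, so a test can hand it a fake one.

```java
import java.util.Arrays;
import java.util.List;

class InjectionSketch {
    // Hypothetical dependency: anything that can supply lines of text.
    interface LineSource {
        List<String> readLines();
    }

    // The dependency is declared in the constructor, so callers
    // (including tests) decide which implementation to pass in.
    static class LineCounter {
        private final LineSource source;

        LineCounter(LineSource source) {
            this.source = source;
        }

        int countNonEmptyLines() {
            int count = 0;
            for (String line : source.readLines()) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        }
    }

    // Optional factory: hides how a counter gets wired together.
    static class LineCounterFactory {
        static LineCounter fromLines(String... lines) {
            return new LineCounter(() -> Arrays.asList(lines));
        }
    }

    public static void main(String[] args) {
        LineCounter counter =
                LineCounterFactory.fromLines("hello", "", "data science");
        System.out.println(counter.countNonEmptyLines()); // prints 2
    }
}
```

In a real project the production `LineSource` might read from a file or a database, while tests keep using an in-memory one like the lambda above.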
For example, suppose you have a piece of code that needs to fetch objects from a database. Instead of:
public class Test {
    public Map<String, Integer> countNumberOfWords() {
        try (Connection c = getConnection()) {
            String sql = "SELECT * FROM test";
            // use the connection and get the list of objects out
        }
    }
}
prefer:
public class Test {
    public Map<String, Integer> countNumberOfWords(List<Test> tests) {
        // count number of words using plain old Java objects
        // this way, the code becomes easily testable and mockable
    }
}
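Here is a runnable sketch of the second style. It assumes, purely for illustration, that each row has already been loaded as a plain String (the original signature takes List<Test>), and the class name `WordCounter` is made up. Because the method only sees plain Java objects, a test needs no database at all.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WordCounter {
    // Counts words case-insensitively across already-loaded rows.
    // Assumption: each row is a plain String of whitespace-separated words.
    static Map<String, Integer> countNumberOfWords(List<String> rows) {
        // LinkedHashMap keeps words in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            for (String word : row.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Fake rows stand in for what a database query would return.
        List<String> fakeRows = Arrays.asList("big data", "Big question");
        System.out.println(countNumberOfWords(fakeRows));
        // prints {big=2, data=1, question=1}
    }
}
```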
Why does testing need to happen at the design phase?
When designing your functions/methods, you have to think about how to test them: which dependencies the object needs, and so on. If you leave testing until afterward, it becomes practically impossible to mock any dependency because they are buried too deep in your code.