|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "## Learning in Neural Networks\n", |
| 7 | + "## Backpropagation for humans\n", |
8 | 8 | "\n",
|
9 | 9 | "\n",
|
10 | 10 | "This is probably the least understood algorithm in Machine Learning but is extremely intuitive. In this post we'll explore how to mathematically derive backpropagation and get an intuition how it works."
|
|
18 | 18 | "The learning process is simply adjusting the weights and biases that's it! The Neural Netowork does this by a process called Backpropagation. The steps are as follows:\n",
|
19 | 19 | "1. Randomly initialise weights\n",
|
20 | 20 | "2. __Forward Pass__: Predict a value using an activation function. \n",
|
21 |
| - "2. See how bad you're performing using loss function. \n", |
22 |
| - "3. __Backward Pass__: Backpropagate the error. That is, tell your network that it's wrong, and also tell what direction it's supposed to go in order to reduce the error. This step updates the weights (here's where the network learns!)\n", |
23 |
| - "4. Repeat steps 2 & 3 until the error is reasonably small or for a specified number of iterations. \n", |
| 21 | + "3 See how bad you're performing using loss function. \n", |
| 22 | + "4. __Backward Pass__: Backpropagate the error. That is, tell your network that it's wrong, and also tell what direction it's supposed to go in order to reduce the error. This step updates the weights (here's where the network learns!)\n", |
| 23 | + "5. Repeat steps 2 & 3 until the error is reasonably small or for a specified number of iterations. \n", |
24 | 24 | "\n",
|
25 | 25 | "Step 3 is the most important step. We'll mathematically derive the equation for updating the values. \n",
|
26 | 26 | "\n",
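Before deriving anything, here's a minimal NumPy sketch of the whole loop on a tiny made-up dataset, just to see the five steps in one place. The toy data, the shapes, and the omission of bias terms are my simplifications, and the two gradient lines inside the loop are exactly the formulas the rest of this post derives, so take them on faith for now. (This sketch uses the rows-as-examples convention `X @ W`, while the derivation below works with $\theta \cdot X^T$.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny made-up dataset: 4 examples, 2 features, 1 output (shapes are arbitrary)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))          # step 1: randomly initialise weights
W2 = rng.normal(size=(3, 1))
lr = 0.5

for i in range(1000):
    # step 2: forward pass
    z2 = X @ W1
    a2 = sigmoid(z2)
    z3 = a2 @ W2
    y_hat = sigmoid(z3)
    # step 3: see how bad we're doing (squared error loss)
    loss = 0.5 * np.sum((y - y_hat) ** 2)
    # step 4: backward pass -- these gradients are what we derive below
    delta3 = (y_hat - y) * sigmoid_prime(z3)
    delta2 = (delta3 @ W2.T) * sigmoid_prime(z2)
    W2 -= lr * (a2.T @ delta3)
    W1 -= lr * (X.T @ delta2)
    if i % 200 == 0:
        print(f"iteration {i}, loss {loss:.4f}")
# step 5: we simply repeated steps 2-4 for a fixed number of iterations
```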
|
|
73 | 73 | "\\end{bmatrix}\n",
|
74 | 74 | "$$\n",
|
75 | 75 | "\n",
|
76 |
| - "And second level weights as:\n", |
| 76 | + "And second layer weights as:\n", |
77 | 77 | "$$\n",
|
78 | 78 | "\\theta_2 = \n",
|
79 | 79 | "\\begin{bmatrix}\n",
|
|
86 | 86 | "$$ z_1^{\\left(2\\right)}=\\theta_{10}^{\\left(1\\right)}+\\theta_{11}^{\\left(1\\right)}x_1+\\theta_{12}^{\\left(1\\right)}x_2 + ....\\text{for all the $z$s}$$\n",
|
87 | 87 | "\n",
|
88 | 88 | "All we do is:\n",
|
89 |
| - "$$ \\tag 3 z^{\\left(2\\right)}=\\theta^{\\left(2\\right)}\\cdot X $$\n", |
| 89 | + "$$ \\tag 3 z^{\\left(2\\right)}=\\theta^{\\left(2\\right)}\\cdot X^T $$\n", |
90 | 90 | "\n",
|
91 | 91 | "And the activity at the second layer is thus\n",
|
92 | 92 | "$$ \\tag 4 a^{\\left(2\\right)}=\\sigma\\left(z^{\\left(2\\right)}\\right) $$\n",
|
93 | 93 | "Which is the same as:\n",
|
94 |
| - "$$ \\tag 5 a^{\\left(2\\right)}=\\sigma\\left(\\theta^{\\left(2\\right)}\\cdot X\\right) $$\n", |
| 94 | + "$$ \\tag 5 a^{\\left(2\\right)}=\\sigma\\left(\\theta^{\\left(2\\right)}\\cdot X^T\\right) $$\n", |
95 | 95 | "\n",
|
96 | 96 | "Repeating the same step for the third layer will give us the output. \n",
|
97 | 97 | "$$ \\tag 6 z^{\\left(3\\right)}=\\theta^{\\left(2\\right)}\\cdot a^{\\left(2\\right)} $$\n",
|
|
105 | 105 | "source": [
|
106 | 106 | "## Forward Pass\n",
|
107 | 107 | "\n",
|
108 |
| - "Let's take an example of a Neural Network to solve the MNIST character recognition problem. Every image is 20x20 pixel in dimension, hence the a single input will (20x20) 400 features. Remember, that the input is the first layer, so the number of neurons in the first layer will be 400. The second layer will be the hidden layer, let's say that the number of neurons in the hidden layers is 25. And since we're predicting whether the image is a number from 0-9 there are 10 discrete outputs, hence the output layer will have 10 neurons. Each of the neuron in output layer will predict a value between 0 and 1. Since these values as probabilities, the value that has the highest probability will be the winner. \n", |
| 108 | + "Let's take an example of a Neural Network to solve the MNIST character recognition problem. Every image is 20x20 pixel in dimension, hence the a single input will (20x20) 400 features. Remember, that the input is the first layer, so the number of neurons in the first layer will be 400. The second layer will be the hidden layer, let's say that the number of neurons in the hidden layers is 25. And since we're predicting whether the image is a number from 0-9 there are 10 discrete outputs, hence the output layer will have 10 neurons. Each of the neuron in output layer will predict a value between 0 and 1. Since these values are probabilities, the value that has the highest probability will be the winner. \n", |
109 | 109 | "\n",
|
110 | 110 | "#### Dimension of (input) X = (5000, 400) \n",
|
111 | 111 | "\n",
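As a quick sanity check, here's how I'd write those dimensions down in NumPy. The +1 columns for the bias terms and the one-hot shape of the labels are my assumptions about the setup rather than something stated above.

```python
import numpy as np

n_examples, n_inputs, n_hidden, n_outputs = 5000, 400, 25, 10

X = np.zeros((n_examples, n_inputs))          # (5000, 400): one flattened 20x20 image per row
y = np.zeros((n_examples, n_outputs))         # (5000, 10): assuming one-hot encoded labels
theta1 = np.zeros((n_hidden, n_inputs + 1))   # (25, 401): +1 column for the bias term
theta2 = np.zeros((n_outputs, n_hidden + 1))  # (10, 26):  +1 column for the bias term
```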
|
|
192 | 192 | "cell_type": "markdown",
|
193 | 193 | "metadata": {},
|
194 | 194 | "source": [
|
195 |
| - "### Easier part $\\frac{\\partial J}{\\partial \\theta^{\\left(2\\right)}}$\n", |
| 195 | + "## Easier part $\\frac{\\partial J}{\\partial \\theta^{\\left(2\\right)}}$\n", |
196 | 196 | "\n",
|
197 | 197 | "Calculating $\\frac{\\partial J}{\\partial \\theta^{\\left(2\\right)}}$ is easier than calculating $\\frac{\\partial J}{\\partial \\theta^{\\left(1\\right)}}$ so we'll start by that first. We'll go step by step and try to understand what each step is accomplishing. \n",
|
198 | 198 | "\n"
|
|
219 | 219 | "\\frac{\\partial J}{\\partial W^{\\left(2\\right)}} &= \\frac{\\partial\\frac{1}{2}\\left(y-\\hat y\\right)^2}{\\partial W^{\\left(2\\right)}} \\\\\n",
|
220 | 220 | "\\notag\n",
|
221 | 221 | "&= (y-\\hat y)\\cdot\\left(-\\frac{\\partial \\hat y}{\\partial W^{\\left(2\\right)}}\\right)\n",
|
222 |
| - "\\end{align} $$\n", |
| 222 | + "\\end{align} \n", |
| 223 | + "$$\n", |
| 224 | + "\n", |
223 | 225 | "We have to differentiate $\\hat y$ to respect the [Chain Rule](https://www.youtube.com/watch?v=6kScLENCXLg). This minus sign in the second term comes from differentiating $-\\hat y$\n",
|
224 | 226 | "\n",
|
225 | 227 | "Using Equation (7) and (8) we have, \n",
|
|
241 | 243 | "In the last part of the equation we'll be differentiating $W^{\\left(2\\right)} \\cdot a^{\\left(2\\right)}$ by $W^{\\left(2\\right)}$. We know that the derivative of $4x$ with respect to $x$ is $4$ so the derivative of $W^{\\left(2\\right)} \\cdot a^{\\left(2\\right)}$ with respect $W^{\\left(2\\right)}$ will be $a^{\\left(2\\right)}$\n",
|
242 | 244 | "\n",
|
243 | 245 | "$$ \n",
|
244 |
| - "\\tag 9\n", |
245 |
| - "\\frac{\\partial J}{\\partial W^{\\left(2\\right)}} = \\left(z-y\\right)\\cdot\\sigma'\\left(z^{\\left(3\\right)}\\right)\\cdot\\left(a^{\\left(2\\right)}\\right)$$\n", |
| 246 | + "\\frac{\\partial J}{\\partial W^{\\left(2\\right)}} = \\left(z-y\\right)\\cdot\\sigma'\\left(z^{\\left(3\\right)}\\right)\\cdot\\left(a^{\\left(2\\right)}\\right)\n", |
| 247 | + "$$\n", |
| 248 | + "\n", |
| 249 | + "We'll denote the error term in the final layer by $\\delta^{(3)}$\n", |
| 250 | + "\n", |
| 251 | + "$$ \n", |
| 252 | + "\\tag{9}\n", |
| 253 | + "\\frac{\\partial J}{\\partial W^{\\left(2\\right)}} = \\delta^{\\left(3\\right)}\\cdot a^{\\left(2\\right)}\n", |
| 254 | + "$$\n", |
246 | 255 | "\n",
|
247 | 256 | "Now, coming back to the summation we ignored at the top of the derivation, we're going to fix that in the implementation using an accumulator matrix which will store the errors for every row and sum it up. "
|
248 | 257 | ]
|
249 | 258 | },
|
250 | 259 | {
|
251 | 260 | "cell_type": "markdown",
|
252 | 261 | "metadata": {},
|
253 |
| - "source": [] |
| 262 | + "source": [ |
| 263 | + "## Sucky part $\\frac{\\partial J}{\\partial \\theta^{\\left(1\\right)}}$\n", |
| 264 | + "\n", |
| 265 | + "It's nearly the same as the previous step, but involves one additional step using chain rule. We'll start in the same way. " |
| 266 | + ] |
| 267 | + }, |
| 268 | + { |
| 269 | + "cell_type": "markdown", |
| 270 | + "metadata": {}, |
| 271 | + "source": [ |
| 272 | + "$\n", |
| 273 | + "\\begin{align}\n", |
| 274 | + "\\tag {from (1)}\n", |
| 275 | + "\\frac{\\partial J}{\\partial W^{\\left(1\\right)}} &= \\frac{\\partial\\frac{1}{2}\\sum_{i=0}^m\\left(y-\\hat y\\right)^2}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 276 | + "\\notag\n", |
| 277 | + "&=\\frac{\\sum_{i=0}^m\\partial\\frac{1}{2}\\left(y-z\\right)^2}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 278 | + "&= (y-\\hat y)\\cdot\\left(-\\frac{\\partial \\hat y}{\\partial W^{\\left(1\\right)}}\\right)\n", |
| 279 | + "\\end{align}\n", |
| 280 | + "$\n", |
| 281 | + "Not that we've skipped the summation sign as before. Mathematicians might be cursing me at this point. \n", |
| 282 | + "\n", |
| 283 | + "$$\n", |
| 284 | + "\\begin{align}\n", |
| 285 | + "\\notag\n", |
| 286 | + "\\frac{\\partial J}{\\partial W^{\\left(1\\right)}} &= \\left(z-y\\right)\\cdot\\sigma'\\left(z^{\\left(3\\right)}\\right)\\cdot\\left(\\frac{\\partial z^{\\left(3\\right)}}{\\partial W^{\\left(1\\right)}}\\right) \\\\\n", |
| 287 | + "\\end{align}\n", |
| 288 | + "$$\n", |
| 289 | + "\n", |
| 290 | + "Things start to get a little different here. We cannot directly differentiate $z^{(3)}$ with respect to $W^{(1)}$ because $z^{(3)}$ does not directly depend on $W^{(1)}$. So we will use our good ol' chain rule again and divide it further.\n", |
| 291 | + "\n", |
| 292 | + "$$\n", |
| 293 | + "\\frac{\\partial J}{\\partial W^{\\left(1\\right)}} = \\left(z-y\\right)\\cdot\\sigma'\\left(z^{\\left(3\\right)}\\right)\\cdot \\frac{\\partial z^{\\left(3\\right)}}{\\partial a^{\\left(2\\right)}}\\cdot\\frac{\\partial a^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}}\n", |
| 294 | + "$$\n", |
| 295 | + "\n", |
| 296 | + "Replacing the value of $\\delta^{(3)}$ from equation (9)\n", |
| 297 | + "\n", |
| 298 | + "$$\n", |
| 299 | + "\\frac{\\partial J}{\\partial W^{\\left(1\\right)}} = \\delta^{(3)} \\cdot \\frac{\\partial z^{\\left(3\\right)}}{\\partial a^{\\left(2\\right)}}\\cdot\\frac{\\partial a^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}}\n", |
| 300 | + "$$\n", |
| 301 | + "\n", |
| 302 | + "Substituting the value of $z^{(3)}$ from equation (6)\n", |
| 303 | + "\n", |
| 304 | + "$$\n", |
| 305 | + "\\begin{align}\n", |
| 306 | + "\\notag\n", |
| 307 | + "\\frac{\\partial J}{\\partial W^{\\left(1\\right)}} &= \\delta^{(3)} \\cdot \\frac{\\partial z^{\\left(3\\right)}}{\\partial a^{\\left(2\\right)}}\\cdot\\frac{\\partial a^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 308 | + "&= \\delta^{(3)} \\cdot \\frac{\\partial\\left(W^{\\left(2\\right)}\\cdot a^{\\left(2\\right)}\\right)}{\\partial a^{\\left(2\\right)}} \\cdot\\frac{\\partial a^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 309 | + "&= \\delta^{(3)} \\cdot W^{(2)} \\cdot \\frac{\\partial a^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 310 | + "\\tag{Using (4)}\n", |
| 311 | + "&= \\delta^{(3)} \\cdot W^{(2)} \\cdot \\frac{\\partial\\sigma\\left(z^{\\left(2\\right)}\\right)}{\\partial W^{\\left(1\\right)}} \\\\\n", |
| 312 | + "\\tag{We've done this before}\n", |
| 313 | + "&= \\delta^{(3)} \\cdot W^{(2)} \\cdot \\sigma'\\left(z^{\\left(2\\right)}\\right) \\cdot \\frac{\\partial z^{\\left(2\\right)}}{\\partial W^{\\left(1\\right)}}\n", |
| 314 | + "\\end{align}\n", |
| 315 | + "$$\n" |
| 316 | + ] |
254 | 317 | },
|
255 | 318 | {
|
256 | 319 | "cell_type": "code",
|
|