| 1 | +Algorithms |
| 2 | +========== |
| 3 | + |
| 4 | +**pyrecsys** provides, *out of the box*, some basic algorithms based on matrix factorization. |
| 5 | + |
| 6 | +SVD |
| 7 | +--- |
| 8 | + |
| 9 | +**pyrecsys** uses `SVD`_ to decompose the input data (a matrix). |
| 10 | +Once the matrix is *reduced* to a lower-dimensional space, **pyrecsys** can provide |
| 11 | +predictions, recommendations and similarities among the "elements" (either users or |
| 12 | +items, depending only on how you load the matrix data). |
| 13 | + |
| 14 | +.. _`SVD`: http://en.wikipedia.org/wiki/Singular_value_decomposition |
| 15 | + |
| 16 | +Loading data |
| 17 | +~~~~~~~~~~~~ |
| 18 | + |
| 19 | +.. code-block:: python |
| 20 | + |
| 21 | +    from recsys.algorithm.factorize import SVD |
| 22 | + |
| 23 | +    filename = './data/movielens/ratings.dat'  # fields: UserID::MovieID::Rating::Timestamp |
| 24 | +    svd = SVD() |
| 25 | +    svd.load_data(filename=filename, sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int}) |
| 26 | + |
| 27 | +.. code-block:: python |
| 28 | + |
| 29 | +    from recsys.datamodel.data import Data |
| 30 | +    from recsys.algorithm.factorize import SVD |
| 31 | + |
| 32 | +    filename = './data/movielens/ratings.dat' |
| 33 | +    data = Data() |
| 34 | +    format = {'col':0, 'row':1, 'value':2, 'ids': int} |
| 35 | +    data.load(filename, sep='::', format=format) |
| 36 | +    train, test = data.split_train_test(percent=80) # 80% train, 20% test |
| 37 | + |
| 38 | +    svd = SVD() |
| 39 | +    svd.set_data(train) |
| 40 | + |
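As a rough illustration of what an 80/20 split involves, the sketch below shuffles a list of rating tuples and cuts it at the requested percentage. It is not the library's implementation: the helper name, the ``seed`` argument, the shuffling step and the ``(value, row, col)`` tuple layout are all assumptions made for this example.

```python
import random


def split_train_test(tuples, percent=80, seed=None):
    """Hypothetical helper: shuffle the tuples and split them percent / (100 - percent)."""
    shuffled = list(tuples)                # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded only to make the example repeatable
    cut = int(round(len(shuffled) * percent / 100.0))
    return shuffled[:cut], shuffled[cut:]


# 20 toy ratings, laid out as (value, row, col); the layout is an assumption
ratings = [(4.0, item, user) for user in range(5) for item in range(4)]
train, test = split_train_test(ratings, percent=80, seed=42)
```

Every rating ends up in exactly one of the two lists, so the test set can be used to evaluate predictions that were computed from the train set alone.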
| 41 | +Computing |
| 42 | +~~~~~~~~~ |
| 43 | + |
| 44 | + >>> K=100 |
| 45 | + >>> svd.compute(k=K, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile=None) |
| 46 | + |
| 47 | +Parameters: |
| 48 | + |
| 49 | + *min_values*: remove those rows or columns (from the input matrix) that have fewer than *min_values* non-zero entries |
| 50 | + |
| 51 | + *pre_normalize*: normalize input matrix. Possible values are *tfidf*, *rows*, *cols*, *all*. |
| 52 | + |
| 53 | + **tfidf**: By default, treats the matrix as terms-by-documents, so the way the data is loaded matters. Use the *format* parameter of *svd.load_data()* to determine the order of the fields of the input file. |
| 54 | + |
| 55 | + **rows**: Rescales the rows of the input matrix so that they all have unit Euclidean magnitude |
| 56 | + |
| 57 | + **cols**: Rescales the columns of the input matrix so that they all have unit Euclidean magnitude |
| 58 | + |
| 59 | + **all**: Rescales the rows and columns of the input matrix, by dividing both the rows and the columns by the square root of their Euclidean norm |
| 60 | + |
| 61 | + *mean_center*: center the input matrix (i.e. mean subtraction) |
| 62 | + |
| 63 | + *post_normalize*: Normalize every row of :math:`U \Sigma` to be a unit vector. Thus, row similarity (using cosine similarity) returns values in :math:`[-1.0, 1.0]` |
| 64 | + |
| 65 | + *savefile*: Output file to store SVD transformation (:math:`U, \Sigma, V^T` vectors) |
| 66 | + |
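The *rows* option and *post_normalize* both come down to rescaling rows to unit Euclidean length. A minimal pure-Python sketch of that operation follows; ``normalize_rows`` and ``dot`` are hypothetical helpers, not part of pyrecsys. Once two rows have unit length, their dot product is their cosine similarity, which is why the result stays within [-1.0, 1.0].

```python
import math


def dot(a, b):
    """Dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))


def normalize_rows(matrix):
    """Rescale every row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(dot(row, row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


matrix = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
unit_rows = normalize_rows(matrix)
row_norms = [math.sqrt(dot(row, row)) for row in unit_rows]
similarity = dot(unit_rows[0], unit_rows[1])  # cosine similarity between rows 0 and 1
```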
| 67 | +Predictions |
| 68 | +~~~~~~~~~~~~~~~ |
| 69 | + |
| 70 | +To predict a *rating*, :math:`\hat{r}_{ui}`, the SVD class reconstructs a rank-:math:`k` approximation of the original matrix, :math:`M^\prime = U \Sigma_k V^T`. |
| 71 | + |
| 72 | +Then, |
| 73 | + |
| 74 | + >>> svd.predict(ITEMID, USERID, MIN_VALUE=0.0, MAX_VALUE=5.0) |
| 75 | + |
| 76 | +is equal to (with row :math:`i` = ITEMID and column :math:`j` = USERID): |
| 77 | + |
| 78 | +.. math:: |
| 79 | + |
| 80 | +    \hat{r}_{ui} = M^\prime_{ij} |
| 81 | + |
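The reconstruction can be sketched with NumPy (an assumption; this is not the code pyrecsys runs internally). ``reconstruct_rank_k`` and ``predict`` are illustrative names, unknown ratings are stored as 0.0 for simplicity, and clipping to the min/max range is an assumed behaviour suggested by the bounds passed to *svd.predict()*.

```python
import numpy as np


def reconstruct_rank_k(matrix, k):
    """Return the rank-k approximation U_k * Sigma_k * V_k^T of `matrix`."""
    u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors
    return u[:, :k] @ np.diag(sigma[:k]) @ vt[:k, :]


def predict(reconstructed, i, j, min_value=None, max_value=None):
    """Read the reconstructed value at row i, column j, clipped to the given range."""
    value = float(reconstructed[i, j])
    if min_value is not None:
        value = max(value, min_value)
    if max_value is not None:
        value = min(value, max_value)
    return value


# Toy matrix: rows are items, columns are users, 0.0 marks an unknown rating
M = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])
M_prime = reconstruct_rank_k(M, k=2)
rating = predict(M_prime, i=0, j=2, min_value=0.0, max_value=5.0)
```

With ``k`` equal to the full rank the reconstruction reproduces the input exactly; a smaller ``k`` smooths the matrix, and that smoothing is what gives the unknown cells a non-trivial value.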
| 82 | +Recommendations |
| 83 | +~~~~~~~~~~~~~~~ |
| 84 | + |
| 85 | +Recommendations (i.e. unknown values in :math:`M_{ij}`) are also derived from :math:`M^\prime = U \Sigma_k V^T`. In this case, |
| 86 | + |
| 87 | + >>> svd.recommend(USERID, n=10, only_unknowns=True, is_row=False) |
| 88 | + |
| 89 | +returns the :math:`n` highest values of :math:`M^\prime_{i \cdot}` :math:`\forall_j{M_{ij}=\emptyset}` (that is, only entries that are unknown in the input matrix), whilst |
| 90 | + |
| 91 | + >>> svd.recommend(USERID, n=10, only_unknowns=False, is_row=False) |
| 92 | + |
| 93 | +returns the :math:`n` highest values for the user, whether or not they were already known in the input matrix. |
| 94 | + |
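The effect of *only_unknowns* can be illustrated with a small pure-Python sketch. The function below is hypothetical (it is not the pyrecsys method) and assumes unknown entries of the input matrix are marked with ``None``.

```python
def recommend(reconstructed, original, n=10, only_unknowns=True):
    """Return the n (index, score) pairs with the highest reconstructed scores.

    `original` holds the input values, with None marking an unknown entry.
    """
    candidates = [
        (index, score)
        for index, score in enumerate(reconstructed)
        if not only_unknowns or original[index] is None
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]


original = [5.0, None, 3.0, None, None]   # known ratings of one user
reconstructed = [4.8, 3.9, 3.1, 4.4, 1.2]  # the same user's values in M'

unseen_only = recommend(reconstructed, original, n=2, only_unknowns=True)
everything = recommend(reconstructed, original, n=2, only_unknowns=False)
```

Here ``unseen_only`` skips index 0 because the user already rated it, while ``everything`` ranks it first.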
| 95 | +Neighbourhood SVD |
| 96 | +----------------- |
| 97 | + |
| 98 | +The classic neighbourhood algorithm uses the ratings of similar users (or |
| 99 | +items) to predict the values of the input matrix *M*. |
| 100 | + |
| 101 | +.. code-block:: python |
| 102 | + |
| 103 | +    import sys  # sys.argv[1] below is the path to the ratings file |
| 104 | +    from recsys.algorithm.factorize import SVDNeighbourhood |
| 105 | +    svd = SVDNeighbourhood() |
| 106 | +    svd.load_data(filename=sys.argv[1], sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int}) |
| 107 | +    K=100 |
| 108 | +    svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True) |
| 109 | + |
| 110 | +Predictions |
| 111 | +~~~~~~~~~~~ |
| 112 | + |
| 113 | +The only difference from *plain* SVD is how it computes the predictions :math:`\hat{r}_{ui}` |
| 114 | + |
| 115 | + >>> svd.predict(ITEMID, USERID, weighted=True, MIN_VALUE=0.0, MAX_VALUE=5.0) |
| 116 | + |
| 117 | +To compute the prediction, it uses this equation (u=USERID, i=ITEMID): |
| 118 | + |
| 119 | +.. math:: |
| 120 | + |
| 121 | +    \hat{r}_{ui} = \frac{\sum_{j \in S^{k}(i;u)} s_{ij} r_{uj}}{\sum_{j \in S^{k}(i;u)} s_{ij}} |
| 122 | + |
| 123 | +where |
| 124 | + |
| 125 | +:math:`S^k(i; u)` denotes the set of :math:`k` items rated by :math:`u`, which are most similar to :math:`i`. |
| 126 | + |
| 127 | +* To compute the :math:`k` items most similar to :math:`i`, it uses the *svd.similar(i)* method, and then keeps only those items that user :math:`u` has already rated. |
| 128 | + |
| 129 | +:math:`s_{ij}` is the similarity between :math:`i` and :math:`j`, computed using *svd.similarity(i, j)* |
| 130 | + |
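The weighted average in the equation above can be written out as a short pure-Python sketch. ``predict_neighbourhood`` is a hypothetical helper, not the library's code: it assumes the similar items arrive as ``(item_id, similarity)`` pairs ordered from most to least similar, and that the user's ratings are held in a dictionary.

```python
def predict_neighbourhood(similar_items, user_ratings, k=10):
    """Similarity-weighted average of the user's ratings over S^k(i; u).

    similar_items: (item_id, similarity) pairs, most similar first.
    user_ratings: dict mapping item_id -> rating given by the user.
    """
    # S^k(i; u): the k most similar items among those the user has rated
    neighbours = [(j, s) for j, s in similar_items if j in user_ratings][:k]
    total_similarity = sum(s for _, s in neighbours)
    if total_similarity == 0:
        return None  # no usable neighbours, so no prediction
    weighted_sum = sum(s * user_ratings[j] for j, s in neighbours)
    return weighted_sum / total_similarity


similar_items = [(10, 0.9), (11, 0.8), (12, 0.5), (13, 0.3)]
user_ratings = {10: 4.0, 12: 2.0, 13: 5.0}  # item 11 was never rated by this user
prediction = predict_neighbourhood(similar_items, user_ratings, k=2)
```

With ``k=2`` the neighbours are items 10 and 12, so the prediction is (0.9 * 4.0 + 0.5 * 2.0) / (0.9 + 0.5).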
| 131 | +Comparison |
| 132 | +---------- |
| 133 | + |
| 134 | +For those who love RMSE, MAE and the like, here are some numbers comparing both SVD approaches. |
| 135 | +The evaluation uses the `Movielens`_ 1M ratings dataset, split into train and test sets of roughly 80% and 20%. |
| 136 | + |
| 137 | +.. _`Movielens`: http://www.grouplens.org/node/73 |
| 138 | + |
| 139 | +.. note:: |
| 140 | + |
| 141 | + SVD computed with k=100, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True |
| 142 | + |
| 143 | +.. warning:: |
| 144 | + |
| 145 | + Because of *min_values=5*, some rows (movies) or columns (users) in the input matrix are removed: movies rated by fewer than 5 users, and users who rated fewer than 5 movies. |
| 146 | + |
| 147 | +Results |
| 148 | +~~~~~~~ |
| 149 | + |
| 150 | +Number of ratings in the test dataset: 209,908 |
| 151 | + |
| 152 | ++-----------+--------+----------------+ |
| 153 | +| | **SVD**| **SVD Neigh.** | |
| 154 | ++-----------+--------+----------------+ |
| 155 | +| **RMSE** | 0.91811| 0.875496 | |
| 156 | ++-----------+--------+----------------+ |
| 157 | +| **MAE** | 0.71703| 0.684173 | |
| 158 | ++-----------+--------+----------------+ |
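For reference, the two metrics in the table can be computed as in the pure-Python sketch below. The function names and the ``(actual, predicted)`` pair layout are assumptions for this example, and the toy values are unrelated to the Movielens results above.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)


# Toy (actual, predicted) ratings, for illustration only
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
```

RMSE penalises large errors more heavily than MAE, so it is never smaller than MAE on the same set of pairs.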