
Commit b030baf

Author: Oscar Celma (committed)
Commit message: first commit

0 parents, commit b030baf


41 files changed: +3825, -0 lines

AUTHORS (+1)

@@ -0,0 +1 @@
Oscar Celma (ocelma __at__ gmail __dot__ com), http://ocelma.net

CHANGELOG (+7)

@@ -0,0 +1,7 @@
===========
Version 0.1
===========

2011-10-08

* Added the whole project at github

DEPENDENCIES (+4)

@@ -0,0 +1,4 @@
divisi2
csc-pysparse
numpy
scipy

README (+36)

@@ -0,0 +1,36 @@
=============
python-recsys
=============

A python library for implementing a recommender system.

==================
INSTALLATION NOTES
==================

1) Dependencies

pyrecsys is built on top of Divisi2, with csc-pysparse (Divisi2 also requires NumPy).
pyrecsys also requires SciPy.

To install the dependencies, do something like this (Ubuntu):

sudo apt-get install python-scipy
sudo apt-get install python-numpy
sudo pip install divisi2 csc-pysparse

# If you don't have pip installed, then do:
# sudo easy_install csc-pysparse
# sudo easy_install divisi2

2) Download

Download pyrecsys from github: https://github.com/ocelma/python-recsys

3) Install

tar xvfz pyrecsys.tar.gz
cd pyrecsys
sudo python setup.py install

...and you're all set! (hopefully)
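
As a quick smoke test after installing, something like the following should work
from a Python shell (the ratings file and the format dict are placeholders taken
from doc/source/algorithm.rst; adapt them to your own data):

from recsys.algorithm.factorize import SVD

svd = SVD()
# One rating per line, e.g. user::item::rating as in the Movielens ratings file
svd.load_data(filename='./data/movielens/ratings.dat', sep='::',
              format={'col':0, 'row':1, 'value':2, 'ids': int})
svd.compute(k=100, min_values=10, pre_normalize=None, mean_center=True,
            post_normalize=True, savefile=None)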

doc/Makefile (+88)

@@ -0,0 +1,88 @@
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS  =
SPHINXBUILD = sphinx-build
PAPER       =

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d build/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source

.PHONY: help clean html dirhtml pickle json htmlhelp qthelp latex changes linkcheck doctest

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html      to make standalone HTML files"
	@echo "  dirhtml   to make HTML files named index.html in directories"
	@echo "  pickle    to make pickle files"
	@echo "  json      to make JSON files"
	@echo "  htmlhelp  to make HTML files and a HTML help project"
	@echo "  qthelp    to make HTML files and a qthelp project"
	@echo "  latex     to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  changes   to make an overview of all changed/added/deprecated items"
	@echo "  linkcheck to check all external links for integrity"
	@echo "  doctest   to run all doctests embedded in the documentation (if enabled)"

clean:
	-rm -rf build/*

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) build/html
	@echo
	@echo "Build finished. The HTML pages are in build/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) build/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in build/dirhtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) build/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) build/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) build/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in build/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) build/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in build/qthelp, like this:"
	@echo "# qcollectiongenerator build/qthelp/Recommender.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile build/qthelp/Recommender.qhc"

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) build/latex
	@echo
	@echo "Build finished; the LaTeX files are in build/latex."
	@echo "Run \`make all-pdf' or \`make all-ps' in that directory to" \
	      "run these through (pdf)latex."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) build/changes
	@echo
	@echo "The overview file is in build/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) build/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in build/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) build/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in build/doctest/output.txt."

doc/source/TODO.rst (+20)

@@ -0,0 +1,20 @@
TODO
=====

* divisi2.make_sparse is way too slow! Need to change how the matrix is created

* Too much memory consumption when using divisi2 functions. Remove the dependency on divisi2?

* algorithms: Gradient Descent rsvd?

* evaluation: add DCG

* Mention other Python approaches:

  * pysuggest http://code.google.com/p/pysuggest/

  * pyrsvd http://code.google.com/p/pyrsvd/

  * crab https://github.com/marcelcaraciolo/crab

  * pyflix http://pyflix.python-hosting.com/

doc/source/algorithm.rst (+158)

@@ -0,0 +1,158 @@
Algorithms
==========

**pyrecsys** provides, *out of the box*, some basic algorithms based on matrix factorization.

SVD
---

**pyrecsys** makes use of `SVD`_ in order to decompose the input data (a matrix).
Once the matrix is *reduced* into a lower dimensional space, **pyrecsys** can provide
predictions, recommendations and similarity among the "elements" (either users or
items; it is just a matter of how you load the matrix data).

.. _`SVD`: http://en.wikipedia.org/wiki/Singular_value_decomposition

Loading data
~~~~~~~~~~~~

.. code-block:: python

    from recsys.algorithm.factorize import SVD

    filename = './data/movielens/ratings.dat'
    svd = SVD()
    svd.load_data(filename=filename, sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})

Alternatively, the data can be loaded through a *Data* object, which also allows splitting it into train and test sets:

.. code-block:: python

    from recsys.datamodel.data import Data
    from recsys.algorithm.factorize import SVD

    filename = './data/movielens/ratings.dat'
    data = Data()
    format = {'col':0, 'row':1, 'value':2, 'ids': int}
    data.load(filename, sep='::', format=format)
    train, test = data.split_train_test(percent=80) # 80% train, 20% test

    svd = SVD()
    svd.set_data(train)

Computing
~~~~~~~~~

>>> K = 100
>>> svd.compute(k=K, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile=None)

Parameters:

*min_values*: remove those rows or columns (from the input matrix) that have fewer than 'min_values' non-zeros

*pre_normalize*: normalize the input matrix. Possible values are *tfidf*, *rows*, *cols*, *all*.

    **tfidf**: By default, treats the matrix as terms-by-documents. It is important, then, how the data is loaded. Use the *format* param in *svd.load_data()* to determine the order of the fields of the input file.

    **rows**: Rescales the rows of the input matrix so that they all have unit Euclidean magnitude

    **cols**: Rescales the columns of the input matrix so that they all have unit Euclidean magnitude

    **all**: Rescales the rows and columns of the input matrix, by dividing both the rows and the columns by the square root of their Euclidean norm

*mean_center*: center the input matrix (aka mean subtraction)

*post_normalize*: Normalize every row of :math:`U \Sigma` to be a unit vector. Thus, row similarity (using cosine distance) returns values in :math:`[-1.0 .. 1.0]`

*savefile*: Output file to store the SVD transformation (the :math:`U, \Sigma, V^T` matrices)
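
Putting these parameters together, an illustrative call could look like this (the concrete values, and the *savefile* path, are placeholders rather than recommendations):

.. code-block:: python

    svd.compute(
        k=100,                      # number of latent factors to keep
        min_values=10,              # drop rows/cols with fewer than 10 non-zeros
        pre_normalize=None,         # or 'tfidf', 'rows', 'cols', 'all' (see above)
        mean_center=True,           # subtract the mean before factorizing
        post_normalize=True,        # unit rows of U*Sigma, so cosine similarity lies in [-1, 1]
        savefile='/tmp/movielens'   # hypothetical output path for U, Sigma, V^T
    )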

Predictions
~~~~~~~~~~~

To predict a *rating*, :math:`\hat{r}_{ui}`, the SVD class reconstructs the original matrix, :math:`M^\prime = U \Sigma_k V^T`

Then,

>>> svd.predict(ITEMID, USERID, MIN_RATING=0.0, MAX_RATING=5.0)

is equivalent to:

.. math::

    \hat{r}_{ui} = M^\prime_{ij}

Recommendations
~~~~~~~~~~~~~~~

Recommendations (i.e. unknown values in :math:`M_{ij}`) are also derived from :math:`M^\prime = U \Sigma_k V^T`. In this case,

>>> svd.recommend(USERID, n=10, only_unknowns=True, is_row=False)

returns the :math:`n` highest values of :math:`M^\prime_{i \cdot}` restricted to :math:`\forall_j\, M_{ij}=\emptyset` (i.e. items the user has not rated yet), whilst

>>> svd.recommend(USERID, n=10, only_unknowns=False, is_row=False)

returns the :math:`n` highest values for the user.

Neighbourhood SVD
-----------------

The classic Neighbourhood algorithm uses the ratings of similar users (or
items) to predict the values of the input matrix *M*.

.. code-block:: python

    import sys

    from recsys.algorithm.factorize import SVDNeighbourhood

    svd = SVDNeighbourhood()
    svd.load_data(filename=sys.argv[1], sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})
    K = 100
    svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)

Predictions
~~~~~~~~~~~

The only difference from *plain* SVD is how it computes the predictions :math:`\hat{r}_{ui}`

>>> svd.predict(ITEMID, USERID, weighted=True, MIN_VALUE=0.0, MAX_VALUE=5.0)

To compute the prediction, it uses this equation (u=USERID, i=ITEMID):

.. math::

    \hat{r}_{ui} = \frac{\sum_{j \in S^{k}(i;u)} s_{ij} r_{uj}}{\sum_{j \in S^{k}(i;u)} s_{ij}}

where

* :math:`S^k(i; u)` denotes the set of the :math:`k` items rated by :math:`u` which are most similar to :math:`i`. To compute the :math:`k` items most similar to :math:`i`, it uses the *svd.similar(i)* method, and then keeps those items that user :math:`u` has already rated.

* :math:`s_{ij}` is the similarity between :math:`i` and :math:`j`, computed using *svd.similarity(i, j)*.
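
To make the formula concrete, here is a small, plain-Python version of that weighted average (the helper and its inputs are illustrative, not part of the pyrecsys API; it only assumes that *svd.similar(i)* yields (item, similarity) pairs):

.. code-block:: python

    def neighbourhood_predict(similars, user_ratings):
        # similars: assumed list of (item_id, similarity) pairs, e.g. from svd.similar(ITEMID)
        # user_ratings: dict {item_id: rating} with the items USERID has already rated
        num, den = 0.0, 0.0
        for j, s_ij in similars:
            if j in user_ratings:          # keep S^k(i; u): similar items that u has rated
                num += s_ij * user_ratings[j]
                den += s_ij
        return num / den if den else None  # weighted average of the neighbours' ratings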

Comparison
----------

For those who love RMSE, MAE and the like, here are some numbers comparing both SVD approaches.
The evaluation uses the `Movielens`_ 1M ratings dataset, splitting it into train/test datasets with roughly 80%-20% of the data.

.. _`Movielens`: http://www.grouplens.org/node/73

.. note::

    Computing svd k=100, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True

.. warning::

    Because of *min_values=5*, some rows (movies) or columns (users) of the input matrix are removed. In fact, movies rated by fewer than 5 users, and users who rated fewer than 5 movies, are removed.

Results
~~~~~~~

# Ratings in the Test dataset: 209,908

+-----------+---------+----------------+
|           | **SVD** | **SVD Neigh.** |
+-----------+---------+----------------+
| **RMSE**  | 0.91811 | 0.875496       |
+-----------+---------+----------------+
| **MAE**   | 0.71703 | 0.684173       |
+-----------+---------+----------------+
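
A rough sketch of how numbers like these could be produced for the plain SVD case (illustrative only: it assumes the test split can be iterated as (rating, item, user) tuples matching the *format* used at load time, and that *predict()* raises KeyError for ids dropped by *min_values*):

.. code-block:: python

    from math import sqrt

    from recsys.algorithm.factorize import SVD
    from recsys.datamodel.data import Data

    data = Data()
    data.load('./data/movielens/ratings.dat', sep='::',
              format={'col':0, 'row':1, 'value':2, 'ids': int})
    train, test = data.split_train_test(percent=80)

    svd = SVD()
    svd.set_data(train)
    svd.compute(k=100, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)

    sq_sum, abs_sum, n = 0.0, 0.0, 0
    for rating, item_id, user_id in test:    # assumed tuple order: (value, row, col)
        try:
            pred = svd.predict(item_id, user_id, MIN_RATING=0.0, MAX_RATING=5.0)
        except KeyError:                     # item or user filtered out by min_values
            continue
        sq_sum += (rating - pred) ** 2
        abs_sum += abs(rating - pred)
        n += 1

    rmse = sqrt(sq_sum / n)
    mae = abs_sum / n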

doc/source/api.rst (+68)

@@ -0,0 +1,68 @@
===
API
===

.. automodule:: recsys

Algorithms
==========

See some usage examples `here <algorithm.html>`_

Baseclass
---------

.. autoclass:: recsys.algorithm.baseclass.Algorithm
    :members:

SVD
---

.. autoclass:: recsys.algorithm.factorize.SVD
    :members:

SVD Neighbourhood
-----------------

.. autoclass:: recsys.algorithm.factorize.SVDNeighbourhood
    :members:

.. SVD Neighbourhood Koren
.. -----------------------

.. .. autoclass:: recsys.algorithm.factorize.SVDNeighbourhoodKoren
..     :members:

Evaluation
==========

See some `examples <evaluation.html>`_

.. autoclass:: recsys.evaluation.baseclass.Evaluation
    :members:

Data Model
==========

The **pyrecsys** data model includes users, items, and their interactions.
See some `datamodel examples <datamodel.html>`_

Item
----

.. autoclass:: recsys.datamodel.item.Item
    :members:

User
----

.. autoclass:: recsys.datamodel.user.User
    :members:

Data
----

.. autoclass:: recsys.datamodel.data.Data
    :members:
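
As a quick illustration of how these data model classes could fit together (the method names used here, *add_data*, *add_tuple* and *get_id*, are assumptions about the API rather than something documented in this file):

.. code-block:: python

    from recsys.datamodel.data import Data
    from recsys.datamodel.item import Item
    from recsys.datamodel.user import User

    # A hypothetical item and user, with some attached metadata
    item = Item(1)
    item.add_data({'name': 'Toy Story (1995)'})

    user = User(10)

    # An interaction as a (value, item, user) tuple, mirroring the
    # {'col', 'row', 'value'} format used when loading from a file
    data = Data()
    data.add_tuple((5.0, item.get_id(), user.get_id()))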
