Some Random


Thursday 21 June 2012

Neo4J/Cypher Recommendation engine

This project is a basic tutorial to discover Neo4J and Cypher. It is hosted on GitHub and is largely based on Marko A. Rodriguez's Gremlin tutorial.

With this project, we will build a movie recommender using the Cypher query language.

This query language has a very low barrier to entry if you have a basic theoretical knowledge of graphs. If something like (A)->(B) makes sense to you, then Cypher will as well.

Using the MovieLens dataset, I built a suite of classes to:

  • Digest the MovieLens dataset and load it into a Neo4J embedded database
  • Manually add full-text indexing
  • Query the graph for movie recommendations

Please note that this project does *not* provide very accurate recommendations; it is merely a hands-on guide to Cypher. Improving the recommendations should be easy enough, though, by adding genre and user-specific context to the query, but this is left as an exercise for the reader (or will be added in a future release, who knows :-) ).

The project is Eclipse and Maven based.


Usage is as follows:

  1. Download MovieLens dataset
  2. Unpack to target/data/ directory
  3. Run es.bidul.graph.test.neo4j.CypherLoader
  4. Run es.bidul.graph.test.neo4j.IndexCreator
  5. Run es.bidul.graph.test.neo4j.CypherRecommender
  6. Run ServerStarter and go to Neo4J Webconsole
  7. Play with the dataset and experiment with new queries

Let's dive into the code!

Code Autopsy

Cypher Loader

Basically, once model POJOs are loaded, putting them into the graph is just one query away:

// Create Movie node
Map<String, Object> params = new HashMap<String, Object>();
params.put("movieId", m.getMovieId());
params.put("title", m.getTitle());
String query = "CREATE (n{movieId :{movieId}, title :{title} , type: 'Movie'})";
ExecutionResult result = engine.execute(query, params);

Pretty straightforward! One query to create a node, in a similar fashion to "INSERT" statements in good old SQL.


Index Creator

If you had a look at the source of CypherLoader, you probably noticed that we did not create an index on movie titles. Since we want to be able to search for recommendations based on movie titles, we have to add a full-text index on all movie nodes. The first query that comes to mind is:

start n=node(*) where n.type='Movie' return n;

Except this will fail because of node 0 (the reference node), which does not have a "type" property. We have to guard the comparison:

start n=node(*) where has(n.type) and n.type='Movie' return n;

This is a valid query, but how could we make it faster? Let's dissect what this query does:

  1. For every node, check if it has a type property
  2. For every matching node, check if the value of this type property equals 'Movie'

But wait, what properties do our POJOs have in common? The answer is "none". In particular, only Movie has a "title" property...

start n=node(*) where has(n.title) return n;

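This query returns exactly the same nodes, since only Movie nodes carry a title, but it needs a single property check per node instead of two, which makes it faster!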
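To make the difference between the three filters concrete, here is a minimal in-memory sketch in plain Java (no Neo4J involved). Nodes are modelled as simple property maps. The class name NodeFilterSketch, the sample data and the 'User' type value are assumptions made for illustration; they are not taken from the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NodeFilterSketch {

    // Strict lookup: fails when the property is missing, like the first query does on node 0.
    static Object strictGet(Map<String, Object> node, String key) {
        if (!node.containsKey(key)) {
            throw new IllegalStateException("property '" + key + "' not found");
        }
        return node.get(key);
    }

    // Rough equivalent of: where has(n.type) and n.type='Movie' (two checks per node).
    static List<Map<String, Object>> byType(List<Map<String, Object>> nodes) {
        List<Map<String, Object>> out = new ArrayList<Map<String, Object>>();
        for (Map<String, Object> n : nodes) {
            if (n.containsKey("type") && "Movie".equals(n.get("type"))) {
                out.add(n);
            }
        }
        return out;
    }

    // Rough equivalent of: where has(n.title) (one check per node).
    static List<Map<String, Object>> byTitle(List<Map<String, Object>> nodes) {
        List<Map<String, Object>> out = new ArrayList<Map<String, Object>>();
        for (Map<String, Object> n : nodes) {
            if (n.containsKey("title")) {
                out.add(n);
            }
        }
        return out;
    }

    // Toy graph: an empty node 0, two movies and one user (hypothetical data).
    static List<Map<String, Object>> sampleNodes() {
        List<Map<String, Object>> nodes = new ArrayList<Map<String, Object>>();
        nodes.add(new HashMap<String, Object>()); // node 0 has no properties at all

        Map<String, Object> movieA = new HashMap<String, Object>();
        movieA.put("type", "Movie");
        movieA.put("title", "Movie A");
        nodes.add(movieA);

        Map<String, Object> user = new HashMap<String, Object>();
        user.put("type", "User"); // assumed type value for non-movie nodes
        nodes.add(user);

        Map<String, Object> movieB = new HashMap<String, Object>();
        movieB.put("type", "Movie");
        movieB.put("title", "Movie B");
        nodes.add(movieB);
        return nodes;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> nodes = sampleNodes();
        System.out.println(byType(nodes).equals(byTitle(nodes))); // both filters select the same nodes
    }
}
```

The shortcut only holds as long as no other kind of node carries a title property; if that ever changes, the type-based filter is the safe one.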

Now that we have retrieved all our Movie nodes, we want to add them to the full-text Lucene index. Just loop around the iterator from the preceding query, add each node, and we're all set. A relatively small number of lines, fast execution...

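Iterator<Node> n = result.columnAs("n");
for (Node node : IteratorUtil.asIterable(n)) {
 // Assumed completion: the original listing is cut off after "title",
 fulltextMoviesIndex.add(node, "title", node.getProperty("title"));
}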

Cypher Recommender

There's no particular magic here, just some Cypher expressions:

START n=node:movies_fulltext({query}) MATCH n<-[r:rated]-user-[o:rated]->stuff 
WHERE r.stars>3 and o.stars>3 
RETURN stuff.title, avg(o.stars), count(o), (avg(o.stars)*count(o)) AS scoring 
ORDER BY scoring desc limit 5

  1. START with nodes returned by the full-text index query
  2. Movie n is rated by user, and that user rated other stuff: n<-[r:rated]-user-[o:rated]->stuff. It's like drawing a graph with paper and pencil!
  3. Restrict this to ratings higher than 3 stars
  4. Construct a rough scoring metric, average number of stars * number of ratings: a movie rated by a lot of people with good but not top ratings (o.stars=4, since ratings of 3 and below are filtered out) will score higher than a very highly starred movie (o.stars=5) rated by very few users
  5. Order the results and display the first 5 results
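The steps above can be replayed on a toy dataset with a minimal plain-Java sketch (no Neo4J involved). The class ScoringSketch, the Rating type and the sample ratings are hypothetical and only serve to illustrate the scoring. Note that avg(o.stars)*count(o) is simply the sum of the stars, which is why popularity weighs so heavily in the result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ScoringSketch {

    // One "rated" relationship: user -[stars]-> movie.
    static final class Rating {
        final String user;
        final String movie;
        final int stars;

        Rating(String user, String movie, int stars) {
            this.user = user;
            this.movie = movie;
            this.stars = stars;
        }
    }

    // Ranks movies by avg(stars) * count(ratings), which equals the sum of stars.
    static List<String> recommend(List<Rating> ratings, String seed, int limit) {
        // n<-[r:rated]-user with r.stars>3: users who liked the seed movie.
        Set<String> fans = new HashSet<String>();
        for (Rating r : ratings) {
            if (r.movie.equals(seed) && r.stars > 3) {
                fans.add(r.user);
            }
        }
        // user-[o:rated]->stuff with o.stars>3: accumulate the stars per movie.
        final Map<String, Integer> score = new HashMap<String, Integer>();
        for (Rating o : ratings) {
            if (fans.contains(o.user) && !o.movie.equals(seed) && o.stars > 3) {
                Integer old = score.get(o.movie);
                score.put(o.movie, (old == null ? 0 : old) + o.stars);
            }
        }
        // ORDER BY scoring desc limit N
        List<String> titles = new ArrayList<String>(score.keySet());
        titles.sort((a, b) -> score.get(b) - score.get(a));
        return new ArrayList<String>(titles.subList(0, Math.min(limit, titles.size())));
    }

    // Toy ratings (made up): "Popular" gets three 4-star ratings, "Niche" a single 5-star one.
    static List<Rating> sampleRatings() {
        List<Rating> ratings = new ArrayList<Rating>();
        ratings.add(new Rating("u1", "Seed", 5));
        ratings.add(new Rating("u2", "Seed", 5));
        ratings.add(new Rating("u3", "Seed", 4));
        ratings.add(new Rating("u4", "Seed", 2));     // u4 did not like the seed movie
        ratings.add(new Rating("u1", "Popular", 4));
        ratings.add(new Rating("u2", "Popular", 4));
        ratings.add(new Rating("u3", "Popular", 4));
        ratings.add(new Rating("u1", "Niche", 5));
        ratings.add(new Rating("u2", "Disliked", 2)); // dropped by the stars>3 filter
        ratings.add(new Rating("u4", "Other", 5));    // ignored: u4 is not a fan of the seed
        return ratings;
    }

    public static void main(String[] args) {
        // "Popular" scores 12 (3 ratings * 4 stars), "Niche" scores 5 (1 rating * 5 stars).
        System.out.println(recommend(sampleRatings(), "Seed", 5));
    }
}
```

With this data, "Popular" ranks ahead of "Niche" even though its individual ratings are lower, which is exactly the bias described in step 4.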

A rough movie recommendation in a single query, using collaborative filtering.

The next step is content-based recommendation: for instance, restricting recommendations to genres shared by the two movies, or to users with similar demographics or geographical positions. The queries are easy to write, and I'll probably add them to GitHub at some point!

First Post

I'll basically post here some interesting tidbits I gleaned from the web and elsewhere...