Friday, 27 December 2013

Neo4j 2.0 - Indexing

NOTE: this post is an unexpected continuation of yesterday's post on Neo4j. You might want to start there.

DISCLAIMER: I'm no expert on the technology; these are personal notes taken while I keep playing around and learning.



INDEXING

I went on with Alberto's Neo4j workshop slides to refresh my memory of the features, until I reached a slide that used indexing:

     START tom=node:node_auto_index(name="Tom Hanks")  
     MATCH (tom)-[:ACTED_IN]->()<-[:DIRECTED]-(director)
     RETURN director.name;

I then tried to execute it and got a nasty error message:

   Index `node_auto_index` does not exist

The problem is clear: the index is missing. But since it was the auto index I was trying to use, I assume there's no more indexing magic in Neo4j 2.0. So I set out to create an index and reproduce what I had in Neo4j 1.9.x. It was clear by now that if I wanted an index on actor names I would have to create it myself, so I started digging through Google to learn more about indexing in 2.0.0.


Creating and using Indexes

The first hit I checked on indexing was a great webinar by Michael Hunger on the new features in Neo4j. From what I could gather, the main difference in indexing is that these are now true indexes: once created, they are automatically maintained whenever data is added or updated. I deduce from that statement that this wasn't the case in previous versions. BTW, the index is maintained transactionally: index updates are bound to the same transaction as the data changes.

So, to create an index you simply need to:

     CREATE INDEX ON :Actor(name)

This approach really simplifies the queries so that my original Cypher query becomes:

     MATCH (actor:Actor)-[:ACTED_IN]->()<-[:DIRECTED]-(director)
     WHERE actor.name="Tom Hanks"
     RETURN director.name;

which is simpler and also closer to what a SQL-John might expect. Under the covers, Neo4j detects that I'm filtering by a property (name) on a labelled node (actor:Actor) and finds that there is an index on ':Actor(name)'. So it automagically tries to use it.




But it is flawed

It turns out that when I tried to create the index using:

     CREATE INDEX ON :Actor(name)

it succeeded but indexed nothing, because my dataset doesn't use labels. So I then tried to index every node by name, regardless of label:

     CREATE INDEX ON :(name)
     CREATE INDEX ON :*(name)
     CREATE INDEX ON (name)

It was a total waste of time, since indexing requires labels in Neo4j 2.0.x (insert sad face here).
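
A quick way to confirm that the dataset really has no labelled nodes, and hence nothing for the index to cover, is to count them. This is only a sketch; Actor is the label from the failed attempt above.

```cypher
// Count the nodes that carry the Actor label.
// If no node has been labelled yet, the count is 0,
// so an index on :Actor(name) has nothing to cover.
MATCH (n:Actor)
RETURN count(n);
```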




INTRODUCING LABELS!

Back on my quest to query using an index, I realised my only chance was to create a label and have every node that [:ACTED_IN] another node labelled as Actor. It turned out to be quite straightforward:

      MATCH (actor)-[:ACTED_IN]->(movie)
      SET actor :Actor
      RETURN actor;

This finally created my label, which let me create indexes and, in turn, query using them.
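
Putting the pieces together, the whole sequence could look roughly like the sketch below. The DIRECTED relationship type is an assumption based on the movie dataset used in the slides. Each statement should be run on its own, since Neo4j does not allow schema changes and data changes in the same transaction.

```cypher
// 1. Label every node that has an outgoing ACTED_IN relationship.
MATCH (actor)-[:ACTED_IN]->()
SET actor:Actor;

// 2. Create the index on the name property of Actor nodes.
//    The index is populated in the background, so it may take a moment to come online.
CREATE INDEX ON :Actor(name);

// 3. Query by name; the label plus the property filter lets Neo4j pick the index.
//    DIRECTED is assumed here, it is not confirmed by my dataset notes.
MATCH (actor:Actor)-[:ACTED_IN]->()<-[:DIRECTED]-(director)
WHERE actor.name = "Tom Hanks"
RETURN DISTINCT director.name;
```

RETURN DISTINCT is used in step 3 because the same director can be reached through several movies.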



FUTURE WORK

Some doubts I need to investigate further:
  • The video mentions there's "no unique indexing yet", but the video is a few months old now and is based on Neo4j 2.0.0-M0.
  • There's also a mention of 'simple lookups for now', and I wonder what that might mean.
  • While reading the docs on indexing I noticed it is possible to force the use of a given index when querying (which is wonderful and also expected by some SQL-Johns).
  • I read somewhere that it's possible to change the indexing technology. That's definitely worth a look.
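
As a starting point for the first and third items, here is a minimal sketch of the syntax the Neo4j 2.0 manual describes for index hints and uniqueness constraints. It assumes the Actor label and the :Actor(name) index from earlier in this post, and it says nothing about how the milestone build shown in the video behaves.

```cypher
// Force the use of a specific index with a USING INDEX hint.
// The hint goes between MATCH and WHERE and names the identifier, label and property.
MATCH (actor:Actor)
USING INDEX actor:Actor(name)
WHERE actor.name = "Tom Hanks"
RETURN actor;

// Declare that no two Actor nodes may share the same name.
// This fails if duplicates already exist.
// Assumption: an existing plain index on :Actor(name) may have to be
// dropped first with DROP INDEX ON :Actor(name).
CREATE CONSTRAINT ON (actor:Actor) ASSERT actor.name IS UNIQUE;
```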