Wednesday 20 November 2013

My Cinema Knowledge: "my movies" aka Multi-language reconciliation using Freebase


In this first episode of "My Cinema Knowledge" I will try to describe my film catalog by mixing private information (my disk folders) with public information (Freebase).
I will use Ubuntu, OpenRefine, a small Python script and an RDF store, Virtuoso.




Step By Step How-to

  • Build a CSV with all the folder names on my disk using a Linux command ( find . -type d > myMovies.csv )
  • Import it into OpenRefine (I used LODRefine, a package including OpenRefine and the RDF extension)
  • Extract the movie name from the folder name, taking only the last part of the path
  • Add a reconciliation service based on the Freebase dump created previously (the making of is described in this post) and imported into a Virtuoso triple store
    For this I used the SPARQL-based reconciliation service feature of the RDF extension.
    Using a custom reconciliation service over Freebase, I am not limited to the English language provided by the Freebase reconciliation service.
  • After 10 minutes on a machine with 8 GB of RAM, these are the results (out of about 310):
    • 138 movies automatically recognized 
    • 66 movies with multiple choices (semi automatic)
    • 109 without a match
  • Reasons for the missing matches are:
    • Missing in Freebase (mostly Italian movies)
    • Missing Italian title in Freebase
    • Missing in my Freebase copy
    • Some intermediate folders (about 15)
  • I also hit a severe bug when selecting new matches: https://github.com/fadmaa/grefine-rdf-extension/issues/82 (grrrr)
  • UPDATE!!!
    There is also a cloud-based reconciliation service for Freebase, which now also works with the Italian language. It should be included in OpenRefine, but it does not work because of this bug: https://github.com/OpenRefine/OpenRefine/issues/805. You can make it work by adding a new standard reconciliation service with this address:

    http://reconcile.freebaseapps.com/reconcile 
  • Copy the reconciled data into a new column
  • Export the CSV. One row, for example, is:
    ./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l
  • Transform the CSV to RDF using a simple Python script based on python-rdflib. To install it (Ubuntu):
    sudo apt-get install python-pip
    sudo pip install rdflib
    To use the script:
    python myMoviesToRDF.py myMovies-csv.csv myMovies.ttl
  • Upload the data into my RDF store
  • Enjoy data analysis NOW!
    In the first attempt I used SPARQL queries to get the genre ranking, the director ranking and the director nationality ranking. This is only a first attempt; more will come soon!


    An interactive version here
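The first and third steps (listing the folders and keeping only the last part of each path) can also be sketched in Python. This is only an illustration under my own assumptions: the function names `list_folders` and `movie_name` are hypothetical, and in the workflow above the extraction was done inside OpenRefine, not with this code.

```python
import os
import tempfile


def list_folders(root):
    """Return every directory under root as './relative/path', like `find . -type d`."""
    folders = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # The root itself is reported as ".", everything else as "./sub/dir".
        folders.append("." if rel == "." else "./" + rel.replace(os.sep, "/"))
    return sorted(folders)


def movie_name(folder):
    """Keep only the last path component, e.g. './doppiati/1984' -> '1984'."""
    return folder.rstrip("/").split("/")[-1]


if __name__ == "__main__":
    # Demo on a throwaway directory tree instead of a real movie disk.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "doppiati", "1984"))
        for folder in list_folders(root):
            print(folder, "->", movie_name(folder))
```

Intermediate folders such as ./doppiati still show up as candidate "movies" with this approach, which is one of the reasons for missing matches listed above.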

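For the CSV to RDF step, here is a minimal sketch of what such a conversion could look like. It is not the original myMoviesToRDF.py script: to stay dependency-free it writes the triples by hand in N-Triples syntax (which is also valid Turtle) instead of using rdflib, and the `http://example.org/mymovies/` namespace and the `schema#about` predicate are hypothetical names chosen only for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/mymovies/"   # hypothetical namespace for my folders
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
ABOUT = BASE + "schema#about"           # hypothetical predicate: folder -> Freebase topic


def escape_literal(text):
    """Escape backslashes and double quotes for a quoted RDF literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_triples(folder, name, freebase_uri):
    """Turn one exported row into one or two triples (one if there is no match)."""
    path = folder[2:] if folder.startswith("./") else folder
    subject = "<" + BASE + quote(path, safe="/") + ">"
    triples = ['%s <%s> "%s" .' % (subject, RDFS_LABEL, escape_literal(name))]
    if freebase_uri:
        triples.append("%s <%s> <%s> ." % (subject, ABOUT, freebase_uri))
    return triples


def csv_to_turtle(csv_text):
    """Convert rows of 'folder, name, freebase uri' into triple lines."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if len(cells) < 2 or not cells[0]:
            continue  # skip blank or incomplete rows
        uri = cells[2] if len(cells) > 2 else ""
        lines.extend(row_to_triples(cells[0], cells[1], uri))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l\n"
    print(csv_to_turtle(sample), end="")
```

With rdflib the same triples would be added to a Graph and serialized as Turtle; the hand-written output here just keeps the sketch runnable without extra packages.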

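The ranking queries mentioned in the last step are not shown in this post, so the following is only a guess at their shape. It builds a SPARQL query that counts my movies per value of a Freebase property (for example `film.film.genre` or `film.film.directed_by`) and the URL to send it to a SPARQL endpoint. Assumptions: the endpoint address is Virtuoso's usual default, the catalog entries point to Freebase topics through a hypothetical `schema#about` predicate, the `format` request parameter is accepted by the endpoint, and the Freebase property names should be checked against your own copy of the dump.

```python
from urllib.parse import urlencode

ENDPOINT = "http://localhost:8890/sparql"              # assumed local Virtuoso endpoint
LINK = "http://example.org/mymovies/schema#about"      # hypothetical linking predicate

QUERY_TEMPLATE = """PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?name (COUNT(DISTINCT ?movie) AS ?total)
WHERE {
  ?entry <%(link)s> ?movie .
  ?movie ns:%(prop)s ?value .
  ?value ns:type.object.name ?name .
  FILTER (lang(?name) = "en")
}
GROUP BY ?name
ORDER BY DESC(?total)
LIMIT %(limit)d
"""


def ranking_query(link_predicate, freebase_property, limit=10):
    """Build a query ranking the values of one Freebase property over my movies."""
    return QUERY_TEMPLATE % {
        "link": link_predicate,
        "prop": freebase_property,
        "limit": limit,
    }


def request_url(endpoint, query):
    """Build the GET URL for the query; nothing is sent over the network here."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    query = ranking_query(LINK, "film.film.genre")
    print(query)
    print(request_url(ENDPOINT, query))
```

The director nationality ranking would need one more hop in the WHERE clause (from the movie to the director, then to the nationality), which this sketch leaves out.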
