Berlin Buzzwords 2013: Zeno Gantner – Recommender Systems Experiments with MyMediaLite #bbuzz

Okay, thank you. I will talk about offline experiments for recommender systems: if you have several different methods that you potentially want to deploy, how can you compare them to decide which one is better? I will use the MyMediaLite package as an example throughout the talk, but the core message is independent of this tool, so even if you do not intend to try it out, you can still follow the talk and take away some interesting messages, unless you are already a machine learning guru. Most of the lessons are not even limited to recommendation, but apply to everything where you want to make predictions from data. And at the end there will be time for questions.

I work here in Berlin at Nokia, on the HERE maps platform. We are around 800 people, and this platform powers several mobile apps by Nokia, for instance Nokia HERE Drive, which you can see in the screenshot: an app that turns your mobile phone into a kind of car navigation system. There are also other apps like HERE Maps or HERE Transit, which is a public transport application. We also serve external customers like Microsoft's Bing Maps and Yahoo Maps; if you go to Bing Maps you will see in the lower right corner a note saying "powered by Nokia", because we deliver map tiles in real time to Bing Maps. Bing Maps is also used on Facebook, so it is very likely that you have seen those maps. We also work with major car companies like BMW, Volkswagen, Mercedes, Toyota and so on. Basically, we deliver maps to everything that has a screen.

I personally work on the search team at HERE maps. I think it's an awesome team; we are in Boston and here in Berlin, and some of us participate in open-source development, for instance contributing to the latest version of Apache Lucene. You can see the links down there if you are interested in the details; I will also publish the slides later, so you don't have to note this down now. We are also regular participants of this conference, and there are links to some of the talks given in recent years, so you can look up the videos if you are interested in what we are doing and what we are talking about. So I think we are a pretty awesome team, and it's really fun to work here.

Good, now back to the topic. Imagine your boss wants you to implement a recommender for an existing system. You already have a website, you already have lots of data, and you want to add a personalization feature. Maybe you even have some idea which algorithms to use, or you already have implementations of those algorithms, either because you wrote them yourself or because you use some open-source toolkit, for instance Mahout, or one of the alternatives like GraphLab. So you pretty much already have the major building blocks to come up with a good recommender system. But there is still a problem: none of those packages provides a ready-to-use solution for exactly your use case, and of course none of them comes with models that are already fine-tuned for exactly your use case.

So what are you going to do? You have data, you have algorithms, but you do not really know how to glue them together. Of course, in the end you will let the user decide what's best, so live user testing is something you want to consider, but very often you need to know what to put in front of those live users; you cannot just try out everything. And in order to decide what to deploy, what to ship, you may want to use offline experiments. That means you use data that you collected on your website, you partition this data into one part that you use for training your model, and then you try to predict what is in the other part of your data, the test set.

This is not only something that happens when you are interested in real-world deployment and coming up with a real system; it also happens in other places. Take for instance data mining competitions like the KDD Cup or Kaggle. Who here knows Kaggle? Has anybody already participated in a competition there? I think I know someone here. Anyway, if you want to participate there, you also need to compare different methods, and you need to be sure what to submit to be able to compete. So this is also a place where knowledge and skills about offline experiments are useful. And of course, if you are a researcher and came up with a new fancy data mining or machine learning technique, you want to sell it to your peers: you want to prove that this method is worth considering, so you compare it to existing methods, and for that you also need to know how to perform offline experiments. The question is the same in all these cases: you have data, you have methods, how do you compare them?

I will use the MyMediaLite software as an example throughout the talk, and we will cover a few topics: how do you deal with your data, which methods could you compare against, what do you want to avoid when comparing methods, how do you actually measure the performance of a method, how do you fine-tune a method, and how can you make sure that everything you try out is reproducible?

MyMediaLite is a recommender system toolkit, which means it contains many different recommender system algorithms for collaborative filtering, including different matrix factorization techniques and other state-of-the-art recommender methods, and it also contains an extensive evaluation framework, maybe because it comes from a more academic heritage than, let's say, Mahout. But I would argue that this evaluation framework is practical outside of academia as well. So what is MyMediaLite? It is basically a .NET library, written in C#, which does not mean it is limited to Windows or anything like that; it runs basically everywhere, and the main development actually happens on Linux. You can use the library from different programming languages like C#, Python, Ruby, F#. If you are more inclined to use Java, there are also two different Java ports of the package; one of those ports is used in a plugin for the RapidMiner software, so there you have a nice GUI and you can use the different recommenders inside that GUI. There are regular releases of the software on a two-to-three month cycle, so pretty rapid. It is rather simple, so it's easy to use, and you have lots of choice in selecting different methods and different ways of evaluating them.

It is of course free software, it is documented, and it is tested. If you are interested in the details, have a look at the website; the source code is also available on GitHub, so if you have suggestions for improvements you can send us pull requests. I said it is a library, but it is not only a library: it also provides several command-line tools, so you can use a lot of the functionality without having to program, just by using simple command-line tools that read data, run the recommenders, and give you results. During this talk I will mostly focus on the item recommendation command-line tool, and if you want to try out the examples you will find them all on GitHub; I uploaded them yesterday, so you can replay everything. There is a makefile and all the stuff you need, and you can just try out the command lines I present to you. Throughout the talk, whenever there is something in this light blue with the MyMediaLite logo next to it, it is an example of how you would concretely call MyMediaLite to try out a certain feature, and most lessons or guidelines will be accompanied by such an example. But still, everything you will hear you can follow without worrying about the software.

Okay, let's start with the data. Ted already said something about what the data can look like. In some systems you have explicit feedback, where users explicitly state their preference: "I like this a lot", "I don't like this", "I give this three and a half stars". The problem is that it is not always there, and it is hard to get users to contribute such feedback, so most systems won't have it; we won't look any further at it for now. What is much more interesting is implicit feedback: actions by the users that happen during the regular interaction with the system. It is not necessarily an explicit statement of a user preference; by clicking on something the user does not explicitly say "I like this" or "I don't like this", it is just something that we want to infer. Such implicit feedback can be, for instance, views on websites, clicks, or purchase actions. And very often this is positive-only: you only observe what a person buys, but you do not observe what a person decides against buying, because people just don't tell you "I didn't buy this product because I didn't like it." So we will concentrate on implicit, positive-only feedback.

How does such data look? This should look familiar to you. We have users; users can be specified by IDs, which can be numerical but could also be strings, names, whatever. And we have items; Ted called them things: everything that you want to recommend, for instance newspaper articles, products, web links. Item IDs can also be numbers or almost arbitrary strings. And maybe you have timestamps, but timestamps are optional; maybe you didn't record them, maybe you are not interested in them, maybe you are very interested in them. Usually you would have this in, let's say, a comma-separated file, and MyMediaLite can read this in. You provide a training file and a test file, because, as I said, for offline experiments you have one part of your data that you use for training your models and one part of the data where you try to predict the user-item interactions; this test set is used to judge how well your algorithm actually works. You just provide those two files to MyMediaLite and it will run the experiments for you.
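To make the data format concrete, here is a minimal Python sketch of reading such positive-only feedback from a comma-separated file. This is just an illustration, not MyMediaLite code; the file names and the user/item/timestamp column order are assumptions for the example.

```python
# Minimal sketch (not MyMediaLite itself): reading positive-only feedback
# from a comma-separated file with lines like "user_id,item_id,timestamp".
import csv

def read_feedback(path):
    """Return a list of (user, item, timestamp) tuples."""
    interactions = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row or row[0].startswith("#"):
                continue  # skip empty and comment lines
            user, item = row[0], row[1]
            timestamp = row[2] if len(row) > 2 else None  # timestamps are optional
            interactions.append((user, item, timestamp))
    return interactions

# Example usage: separate files for training and testing, as described above.
# train = read_feedback("train.csv")
# test = read_feedback("test.csv")
```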

One thing you of course need to take care of is that those two sets don't overlap, because then it gets kind of easy: you will have models that can exactly predict what is in the test set, because it is already in the training set. I think this should be obvious.

Sometimes you have not decided yet on how to do your split, because it is not obvious how to split, and MyMediaLite can also do this for you. The simplest split is a random split, which means you just say which ratio of your data you want to use for evaluation and which you want to use for training; in this example we use one quarter of our data for testing. How does this work in detail? You take your list of user-item interactions, you shuffle it, and then you just split it: the first 75% for training and the other 25% for testing. This is very simple, but there are some problems with it: you do not take temporal trends into account, and you do not use all your data for testing, only 25%, and maybe you want to make optimal use of your data.

There is a method where you can make use of all data points for evaluation, and this is called cross-validation, or k-fold cross-validation. How does this work? You partition your data into k different parts, in this example into four parts, and you use each part once for testing while the rest of the data is used for training. That is what is drawn here: first you use the last 25% for testing, then the next-to-last, and so on, and in the end you just average over your results. The advantage is that it really uses each data point for evaluation. It still does not take temporal trends into account, of course.

There is also a way to take such trends into account in your tests, and that is a chronological split. There you do not shuffle your data, but you sort it according to the timestamps that you hopefully have, and then you use the past up to a certain point in time to predict the future. The future here is of course not the real future, but the future at that point in time, as you recorded it. The nice thing is that this takes trends in your data into account, for instance the time of day or the day of the week, because consumer behavior may be different on weekends and during weekdays, and we know that consumer behavior is different in certain seasons: before Christmas people buy different stuff, maybe they buy larger gifts than during the rest of the year, and so on. There are also products that can be trending: a new movie will be received differently by users when it is new than after ten years. If you want to use this feature in MyMediaLite, you just tell it that you want a chronological split and you give it the ratio, for instance again 25%, which means it will use the last 25% for validation; or you can provide it with a date, and then this will be the splitting date: everything before will be used for training and everything after will be used for validation.
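Here is a minimal, generic Python sketch of the three splitting strategies just described, to make them concrete. This is an illustration only, not MyMediaLite's implementation; the function names and the (user, item, timestamp) tuple layout follow the reading sketch above.

```python
# Generic sketches of the three splits described above (illustration only,
# not MyMediaLite's implementation).
import random

def random_split(interactions, test_ratio=0.25, seed=1):
    """Shuffle, then cut off the last test_ratio fraction as the test set."""
    data = list(interactions)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]          # (train, test)

def k_fold_splits(interactions, k=4, seed=1):
    """Yield k (train, test) pairs; each part is used exactly once for testing."""
    data = list(interactions)
    random.Random(seed).shuffle(data)
    for i in range(k):
        test = data[i::k]                  # every k-th interaction forms one fold
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

def chronological_split(interactions, test_ratio=0.25):
    """Sort by timestamp and use the past to predict the 'future'."""
    data = sorted(interactions, key=lambda x: x[2])   # x = (user, item, timestamp)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]
```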

So, what is that? Anybody know? It's a bassline. Okay, bad joke: a reggae bassline, which is spelled differently but sounds the same. So, what about baseline methods? Baseline methods are the methods that you compare against, and you should always compare against something, because absolute numbers have almost no meaning. If you measure your new fancy algorithm and in the end you get a number, let's say 0.6, what does it mean? You always need to look at numbers in relative comparison to other things, for instance: how well does my new fancy method work if I compare it to the existing state of the art, or even to a very simple method? Relative numbers can also have no meaning if you are doing it wrong, but I will come to that later.

So what do you want to compare against, what is a good baseline? First of all, you want to check against simple baselines; one good baseline is the strongest solution that is still very simple. Maybe you are trying to improve an existing system; then of course the existing solution, if you already have a recommender algorithm in place, is also a baseline: you want to compare against it and be better than it. There are also certain standard solutions for which open-source implementations already exist, like kNN-based collaborative filtering or plain vanilla matrix factorization methods, and often you want to compare against these as well, because you should be able to judge why your method is better than all of those. Otherwise you usually do not have a reason to use the new method: if you are not better than standard solutions, not better than the existing solution, not better than something very simple, why would you use your current method? Why not stick with the existing solution, or use something simpler?

How does this work in MyMediaLite? MyMediaLite lets you select a recommender, and among them there are many simple baseline methods. For instance, you want to be better than random recommendations; that is something you should try to achieve. And if you have some fancy personalization method, you also want to be better than just recommending the most popular items. You can also take the most popular items by certain attributes: let's say you have a movie recommender and a user mostly watches comedy movies; then maybe instead of showing this user the globally most popular movies, you show them the most popular comedy movies instead of, say, action movies.
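As a concrete picture of how little machinery such a baseline needs, here is a minimal sketch of the "most popular" recommender mentioned above. It is an illustration only, not MyMediaLite's implementation; the function name, the `exclude_seen_for` parameter, and the tuple layout are assumptions carried over from the earlier sketches.

```python
# A minimal sketch of the "most popular" baseline (illustration only,
# not MyMediaLite's implementation).
from collections import Counter

def most_popular(train, k=10, exclude_seen_for=None):
    """Recommend the k items with the most positive-only interactions,
    optionally excluding items the given user has already interacted with."""
    counts = Counter(item for _, item, _ in train)   # train: (user, item, timestamp)
    seen = {item for user, item, _ in train if user == exclude_seen_for}
    ranked = [item for item, _ in counts.most_common() if item not in seen]
    return ranked[:k]
```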

Okay, apples and oranges. I already mentioned that sometimes a relative comparison still doesn't tell you anything, and this happens when you compare the wrong things. It should be obvious not to compare the wrong things, but maybe it is not that obvious, because it actually happens more often than you may think, and it happens to really smart people too. For instance, there was a paper at this year's ICML, the International Conference on Machine Learning. It is a really high-profile conference, lots of math, the really big names in machine learning publish there. There was a paper about recommendation, so I was interested and read it. It suggests a very new and fancy method, lots of math, beautiful formulas, and then I had a look at the evaluation. Of course the evaluation was done the standard way: you take an existing data set, you split it like I told you, you try to predict the held-out part, and you compare against existing methods. For rating prediction, which is what this paper was about, there is a gold standard: the Netflix Prize solution. The Netflix Prize was a competition several years ago with prize money of one million dollars, so it was very competitive, and in the end, I think after two years, a huge team won it with a huge blended solution that consisted of hundreds of individual models. What the authors claim in the abstract is that their new method also outperforms the Netflix winner's RMSE, which is the metric used, as well as other baselines. So they claim to beat the Netflix Prize solution, which is pretty remarkable, especially for a single model.

So I had a look at the details; sorry, I have to go a little bit into the details here, but I will try to explain them. They have this nice figure: the four highlighted lines are the new method, the dashed lines are a baseline method, in this case called SVD, a kind of matrix factorization method, and there is a very small dotted line that represents the Netflix Prize solution. I don't know whether you can see it in the back of the room, but the suggested method in one configuration beats it, which is pretty remarkable. But then, if you read the caption of the figure, it says: the Netflix winner's RMSE is based on the qualifying set from the Netflix competition, while our results use a randomly sampled data set of similar size. So the size of the data set is the same, but the way it is sampled is different: the Netflix winners had to use past data to train the model and then predict the future, whereas the authors sampled ratings from everywhere, used part of that sample for training and the rest for prediction. So they had access to some part of the future in order to predict the future, and of course that makes predicting the future a little bit easier. If you look at the y-axis, the baseline results are around 0.88, and if you look at the literature on the Netflix Prize data set, whenever you have such chronological splits, where you really have to predict the future, those baseline matrix factorization methods never go below 0.9; they are always worse (lower is better here). So chronological splits can be much harder than random splits. In this case the authors just claimed "we have a number here that is lower than the other number, so we win." They did mention that they measured in a different way, but they still drew the conclusion that they were better. It is like comparing the speed of two cars: you have an old car and a new car, you measure the old car on a cross-country road and the new car on a racing track, and then you say "we only needed five minutes for the track" and compare that against the cross-country measurement. Would you be convinced by this? I was not, and I hope you are not either. And if this happens to such people, and the authors of the paper are big names in machine learning, it can happen to everyone, so it is something you need to be careful about when you perform your own experiments.

There are several lessons in here. First, of course, we should not compare between different kinds of splits, like simple random splits and chronological splits. But we also learn that baselines are important, because they can help us debug experiments. What the authors did very well is that they not only reported their new results against previously published results, but they also performed experiments with baseline methods and reported these, because that allows the reader to judge much better what is going on. Even if you do not publish your results, comparing against baselines allows you to judge much better what is going on; it makes it much easier to debug things. Another question is of course whether rating prediction is really the most important problem to work on in 2013, but that's a different story.

Okay, so I talked a lot about measurements and metrics, but I haven't mentioned any yet. So how do you measure things in your offline experiments?

Getting the right metric is really not easy, and very often you can only approximately find the right metric, but there are some things you need to keep in mind. You should really know your goal: what does the final system do, what should it do, how should it help the user? This goal should influence your choice of metric, so you need to know what you want to measure. You should also be very careful about your metrics; always criticize your metrics, because it can happen that they ignore important aspects of your problem, and, as I said, they are just approximations of user behavior. So you should always have, or at least watch, several metrics. And besides looking at the raw numbers from the metrics, you should also eyeball your results: do not rely only on the numbers, but let your model predict some items for some users and really look at the results, because there may be things that you fail to see with your metrics. For instance, there are WTF results, "what the f*** recommendations". There is a nice blog post about this by Daniel Tunkelang of LinkedIn; the link is in here, you can't really see it now, but I will provide the slides later so you can read it; it is a blog post really worth reading. Daniel basically suggests not only measuring traditional metrics, but also measuring the WTF results you see in the first k results.

So I said there are many different metrics; what are good metrics? One good metric that I want to suggest is precision at k, because it is very simple. What is precision at k? It is just the number of correct items in the top k results, divided by k. By "correct" I mean an item that is in your test data; it is not really "correct" in the real world, but for the sake of the experiment these are the correct items. The choice of k is specific to your application; it may depend on things like screen size and so on. The good thing is that it is very simple, easy to understand, and also easy to explain, for instance to your manager. There are of course other measures like NDCG, mean average precision, and so on, but they are all more complicated, and you need to discuss a lot more why you use them. Again, in MyMediaLite you can just provide a list of measures to the tool and it will output the results of the experiment in those measures.

One example for precision at k: this is precision at 4. Why would it make sense to have precision at 4? Maybe you have a restaurant recommendation application, and it is a mobile application, so you have limited screen real estate, and you know that on average you can display four different results, so you use precision at 4. How do you measure it? In this example you know that the first result is bad, the second is good, the third is bad, the fourth is bad, and the rest you can ignore. You just count the number of good results in your top four and divide it by 4, so here the precision at 4 is 0.25, one quarter. That's it.
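To make the worked example concrete, here is a minimal Python sketch of precision at k. It is an illustration only, not MyMediaLite's evaluation code; the function name and arguments are assumptions.

```python
# A minimal sketch of precision@k as described above (illustration only).
def precision_at_k(recommended, test_items, k=4):
    """Fraction of the top-k recommendations that appear in the test data."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in test_items)
    return hits / k

# The worked example above: only the second of four results is "good".
# precision_at_k(["a", "b", "c", "d"], {"b"}, k=4)  -> 0.25
```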

Okay, the next topic is hyperparameter tuning. What is this? Every method that is, let's say, more complex than Most Popular has certain hyperparameters. These are values that you put into the learning or prediction algorithm that decide certain thresholds and so on; for instance, there are regularization parameters that control overfitting, learning rates for gradient descent techniques, stopping criteria for iterative algorithms, and so on, and they all influence the quality of your recommendations. If someone comes up with a complicated method that does not seem to have those, they are still there, just implicitly, and then you don't tune them, which is also not a good idea. If you do not tune them properly, you may get suboptimal results, so you always want to optimize them, and you also want to do this for your baselines, because otherwise you do not have a fair comparison. How can you do this in MyMediaLite? There is, for example, WRMF, a matrix factorization technique that I think is also available in Mahout, and it has one of those regularization options, so you can just set the recommender options and that's it. But you would still have to do some manual work to find a good value here. Of course this can be automated, and my suggestion is again: don't get too fancy. Very often simple methods really do it; it is an easy script, for instance, to implement grid search, which just tries out lots of parameter values in a brute-force way, and this will do it in most cases. There are some more advanced methods, like the simplex method, but as I said, very often grid search will do.

So what is grid search? This is a picture that I stole from last year's talk by my colleague Stefan, but it gives some nice details. Suppose you have two different parameters: you just try out several values for each of those parameters, and then you pick the combination, marked red here, that gives you the best result according to your metric. One question I often get is: "I have this one parameter and I have no idea at all what value to set it to; it could be 1,000, it could be 0.005; what value should I pick?" This example gives away how to solve this: instead of using a linear scale, use a logarithmic scale, for instance powers of 10 or powers of 2, so you can cover large numerical ranges with your grid search. Grid search has several real advantages: it is very simple, it is brute force, which is of course not an advantage in itself, but it is also embarrassingly parallel, so it is really easy to parallelize, either on your box or on a cluster; you don't need anything fancy to parallelize it. If you want to learn more about grid search, I can recommend the practical guide to SVM classification; it is not about recommendations but about support vector machines, and it is a very accessible piece written by the authors of the LIBSVM and LIBLINEAR software. It explains how grid search works and how SVMs work, and I think it is a really nice paper.

Okay, the last topic, or last guideline, is reproducible experiments. I think Uwe Schindler, who stood here yesterday, also talked about random testing; if you want to find bugs with random testing, you also need to be able to reproduce the situations you encountered while testing randomly, and this you can do by fixing the random seed that you use for your random number generator. You can also do this with MyMediaLite: you can fix the random seed, and with this you can better debug your experiments, but you can also make sure, for instance, that you always compare on the same random split, because of course the random split is also influenced by the choice of the random seed. There are also methods that initialize their model with random numbers, so that is another case where you want to use this.
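Here is a minimal, generic sketch of a grid search with a logarithmic grid and a fixed random seed for a reproducible split, pulling together the two points above. It is an illustration only, not MyMediaLite code; `train_model` and `evaluate` are hypothetical placeholders for your own training and evaluation code, and `random_split` is the helper sketched earlier in the splitting section.

```python
# Minimal grid search sketch (illustration only, not MyMediaLite code).
# train_model and evaluate are placeholders for your own code; the split is
# fixed via a random seed so every parameter combination is compared on
# exactly the same data. The inner loop is embarrassingly parallel.
import itertools

def grid_search(interactions, train_model, evaluate):
    train, test = random_split(interactions, test_ratio=0.25, seed=1)  # reproducible split

    # Logarithmic grids cover large numerical ranges with few points.
    regularization_values = [10.0 ** e for e in range(-4, 1)]   # 0.0001 .. 1
    factor_counts = [2 ** e for e in range(3, 8)]               # 8 .. 128

    best, best_score = None, -1.0
    for reg, factors in itertools.product(regularization_values, factor_counts):
        model = train_model(train, regularization=reg, num_factors=factors)
        score = evaluate(model, test)        # e.g. precision@k on the test set
        if score > best_score:
            best, best_score = (reg, factors), score
    return best, best_score
```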

Of course there are a lot of other aspects to reproducible experiments besides fixing the random seed. One is that you are in control of what you tried at which point in time. For this I suggest that you put everything into a version control system: your data, your software, your scripts, your configuration. If you have lots of data, you cannot put all of it into Subversion or Git, and you need to come up with clever solutions there, but you should have a system that enables you to go back in history and really reproduce results that you came up with. Maybe your boss will come around the corner and ask you about this or that detail that you told him about three weeks ago, and you just pull it out of your laptop but no longer know how to reconstruct it; that is kind of difficult if you want to make decisions based on data. You can also use build tools like make, which were originally developed for building software, to automate such quantitative experiments, because make has certain advantages: it knows when to rerun your different data processing steps, so if you change something at, let's say, the beginning of the pipeline, it knows exactly what to rerun. There is also a nice blog post about using make for data science experiments that I can recommend.

Another aspect of reproducible experiments is evaluation. It is a good idea to reuse evaluation code, because it is not always easy to implement evaluation protocols, that is, the splitting and also the metrics; many details can be done differently, and wrongly. If it is wrong, it is wrong, and you have a problem. If it is merely different, you also have a problem, because then you again have this apples-and-oranges thing. So maybe you want to use the same software at least for evaluating the output of different recommenders: the recommendations can come from different tools, but you use the same tool for computing your metrics in the end. This is also something you can do with MyMediaLite: there is a recommender called the external item recommender, which is just given a file with scores, and those scores can be generated by any software; then you can use everything that is already in MyMediaLite for evaluation. As I said, evaluation protocols are not easy to get right, this ensures comparability, and you may also be lazy and not want to re-implement your metrics in every environment.

Okay, let's put it in a nutshell. What have you seen? Some short guidelines, or messages from the talk, that you can maybe take home with you: split your data appropriately for your use case; do not compare apples and oranges; compare against simple, but also strong, baselines; we have seen precision at k as a simple metric that is also easy to explain; grid search is a simple method for tuning your models; and you should strive to make your experiments reproducible. Of course MyMediaLite can help you with some of those things if you want to work on recommender system data, so I encourage you to try it out. As I said, you can see all the examples on GitHub, I will upload the slides tonight, and please have a look at the MyMediaLite home page and maybe also at the source code if you are interested. And that's it. Thank you.