Order, context and popularity bias in next-song recommendations

Andreu Vall (andreu.vall@jku.at) · Massimo Quadrana (mquadrana@pandora.com) · Markus Schedl (markus.schedl@jku.at) · Gerhard Widmer (gerhard.widmer@jku.at)

Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Austrian Research Institute for Artificial Intelligence, Vienna, Austria
Pandora Media Inc., Oakland, CA, USA

International Journal of Multimedia Information Retrieval (2019) 8:101–113

The availability of increasingly large multimedia collections has fostered extensive research in recommender systems. Instead of capturing general user preferences, the task of next-item recommendation focuses on revealing specific session preferences encoded in the most recent user interactions. This study focuses on the music domain, particularly on the task of music playlist continuation, a paradigmatic case of next-item recommendation. While the accuracy achieved in next-song recommendations is important, in this work we shift our focus toward a deeper understanding of fundamental playlist characteristics, namely the song order, the song context and the song popularity, and their relation to the recommendation of playlist continuations. We also propose an approach to assess the quality of the recommendations that mitigates known problems of off-line experiments for music recommender systems. Our results indicate that knowing a longer song context has a positive impact on next-song recommendations. We find that the long-tailed nature of the playlist datasets makes simple and highly expressive playlist models appear to perform comparably, but further analysis reveals the advantage of using highly expressive models. Finally, our experiments suggest that the song order is not crucial to accurately predict next-song recommendations.

Keywords Music recommender systems · Music playlist continuation · Sequential recommendation · Collaborative filtering · Recurrent neural networks

1 Introduction

Automated music playlist continuation is a specific task in music recommender systems where the user sequentially receives song recommendations, producing a listening experience similar to traditional radio broadcasting. Sequential recommendation scenarios are in fact very natural in the music domain. This is possibly explained by the short time required to listen to a song, which results in listening sessions typically including not one, but several songs.

According to interviews with practitioners and postings to a dedicated playlist-sharing website, Cunningham et al. [8] identified the choice of songs and the song order as important aspects of the playlist curation process. As we review in Sect. 2, some approaches to automated music playlist continuation take into account the current and previous songs in the playlist and the order of the songs in the playlist to recommend the next song. However, to the best of our knowledge, previous works do not explicitly analyze the impact of exploiting this information for next-song recommendations. We refer to the current and previous songs in a playlist as the “song context” available to the recommender system when it predicts the next song. This terminology is borrowed from language models and should not be confused with the incorporation of user’s contextual information into the recommender system.

In this work, we compare four well established and widely used playlist models: a popularity-based model, a song-based collaborative filtering (CF) model, a playlist-based CF model and a model based on recurrent neural networks (RNNs). These playlist models are of increasing complexity and, by design, are able to exploit the song context and the song order to different extents. By analyzing and comparing their performance on different playlist continuation off-line experiments, we derive insights regarding the impact that the song context, the song order and the bias toward popular music have on next-song recommendations. For the evaluation of the off-line experiments, we propose to use metrics derived from complete recommendation lists, instead of from the top K positions of recommendation lists. This provides a more complete view on the performance of the playlist models.

The remainder of this paper is organized as follows. Section 2 reviews the related work on automated music playlist continuation. Section 3 introduces the guidelines for the off-line experiments conducted throughout this work. We describe the recommendation task that the playlist models must fulfill and define the metrics employed to assess their performance on the task. Section 4 describes the four playlist models considered. Section 5 presents the datasets of hand-curated music playlists on which we conduct the off-line experiments. Section 6 elaborates on the results of the off-line experiments and is divided into three parts, which discuss the impact of the song context, the popularity bias and the song order on next-song recommendations, respectively. Conclusions are drawn in Sect. 7.

2 Related work

A well-researched approach to automated music playlist continuation relies on the song content. Pairwise song similarities are computed on the basis of features extracted from the audio signal (possibly enriched with social tags and metadata) and used to enforce content-wise smooth transitions [10,18,21,22,25]. Recommendations based on content similarity are expected to yield coherent playlists. However, pure content-based recommendations cannot capture complex relations and, in fact, it does not hold in general that the songs in a playlist should all sound similar [19].

Playlist continuation has also been regarded as a form of collaborative filtering (CF), making the analogy that playlists are equivalent to user listening histories on the basis of which songs should be recommended. Playlist-based nearest-neighbors CF models and factorization-based CF models exploit the full song context when making next-song recommendations [1,4,11]. Song-based nearest-neighbors CF models [29] are not common in the playlist continuation literature. However, Hidasi et al. [12] show in the domains of e-commerce and video streaming that an item-based CF model that predicts the next item on the basis of only the current item can effectively deal with short histories. In general, CF models disregard the song order, but it is worth noting that the model presented by Aizenberg et al. [1] accounts for neighboring songs, and the model introduced by Rendle et al. [27] for on-line shopping is aware of sequential behavior.

The Latent Markov Embedding introduced by Chen et al. [6] models playlists as Markov chains. It projects songs into a Euclidean space such that the distance between two projected songs represents their transition probability. The importance of the direction of song transitions is evaluated by testing a model on actual playlists and on playlists with reversed transitions, yielding comparable performance in both cases. McFee and Lanckriet [23] also treat playlists as Markov chains, modeled as random walks on song hypergraphs, where the edges are derived from multimodal song features, and the weights are learned from hand-curated music playlists. The importance of modeling song transitions is assessed by learning the hypergraph weights again but treating the playlists as a collection of song singletons. When song transitions are ignored, the performance degrades. These works examine the importance of accounting for song transitions and their order, but the Markovian assumption implies that only adjacent songs are considered.

Hariri et al. [11] represent songs by latent topics extracted from song-level social tags. Sequential pattern mining is performed at the topic level, so that given seed songs, a next topic can be predicted. Re-ranking the results of a CF model with the predicted latent topics is found to outperform the plain CF model. This approach considers the ordering but only at the topic level, which is more abstract than the song level.

Hidasi et al. [12] propose for e-commerce and video streaming an approach to sequential recommendation based on the combination of RNNs with ranking-aware loss functions. This approach has gained attention and has been further improved and extended [13,31]. Jannach and Ludewig [14] have applied it to the task of automated music playlist continuation in a study that compares the performance of RNN models and session-based nearest-neighbors models for sequential recommendation. Among other analyses, Jannach and Ludewig question whether the computational complexity of RNN models is justified. Recommendation models based on RNNs consider the full item context in sequences and are also aware of their order.

For a comprehensive survey on automated music playlist continuation, we point the interested reader to Bonnin and Jannach [4] and Ricci et al. [28, chap. 13].

We conducted preliminary studies preceding this work analyzing the importance of the song order and the song context in next-song recommendations [32,33]. This paper further extends these works by incorporating a detailed discussion of the proposed evaluation methodology, an additional playlist model (namely a playlist-based CF model), an analysis of the impact of the popularity bias of music collections in next-song recommendations and more conclusive experiments to determine the importance of the song order. We also provide the full configurations and training details for the playlist models.

3 Evaluation methodology

We propose an evaluation methodology based on the ability of playlist models to retrieve withheld playlist continuations. We conduct the next-item recommendation experiment proposed by Hidasi et al. [12] and then propose approaches to interpreting the obtained results and to comparing the performance of different playlist models.

We are aware that off-line evaluation approaches are approximations of the actual recommendation task, and they may not be able to fully estimate the final user satisfaction. However, the aim of this work is to understand the importance of the song context, the song order and the bias toward popular music on next-song recommendations. In this sense, the proposed off-line evaluation methodology serves this purpose well, because it allows the systematic comparison of different playlist models under controlled conditions.

3.1 Next-song recommendation experiment

A collection of music playlists is split into training and test playlists. A trained playlist model is evaluated by repeating the following procedure over all the test playlists, which, for clarity, we describe alongside the example depicted in Fig. 1. We consider a test playlist (e.g., $p = (s_3, s_5, s_2)$). In the first step, we show the model the first song in the playlist ($s_3$). The model ranks all the songs in the dataset according to their likelihood to be the second song in the playlist. We keep track of the rank attained by the actual second song in the playlist ($s_5$ attains rank 3). We also keep track of the fact that this is a prediction for a song in second position. In the second step, we show the model the first and the second actual songs in the playlist ($s_3, s_5$). The model ranks all the songs in the dataset according to their likelihood to be the third song in the playlist. We keep track of the rank attained by the actual third song in the playlist ($s_2$ attains rank 1), etc. In this way, we progress until the end of the playlist, always keeping track of the rank attained by the actual next song in the playlist and the position in the playlist for which the prediction is made.

We index the ordered list of next-song candidates from 1 (most likely) until N (least likely), where N is the number of unique songs in the dataset. A good playlist model is expected to rank the actual next song in top positions (rank values close to 1). On the other hand, a poor model would rank the actual next song in bottom positions (large rank values). A random model would, on average, rank the actual next song on positions around N/2.

Fig. 1 Illustration of the evaluation methodology. The playlist model is evaluated on the test playlist $p = (s_3, s_5, s_2)$. It progresses through p and ranks all the songs in the dataset according to their likelihood to be the next song. The actual second song, $s_5$, attains rank 3. The actual third song, $s_2$, attains rank 1

3.2 Assessing the quality of the recommendations

Previous research in automated music playlist continuation has summarized the distribution of attained ranks using metrics derived from the top K positions of the ordered lists of next-song candidates. For example, the recall at K (also named “hit rate” at K) is defined as the proportion of times that the actual next songs in the test playlists attain a rank lower than K [4,11,14,15]. The rationale behind fixing the length K, typically to a small value, is that, in practice, only the top K results are of interest to the end user of the recommender system. We claim that this approach has two important limitations: (1) values of K that are reasonable for on-line systems (with actual end users) are not necessarily reasonable for off-line evaluation (without end users); (2) arbitrarily fixing a value of K provides partial and potentially misleading information.

We first discuss the first limitation. A playlist may be extended by a number of potentially relevant songs, but off-line experiments only accept the exact match to the actual next song. The rank attained by the actual next song can be overly pessimistic, because the ordered list of next-song candidates can actually contain relevant results in better positions [22,24]. Therefore, the results of off-line evaluation approaches need to be understood as approximations of the expected performance of the playlist models. They cannot be interpreted in absolute terms but as a means to compare the relative performance of different models. In particular, values of K meaningful for on-line systems should not be literally transferred to off-line experiments.

We now address the second limitation. Even though the playlist models and the datasets will be presented in Sects. 4 and 5, respectively, we advance here some results for the sake of illustration. Figure 2 shows the recall curves of several playlist models for values of K ranging from 1 to the maximum number of song candidates. If we chose to focus on a fixed value of K, this would correspond to only observing a one-point cut of these recall curves, which is very partial information. Furthermore, as we have discussed, choosing a specific value of K can become arbitrary in off-line experiments, where the user feedback is missing. Finally, the information provided by a fixed value of K is potentially misleading because the recall curves of different playlist models cross each other at different values of K. That is, the best performing playlist model would depend on the chosen value of K.

Fig. 2 Recall curves for values of K ranging from 1 to the maximum number of songs in each playlist dataset. The circles indicate the length K where each playlist model achieves a recall at K of 50% and correspond to the median rank achieved by each model. The results on the top 10 positions are detailed in the boxes, where dots are superimposed only as a reminder of the discrete nature of the displayed values (the lines just connect the different recall values)

For these reasons, we propose to assess playlist models by examining their whole lists of ordered next-song candidates, as opposed to focusing only on an arbitrary number of top K positions in the lists. This provides a more complete view of the performance of the playlist models. Even though the complete recall curves displayed in Fig. 2 are informative, we instead propose to directly compare the whole distribution of ranks attained by each playlist model. We report the distribution of attained ranks by means of boxplots that represent the minimum, first quartile, median, third quartile, and maximum rank values (see, e.g., Fig. 3). Alternatively, we report only the median rank value if this facilitates the interpretation of the results (Fig. 5).

4 Playlist models

We describe the four playlist models considered in our experiments. By design, the models are of increasing complexity and are able to exploit the song context and the song order to different extents (Table 1). Hyperparameter tuning, if necessary, is performed on validation playlists withheld from the training playlists.

Table 1 Summary of the playlist models

Playlist model   Context length   Order awareness
Popularity       0                ✗
Song-CF          1                ✗
Playlist-CF      n                ✗
RNN              n                ✓

The context length is the number of songs considered by the model to predict the next song (n means all the songs shown to the model). Order awareness indicates if the model regards the order of songs in playlists

4.1 Song popularity (“Popularity”)

This is a unigram model that computes the popularity of a song s according to its relative frequency in the training playlists, i.e.,

$$\mathrm{pop}(s) = \frac{|P_{tr}(s)|}{|P_{tr}|}, \tag{1}$$

where $P_{tr}$ is the set of training playlists, $P_{tr}(s)$ is the subset of training playlists that contain the song s, and $|\cdot|$ denotes the number of playlists in each set. Given a test playlist, the next-song candidates are ranked by their popularity, disregarding the previous songs and their order. Despite its simplicity, the popularity-based model is a competitive playlist model [4,6].

4.2 Song-based collaborative filtering (“Song-CF”)

This is a CF model based on song-to-song similarities. A song s is represented by a binary vector $\mathbf{p}_s$ that indicates the training playlists to which it belongs. The similarity of a pair of songs s, t is computed as the cosine between $\mathbf{p}_s$ and $\mathbf{p}_t$, i.e.,

$$\mathrm{sim}(s, t) = \cos(\mathbf{p}_s, \mathbf{p}_t) = \frac{\mathbf{p}_s \cdot \mathbf{p}_t}{\|\mathbf{p}_s\| \, \|\mathbf{p}_t\|}.$$

Two songs are similar if they co-occur in training playlists, regardless of the positions they occupy in the playlists. We follow Hidasi et al. [12] and implement the song-based CF model such that next-song candidates are ranked according to their similarity only to the current song in the playlist, ignoring previous songs. This approach is relatively simple, but Hidasi et al. show its competitive performance for sequential recommendation on short sessions.

4.3 Playlist-based collaborative filtering (“Playlist-CF”)

This is a CF model based on playlist-to-playlist similarities. A playlist p is represented by a binary vector $\mathbf{s}_p$ indicating the songs that it includes. The similarity of a pair of playlists p, q is computed as the cosine between $\mathbf{s}_p$ and $\mathbf{s}_q$, i.e.,

$$\mathrm{sim}(p, q) = \cos(\mathbf{s}_p, \mathbf{s}_q) = \frac{\mathbf{s}_p \cdot \mathbf{s}_q}{\|\mathbf{s}_p\| \, \|\mathbf{s}_q\|}. \tag{2}$$

The score assigned to a song s as a candidate to extend a test playlist p is computed as

$$\mathrm{score}(s, p) = \sum_{q \in P_{tr}(s)} \mathrm{sim}(p, q), \tag{3}$$

where $P_{tr}(s)$ is the subset of training playlists that contain the song s. This model considers a song to be a suitable continuation for playlist p if it has occurred in training playlists that are similar to p. The similarity of a playlist pair (Eq. 2) and the score assigned to a candidate song to extend a playlist (Eq. 3) depend on the full playlist p, i.e., on the full song context, but they disregard the song order.

Playlist-based CF has proven to be a competitive playlist model [4,11,14,15]. It usually has an additional parameter defining the number of most similar training playlists on which Eq. 3 is calculated. We use all the training playlists because we find that this yields the best performance in our experiments (Appendix A.3).

4.4 Recurrent neural networks (“RNN”)

Recurrent neural networks are a class of neural network models particularly suited to learn from sequential data. They have a hidden state that accounts for the input at each time step while recurrently incorporating information from previous hidden states. We point the interested reader to Lipton et al. [20] for a review of RNN models.

We adopt the approach and implementation proposed by Hidasi et al. [12], where an RNN model with one layer of gated recurrent units (GRU) [7] is combined with a loss function designed to optimize the ranking of next-item recommendations. The model hyperparameters and architecture are detailed in Appendix A.4.

Given a test playlist, the RNN model considers the full song context and the song order and outputs a vector of song scores used to rank the next-song candidates.

Footnote 1: https://github.com/hidasib/GRU4Rec

5 Datasets

We evaluate the four playlist models on two datasets of hand-curated music playlists derived from the on-line playlist-sharing platforms “Art of the Mix” and “8tracks.” Both platforms allow music aficionados to publish their playlists on-line. Moreover, the Art of the Mix platform hosted forums and blogs for discussion about playlist curation, as well as social functionalities such as favoriting, or providing direct feedback to a user. The 8tracks platform also provides social functionalities, such as following users, liking, or commenting on specific playlists. Previous works in the automated music playlist continuation literature have chosen to work with collections derived from the Art of the Mix and the 8tracks databases because of their presumably careful curation process [4,11,15,22,23]. As an illustration of the users’ engagement, we refer the interested reader to the study presented by Cunningham et al. [8], which analyzes posts to the Art of the Mix forums requesting advice on, for example, the choice of songs, or song ordering rules.

Footnote 2: http://www.artofthemix.org
Footnote 3: https://8tracks.com
Footnote 4: Publishing playlists and interacting with individual users are still active services on the Art of the Mix, but the forums and blogs seem to be discontinued.

The “AotM-2011” dataset [23] is a publicly available playlist collection derived from the Art of the Mix database. Each playlist is represented by song titles and artist names, linked to the corresponding identifiers of the Million Song Dataset (MSD) [3], where available. The “8tracks” dataset is a private playlist collection derived from 8tracks. Each playlist is represented by song titles and artist names. Since we find multiple spellings for the same song–artist pairs, we use fuzzy string matching to resolve the song titles and artist names against the MSD, adapting the code released by Jansson et al. [16] for a very similar task.

Footnote 5: https://labrosa.ee.columbia.edu/millionsong

We use the MSD as a common name space to correctly identify song–artist pairs. In both datasets, the songs that could not be resolved against the MSD are discarded, with one of two possible approaches. The first approach consists in simply removing the non-matched songs. The original playlists are preserved but with skips within them, which we ignore. The second approach consists in breaking up the original playlists into segments of consecutive matched songs, yielding shorter playlists without skips. We show results obtained on playlists derived from the first approach, but experiments on playlists derived from the second approach yielded equivalent conclusions.

We keep only the playlists with at least 3 unique artists and with a maximum of 2 songs per artist. This is to discard artist- or album-themed playlists, which may correspond to bookmarking favorite artists, or saving full albums as playlists. While these are also valid criteria, we prefer to exclude them in this work. We also keep only the playlists with at least 5 songs to ensure a minimum playlist length. Songs occurring in fewer than 10 playlists are removed to ensure that the models have sufficient observations for each song.

We randomly assign 80% of the playlists to training and the remaining 20% to test. As in any recommendation task blind to item content, the songs that occur only in test playlists need to be removed because they cannot be modeled at training time. This affects the final playlist length and song frequency of the playlist datasets.

The filtered AotM-2011 dataset has 17,178 playlists with 7032 unique songs by 2208 artists. The filtered 8tracks dataset has 76,759 playlists with 15,649 unique songs by 4290 artists. Table 2 reports the distribution of playlist lengths, unique artists per playlist and song frequency in the datasets.

Table 2 Descriptive statistics of the filtered AotM-2011 and 8tracks playlist datasets. We report the distribution of playlist lengths, number of artists per playlist and song frequency in the datasets (i.e., the number of playlists in which each song occurs)

Dataset     Statistic              min   1q   med   3q   max
AotM-2011   Playlist length        5     6    7     8    34
AotM-2011   Artists per playlist   3     5    7     8    34
AotM-2011   Song frequency         1     8    12    20   249
8tracks     Playlist length        5     5    6     7    46
8tracks     Artists per playlist   3     5    6     7    41
8tracks     Song frequency         1     9    15    30   2320

6 Results

We assess the ability of the four considered playlist models, Popularity, Song-CF, Playlist-CF and RNN, to recover withheld playlist continuations as described in Sect. 3. By comparing the performance of the different playlist models on the same experiment, or the performance of the same model on different experiments, we reason about the importance of considering the song context and the song order for next-song recommendations. Furthermore, we study the impact of the song popularity on the performance of the different models. As a reference, all the results include the performance of a dummy model that ranks next-song candidates at random (we call this model “Random”).

6.1 Song context

Recall that Popularity predicts the next song disregarding the current and previous songs, i.e., it has no context. Song-CF predicts the next song on the basis of the current song but disregards the previous ones, i.e., it has a context of 1 song. Playlist-CF and RNN predict the next song on the basis of the full playlist, i.e., they have full song context.

Figure 3 reports the rank distribution of each playlist model. They are split by the position in the playlist for which the next-song prediction is made. We consider only predictions up to position 8, which represent roughly 90% of all the next-song predictions made in the AotM-2011 and the 8tracks datasets. From position 9 onward, the number of predictions quickly decreases and the results become less reliable.

Footnote 6: The position in the playlist for which the next-song prediction is made must not be confused with the song context length of the playlist model. For example, making a next-song prediction for a song in position 5, the playlist-based CF model has a context of 4 songs, while the song-based CF still has a context of 1 song (Table 1).

The results in Fig. 3 show that Popularity and Song-CF do not systematically improve their predictions as they progress through the playlists. This is the expected result because Popularity has no context, and Song-CF has a constant context of 1 song. Their rank distributions remain overall stable with fluctuations easily explained by the fact that at each position the models deal with different songs. On the other hand, Playlist-CF and RNN are aware of the full song context. The results in Fig. 3 show that the performance of Playlist-CF clearly improves as it progresses through the playlists, and the performance of RNN improves slightly but steadily. This indicates that Playlist-CF and RNN benefit from increasingly longer song contexts.

In terms of absolute model performance, Song-CF is the least competitive model, slightly better but not clearly different from the random reference. Popularity and RNN show the most competitive overall performances. Playlist-CF has difficulties when the song context is short, but it consistently improves as it gains more context, until it eventually outperforms Popularity.

Summary of main observations:

– Playlist-CF and RNN, aware of the full song context, improve their performance as the song context grows.
– Despite its simplicity, Popularity compares to RNN and, except for long contexts, outperforms Playlist-CF.
– Song-CF exhibits a poor performance.

Fig. 3 Song context experiments. Panels: (a) AotM-2011 dataset, (b) 8tracks dataset. Distribution of ranks attained by the actual next songs in the test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. Each panel corresponds to a playlist model. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. The boxplots report the distribution of attained ranks. Outliers are indicated with small horizontal marks. The number of next-song predictions made at every position is annotated in the boxplots

6.2 Popularity bias

The previous results pose an apparent contradiction: Popularity, unaware of the song context, performs comparably to RNN and, overall, slightly better than Playlist-CF, both aware of the full song context. Is it then important or not to exploit the song context? Furthermore, as discussed by Jannach and Ludewig [14], do marginal performance gains of RNN over Playlist-CF and Popularity justify its higher computational complexity?

To shed light on these two questions we deem it important to analyze the possible impact of the playlist datasets being biased toward popular songs, which is in fact a bias ubiquitous in the music consumption domain [5]. Within our study, we identify the popularity of a song with its frequency in each of the datasets, that is, with the number of playlists in which it occurs in each dataset. Table 2 and Fig. 4 show the song frequency distribution of the AotM-2011 and the 8tracks datasets. The AotM-2011 and the 8tracks datasets present a clear popularity bias, with a vast majority of songs occurring in few playlists and a few songs occurring in many playlists.

We consider again in Fig. 5 the performance of the four playlist models, but this time we distinguish whether the actual next songs in the test playlists were popular or not. We precisely define the popularity of a song as its relative frequency in the training playlists, as given by Eq. (1). The left panels report the median rank obtained when all the next-song predictions are considered. The central panels report the median rank obtained when the actual next songs belong to the 10% most popular songs in the datasets. The right panels report the median rank obtained when the actual next songs belong to the 90% least popular songs in the datasets (which we refer to as the “long tail”). In this particular case we report only the median rank instead of the whole rank distribution to obtain a more compact figure that facilitates the comparison of the playlist models across the different song-popularity levels.

The results in Fig. 5 show that Popularity performs outstandingly well on the most popular songs, but it makes poor predictions for songs in the long tail. This is the natural consequence of its very design (Sect. 4.1). Playlist-CF performs reasonably well on the most popular songs, and it shows a quick improvement as the song context grows. Its performance on long-tail songs is poorer, but it shows a slight improvement as it gains song context, until, given a context of at least 5 songs, it outperforms Popularity. The good performance of Playlist-CF on popular songs is not surprising because the scoring Eq. (3) favors songs occurring in many training playlists. However, the rather poor performance on long-tail songs is less expected, especially if we remember that our implementation of Playlist-CF considers all the training playlists as neighbors (Sect. 4.3), which should help to counteract the large amount of non-popular songs in the playlist datasets. Song-CF also performs better on popular songs than on long-tail songs, especially in the 8tracks dataset, where the popularity bias is stronger (Table 2, Fig. 4). RNN is competitive, and most importantly, in contrast to the other playlist models, its performance is largely unaffected by the popularity of the actual next songs in the test playlists.

Focusing on the performance of the playlist models on all next-song predictions (left panels in Fig. 5), Popularity seems comparable to the more sophisticated RNN. Given enough song context, Playlist-CF also seems to compete with RNN. However, as we have just seen, the overall strong performance of Popularity and Playlist-CF is the result of aggregating the accurate predictions made for a few popular songs (central panels in Fig. 5) with the rather poor predictions made for a vast majority of non-popular songs (right panels in Fig. 5). On the contrary, the performance of RNN is not affected by the song popularity. This observation must be taken into consideration to judge whether the higher computational complexity of the RNN model is justified, also considering the particular use case and target users of each recommender system. For example, the robustness of RNN to the popularity bias would be crucial to assist users interested in discovering long-tail music.

Summary of main observations:

– RNN exhibits a competitive performance which is not affected by the popularity of the actual next songs.
– Popularity, Song-CF and Playlist-CF exhibit a considerable performance gap depending on the popularity of the actual next songs.
– Despite its overall poor performance on non-popular songs, Playlist-CF can exploit the song context to eventually outperform Popularity.

Fig. 4 Unique songs in the AotM-2011 and the 8tracks datasets, sorted by frequency, i.e., by the number of playlists in which they occur. The colored dots correspond to songs located at specific percentile positions, with their absolute and percentile frequencies annotated (the latter in parentheses). Furthermore, examples of frequent and infrequent songs in the datasets are provided, with their absolute frequency annotated in parentheses

Fig. 5 Popularity bias experiments. Median rank attained by the actual next songs in the test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. Left: all songs are considered. Center: only the 10% most popular songs in the dataset are considered. Right: only the 90% least popular (long-tail) songs in the dataset are considered. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. The number of next-song predictions made at every position is annotated

6.3 Song order

RNN is the most complex of the four playlist models considered, and it is the only one aware of the song order. Furthermore, we have shown its good performance and robustness to dealing with infrequent music. We now investigate the importance of considering the song order by comparing the performance of RNN when it deals with original playlists, and when it deals with playlists where the song order has been manipulated. The rationale behind this experiment is the following: if the playlist datasets exhibit a consistent song order that RNN exploits to predict next-song recommendations, then we should observe a performance degradation when the song order is deliberately broken.

We devise three song order manipulation experiments. Firstly, we evaluate RNN on shuffled test playlists. This can be regarded as a weak check, because RNN could still have learned patterns based on the song order at training time. Secondly, we train another instance of the RNN model but using shuffled training playlists. We name it “$\mathrm{RNN}_{sh}$.” We re-tune its hyperparameters to make sure that the performance

Fig. 6 Song order experiments. Panels: (a) AotM-2011 dataset, (b) 8tracks dataset.
Distribution of ranks attained by the y-axis indicates the attained ranks, and its scale relates to the number actual next songs in the test playlists (closer to 1 is better) for the AotM- of songs in each dataset. The boxplots report the distribution of attained 2011 and the 8tracks datasets. The panels report the results of RNN ranks. Outliers are indicated with small horizontal marks. The num- and RNN evaluated on original and shuffled test playlists. The x-axis ber of next-song predictions made at every position is annotated in the sh indicates the position in the playlist for which a prediction is made. The boxplots is not compromised as a consequence of modifying the train- is actually able to capture order information. In any case, ing playlists, but eventually we keep the same configuration we already presume that such a definite song order behavior because others do not yield consistent improvements. Then, will not occur in real situations. We hypothesize with two we evaluate RNN on original test playlists. This is a strong possible sources of variation that may better respond to how sh check, because we now make sure that RNN cannot exploit natural playlists are organized: firstly, instead of a universal sh the song order at training time. For completeness, we also song order, there may exist several song orders correspond- evaluate RNN on shuffled test playlists. ing to, for example, different underlying music taste profiles; sh Figure 6 reports the rank distribution for each song order secondly, one or several orders may exist, but they may be randomization experiment. As a reference, we also include followed in a non-strict manner. We create three additional the performance of RNN evaluated on original test playlists. synthetic datasets according to these variations. We create The rank distributions are split by the position in the playlist a playlist dataset where the song order within playlists is for which the next-song prediction is made. 
As before, we strictly ruled by one of five arbitrary but fixed song orders, consider only predictions up to position 8, which represent with the same number of playlists following each of the five roughly the 90% of all the next-song predictions made in the orders. We refer to this dataset as “Five orders.” We further AotM-2011 and the 8tracks datasets (Sect. 6.1). Surprisingly, create noisy versions of “One order” and “Five orders” such the rank distributions are comparable across all song order that the song order within the playlists is followed but in randomization experiments, regardless of whether the song a non-strict manner. To achieve this, we copy the original order is original, broken at test time, broken at training time, datasets but replace a randomly chosen 30% of the songs of or broken both at training and at test time. This result provides each playlist by unordered, randomly sampled songs from an indication that the song order may not be an essential outside the playlist. We name the resulting datasets “One feature for next-song recommendations, and it would agree order—30% noise” and “Five orders—30% noise.” with similar findings derived from user experiments [17]. Each of the synthetic datasets has 15,000 playlists with Alternatively, even though RNN models are the state of the 7000 unique songs, and each playlist has a length of exactly art in many sequential tasks, this result could be explained 10 songs. These specific values are chosen so that the by the incapability of this specific RNN model to properly synthetic datasets have similar characteristics to a natural col- exploit the song order. lection like AotM-2011. The concept of artist does not exist, To further investigate this question, we create synthetic and the song orders are defined arbitrarily. Using more than playlist datasets where the song order is controlled. 
We five song orders, or a noise factor higher than 30%, yielded start creating a playlist dataset where the song order within very challenging datasets that were not as illustrative as the playlists is strictly ruled by an arbitrary but fixed universal created ones. The synthetic datasets are split into training song order. This dataset, which we name “One order,” will let and test playlists as described in Sect. 5. We then use them to us determine whether the considered RNN model (Sect. 4.4) conduct the song order randomization experiments described rank rank 110 International Journal of Multimedia Information Retrieval (2019) 8:101–113 Fig. 7 Randomized song order experiments with synthetic datasets. tion in the playlist for which a prediction is made. The y-axis indicates Distribution of ranks attained by the actual next songs in the synthetic the attained ranks, and its scale relates to the number of songs in each test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks dataset. The boxplots report the distribution of attained ranks. Outliers datasets. The panels report the results of RNN and RNN evaluated are indicated with small horizontal marks. The number of next-song sh on original and shuffled test playlists. The x-axis indicates the posi- predictions made at every position is always exactly 3000 at the beginning of this section. Figure 7 reports the rank dis- dict songs in the proximity of the current song, meaning songs tribution for each song order randomization experiment, for that are few positions before or after the current song within each synthetic playlist dataset. the universal song order. In other words, training the RNN We first analyze the results obtained on “One order” model on shuffled playlists works as a regularization that (Fig. 7a), which should let us determine whether the consid- favors learning song proximity rather than strict song order. ered RNN model is actually able to capture order information. 
We have found evidence that the playlists in the AotM- The performance of RNN on original test playlists is perfect, 2011 and the 8tracks datasets are not ruled by a strict, with all ranks equal to 1. This shows that the considered universal song order. In fact, since we train dedicated RNN model would be able to capture a universal song order instances of the RNN model for each dataset, we know that if there were one. Consequently, we can conclude that the the playlists are also not ruled by a strict, dataset-specific playlists in the AotM-2011 and the 8tracks datasets are not song order. However, this does not imply that the playlists strictly ordered; otherwise, RNN would have been able to are, on the other end, completely unordered. There may perfectly extend them. Again in Fig. 7a, precisely because exist intermediate song order rules, which in real situations RNN learned the song order strictly, its performance on shuf- may further be affected by different sources of uncertainty. fled test playlists is comparatively very poor. We now move We examine the results obtained on the remaining synthetic on to RNN . The performance of RNN is not perfect but datasets (Fig. 7b–d) as an approximation to the more com- sh sh very good on both original and shuffled test playlists. This plex song order patterns that may underlie the playlists of the suggests that RNN follows a different strategy than RNN. AotM-2011 and the 8tracks datasets. sh Since RNN is trained on shuffled playlists, the strict song Figure 7b, c shows an overall, noticeable performance sh order is not anymore enforced. Instead, RNN learns to pre- degradation compared to Fig. 7a. 
Still, RNN evaluated on sh 123 International Journal of Multimedia Information Retrieval (2019) 8:101–113 111 original test playlists performs almost perfectly for “Five 7 Conclusion orders” and competitively for “One order—30% noise.” The performance degrades strongly when RNN is evaluated on We have explicitly investigated the impact of the song order, shuffled test playlists, which suggests that it had indeed cap- the song context and the popularity bias in music playlists for tured song order patterns. RNN again shows that training the task of predicting next-song recommendations. We have sh on shuffled playlists provides a regularization effect, because conducted dedicated off-line experiments on two datasets its performance on original and shuffled test playlists is of hand-curated music playlists comparing the following comparable. Figure 7d reports the results on the most chal- playlist models: a popularity-based model, a song-based CF lenging of the synthetic datasets and exhibits a generalized, model, a playlist-based CF model, and an RNN model. These clear performance degradation. While we cannot derive fur- models are well established and widely used and exploit the ther conclusions regarding the nature of the playlists in the song context and the song order to different extents. Our AotM-2011 and the 8tracks datasets, the comparison between results indicate that the playlist-based CF model and the RNN Figs. 6 and 7 suggests that real playlists may indeed be sub- model, which can consider the full song context, do benefit ject to complex, noisy song order rules. from increasingly longer song contexts. However, we observe We know that the playlists in the AotM-2011 and the that a longer song context does not necessarily translate into 8tracks datasets do not follow simple song order patterns, outperforming the simpler popularity-based model, which is and thus they could be either completely unordered or ruled unaware of the song context. 
This is explained by the popular- by complex, noisy song order rules. In both cases, the song ity bias in the datasets, i.e., the coexistence of few, popular order experiments on the natural datasets show that RNN songs with many, non-popular songs. Failing to take into performs equivalently to RNN (Fig. 6), which could be the account the popularity bias masks important performance sh result of RNN adopting the same strategy of RNN , that is, differences: the popularity-based model, the song-based CF sh relying on song proximity patterns rather than on complex or model and the playlist-based CF model exhibit considerable inexistent song order patterns. Since the AotM-2011 and the differences in performance depending on the popularity of 8tracks datasets do not have a strict song order, the concept of the actual next songs in the test playlists. On the contrary, the proximity could be understood as fitness, meaning that RNN more complex RNN model has a stable performance regard- and RNN could be predicting next songs that fit well the less of the song popularity. This effect must be taken into sh playlist being extended. This could also explain why RNN account in the design of playlist models for specific use cases and RNN improve their performance as the song context and target users. The RNN model is the only of the consid- sh grows, namely because a longer song context better specifies ered playlist models aware of the song order. We have found the playlist under consideration. A longer song context would that its performance on original and shuffled playlists is com- not necessarily translate into performance improvements if parable, suggesting either that the song order is not crucial RNN could rely on clear song order patterns, in which case for next-song recommendations, or that the RNN model is knowing a single song would suffice to accurately predict the unable to fully exploit it. 
We have further investigated this next one (see the performance of RNN evaluated on original question by evaluating the RNN model on synthetic datasets playlists in Fig. 7a, c). with controlled song orders. We have found that the RNN model is able to capture a universal song order if there is one. This implies that the natural playlists datasets considered do not follow a strict song order, although they might be ruled by complex, noisy song order rules. Finally, and regarding Summary of main observations: the evaluation methodology, we have proposed an approach – RNN achieves comparable performance on original and to assess the quality of the recommendations that observes on shuffled test playlists. It also compares to RNN , sh the complete recommendation lists instead of focusing on the which is completely unaware of the song order. top K recommendations. Doing so provides a more complete – Experiments on strictly ordered synthetic datasets show view on the performance of the playlist models. that RNN can learn song order patterns. Acknowledgements We thank Bruce Ferwerda, Rainer Kelz, Rocío – From the previous, we conclude that the AoM-2011 and del Río Lorenzo, and David Sears for their valuable feedback. This the 8tracks datasets are not strictly ordered. research has received funding from the European Research Council – Further experiments on synthetic datasets suggest that (ERC) under the European Union’s Horizon 2020 research and innova- the AotM-2011 and the 8tracks datasets might be ruled tion programme under grant agreement No 670035 (Con Espressione). Open access funding provided by Johannes Kepler University Linz. by complex, noisy song order rules. – When RNN relies on song fitness patterns rather than on Open Access This article is distributed under the terms of the Creative song order patterns, it benefits from longer song contexts, Commons Attribution 4.0 International License (http://creativecomm which better identify the playlist being extended. 
ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, 123 112 International Journal of Multimedia Information Retrieval (2019) 8:101–113 and reproduction in any medium, provided you give appropriate credit final configurations use it. We tune the number of units, the to the original author(s) and the source, provide a link to the Creative learning rate, the batch size, the amount of momentum, the Commons license, and indicate if changes were made. L2-regularization weight and the dropout probability on a withheld validation set, by running 100 random search exper- iments [2] for each of the loss functions mentioned above. A Model configurations The final model configuration is chosen according to the val- idation recall at 100 (i.e., the proportion of times that the A.1 Song popularity actual next song is included within the top 100 ranked can- didates), which we consider a proxy of the model’s ability to This model computes the popularity of all the unique songs rank the actual next songs in top positions. The number of in the dataset, that is, 7032 songs for the AotM-2011 dataset training epochs is chosen on the basis of the validation loss. and 15,649 songs for the 8tracks dataset. For the AotM-2011 dataset, the final model uses the TOP-1 loss and it has 200 hidden units. It is trained on mini-batches A.2 Song-based collaborative filtering of 16 playlists, with a learning rate of 0.01, a momentum coefficient of 0.5 and an L2-regularization weight of 0.1. This model computes pairwise similarities for all the unique For the 8tracks dataset, the final model uses the TOP-1 loss songs in the dataset, that is, 7032 songs for the AotM-2011 and it has 200 hidden units. It is trained on mini-batches dataset and 15,649 songs for the 8tracks dataset. of 64 playlists, with a learning rate of 0.025, a momentum coefficient of 0.3 and an L2-regularization weight of 0.02. 
A.3 Playlist-based collaborative filtering For both datasets, the hyperparameters and architecture of the RNN models trained on shuffled and reversed playlists This model computes the similarity of each test playlist to all were re-tuned, but since other configurations did not yield the training playlists in the dataset, that is, 13,744 playlists for clearly better results, we decided to use the same settings for the AotM-2011 dataset and 61,416 playlists for the 8tracks consistency. dataset. We also experimented using 100, 500 and 1000 train- ing playlists but did not achieve better results. B Additional experiments A.4 Recurrent neural networks We conduct, for completeness, a related set of experiments We experiment with different loss functions, namely cate- consisting in reversing the song order instead of random- gorical cross-entropy, Bayesian pairwise ranking (BPR) [26] izing it. A similar experiment was proposed by Chen et and TOP-1 [12]. The RNN is optimized using AdaGrad [9] al. [6] to investigate the importance of the “directional- with momentum and L2-regularization. We also experiment ity” in next-song recommendations. Chen et al. found only with dropout [30] in the recurrent layer, but none of the small performance differences evaluating the Latent Markov Fig. 8 Reversed song order experiments. Distribution of ranks attained which a prediction is made. The y-axis indicates the attained ranks, by the actual next songs in the test playlists (closer to 1 is better) for the and its scale relates to the number of songs in each dataset. The box- AotM-2011 and the 8tracks datasets. The panels include the predictions plots report the distribution of attained ranks. Outliers are indicated with of the RNN on the original playlists and on the different reversed song small horizontal marks. The number of next-song predictions made at order experiments. 
The x-axis indicates the position in the playlist for every position is annotated in the boxplots 123 International Journal of Multimedia Information Retrieval (2019) 8:101–113 113 Embedding model on original and reversed playlists. We 15. Jannach D, Lerche L, Kamehkhosh I (2015) Beyond “hitting the hits”: generating coherent music playlist continuations with the replicate the different settings from our previous experi- right tracks. In: Proceedings of RecSys, pp 187–194 ments: we first train the RNN model on original playlists 16. Jansson A, Raffel C, Weyde T (2015) This is my jam data dump. and evaluate it on reversed playlists. Then, we train the In: Proceedings of ISMIR RNN model on reversed playlists and evaluate it on origi- 17. Kamehkhosh I, Jannach D, Bonnin G (2018) How automated rec- ommendations affect the playlist creation behavior of users. In: nal playlists. Finally, we train and evaluate the RNN model Joint proceedings of IUI workshops, Tokyo, Japan on reversed playlists. Figure 8 reports the rank distributions 18. Knees P, Pohle T, Schedl M, Widmer G (2006) Combining audio- under each reversed song order experiment. As expected, the based similarity with web-based data to accelerate automatic music results are comparable to those reported in Fig. 6. That is, the playlist generation. In: Proceedings of international workshop on multimedia IR, pp 147–154 distribution of ranks is comparable for all the reversed song 19. Lee JH, Bare B, Meek G (2011) How similar is too similar? order experiments. Exploring users’ perceptions of similarity in playlist evaluation. In: Proceedings of ISMIR, pp 109–114 20. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. arXiv preprint References arXiv:1506.00019 21. Logan B (2002) Content-based playlist generation: exploratory 1. Aizenberg N, Koren Y, Somekh O (2012) Build your own music experiments. 
In: Proceedings of ISMIR recommender by modeling internet radio streams. In: Proceedings 22. McFee B, Lanckriet GR (2011) The natural language of playlists. of WWW, pp 1–10 In: Proceedings of ISMIR, pp 537–542 2. Bergstra J, Bengio Y (2012) Random search for hyper-parameter 23. McFee B, Lanckriet GR (2012) Hypergraph models of playlist optimization. J Mach Learn Res 13(1):281–305 dialects. In: Proceedings of ISMIR, pp 343–348 3. Bertin-Mahieux T, Ellis DP, Whitman B, Lamere P (2011) The 24. Platt JC, Burges CJ, Swenson S, Weare C, Zheng A (2002) Learn- million song dataset. In: Proceedings of ISMIR, pp 591–596 ing a Gaussian process prior for automatically generating music 4. Bonnin G, Jannach D (2014) Automated generation of music playlists. In: Proceedings of NIPS, pp 1425–1432 playlists: survey and experiments. ACM Comput Surv 47(2):1–35 25. Pohle T, Pampalk E, Widmer G (2005) Generating similarity-based 5. Celma O (2010) Music recommendation and discovery. Springer, playlists using traveling salesman algorithms. In: Proceedings of Berlin DAFx, pp 220–225 6. Chen S, Moore JL, Turnbull D, Joachims T (2012) Playlist pre- 26. Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L (2009) diction via metric embedding. In: Proceedings of SIGKDD, pp BPR: Bayesian personalized ranking from implicit feedback. In: 714–722 Proceedings of UAI, pp 452–461 7. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares 27. Rendle S, Freudenthaler C, Schmidt-Thieme L (2010) Factorizing F, Schwenk H, Bengio Y (2014) Learning phrase representations personalized Markov chains for next-basket recommendation. In: using RNN encoder–decoder for statistical machine translation. Proceedings of WWW, pp 811–820 arXiv preprint arXiv:1406.1078 28. Ricci F, Rokach L, Shapira B (2015) Recommender systems hand- 8. Cunningham SJ, Bainbridge D, Falconer A (2006) “More of an art book, 2nd edn. Springer, New York than a science”: supporting the creation of playlists and mixes. In: 29. 
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based col- Proceedings of ISMIR laborative filtering recommendation algorithms. In: Proceedings of 9. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods WWW, pp 285–295 for online learning and stochastic optimization. J Mach Learn Res 30. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov 12:2121–2159 R (2014) Dropout: a simple way to prevent neural networks from 10. Flexer A, Schnitzer D, Gasser M, Widmer G (2008) Playlist gen- overfitting. J Mach Learn Res 15(1):1929–1958 eration using start and end songs. In: Proceedings of ISMIR, pp 31. Tan YK, Xu X, Liu Y (2016) Improved recurrent neural net- 173–178 works for session-based recommendations. In: Proceedings of 11. Hariri N, Mobasher B, Burke R (2012) Context-aware music DLRS@RecSys, pp 17–22 recommendation based on latent topic sequential patterns. In: Pro- 32. Vall A, Schedl M, Widmer G, Quadrana M, Cremonesi P (2017) ceedings of RecSys, pp 131–138 The importance of song context in music playlists. In: RecSys 2017 12. Hidasi B, Karatzoglou A, Baltrunas L, Tikk D (2016) Session- poster proceedings, Como, Italy based recommendations with recurrent neural networks. In: Pro- 33. Vall A, Quadrana M, Schedl M, Widmer G (2018) The importance ceedings of ICLR of song context and song order in automated music playlist gener- 13. Hidasi B, Quadrana M, Karatzoglou A, Tikk D (2016) Parallel ation. In: Proceedings of ICMPC-ESCOM, Graz, Austria recurrent neural network architectures for feature-rich session- based recommendations. In: Proceedings of RecSys, pp 241–248 14. Jannach D, Ludewig M (2017) When recurrent neural networks Publisher’s Note Springer Nature remains neutral with regard to juris- meet the neighborhood for session-based recommendation. In: Pro- dictional claims in published maps and institutional affiliations. 
ceedings of RecSys, pp 306–310 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png International Journal of Multimedia Information Retrieval Springer Journals
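The popularity bias experiments (Fig. 5) rank the actual next song within the complete candidate list and report median ranks separately for popular and long-tail songs. The following is a minimal sketch of that protocol for a popularity-based model only, not the authors' implementation. All function names are hypothetical, ties are broken optimistically, and the 10% head is taken from the training frequencies; these are assumptions made for illustration. The `recall_at` helper mirrors the "recall at 100" criterion mentioned in Appendix A.4.

```python
from collections import Counter
from statistics import median


def song_frequencies(playlists):
    """Number of playlists in which each song occurs."""
    counts = Counter()
    for playlist in playlists:
        counts.update(set(playlist))  # count a song once per playlist
    return counts


def rank_of(target, scores):
    """1-based rank of `target` in the complete candidate list.

    Assumption: ties are broken optimistically, i.e. only songs with a
    strictly higher score are placed ahead of the target.
    """
    target_score = scores.get(target, 0)
    return 1 + sum(1 for score in scores.values() if score > target_score)


def recall_at(ranks, k):
    """Share of predictions whose actual next song is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def median_ranks_by_popularity(train, test, head_share=0.10):
    """Median rank of a popularity model, split by target popularity."""
    freq = song_frequencies(train)
    by_popularity = sorted(freq, key=lambda song: (-freq[song], song))
    n_head = max(1, round(head_share * len(by_popularity)))
    head = set(by_popularity[:n_head])  # most popular songs (assumed split)

    ranks = {"all": [], "head": [], "tail": []}
    for playlist in test:
        # Predict every song after the first one. The popularity model
        # ignores the song context, so only the target matters here.
        for target in playlist[1:]:
            rank = rank_of(target, freq)
            ranks["all"].append(rank)
            ranks["head" if target in head else "tail"].append(rank)
    return {group: median(values) for group, values in ranks.items() if values}


if __name__ == "__main__":
    toy_train = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
    toy_test = [["b", "a", "c"]]
    print(median_ranks_by_popularity(toy_train, toy_test))
```

Reporting the three groups side by side makes visible the effect discussed in the text: an aggregate median can look strong while the long-tail median is poor. A context-aware model could be plugged in by replacing `freq` with scores computed from the songs preceding each target.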

Order, context and popularity bias in next-song recommendations

Publisher: Springer Journals
Copyright: Copyright © 2019 by The Author(s)
Subject: Computer Science; Multimedia Information Systems; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Data Mining and Knowledge Discovery; Image Processing and Computer Vision; Database Management
ISSN: 2192-6611
eISSN: 2192-662X
DOI: 10.1007/s13735-019-00169-8

Abstract

The availability of increasingly larger multimedia collections has fostered extensive research in recommender systems. Instead of capturing general user preferences, the task of next-item recommendation focuses on revealing specific session preferences encoded in the most recent user interactions. This study focuses on the music domain, particularly on the task of music playlist continuation, a paradigmatic case of next-item recommendation. While the accuracy achieved in next-song recommendations is important, in this work we shift our focus toward a deeper understanding of fundamental playlist characteristics, namely the song order, the song context and the song popularity, and their relation to the recommendation of playlist continuations. We also propose an approach to assess the quality of the recommendations that mitigates known problems of off-line experiments for music recommender systems. Our results indicate that knowing a longer song context has a positive impact on next-song recommendations. We find that the long-tailed nature of the playlist datasets makes simple and highly expressive playlist models appear to perform comparably, but further analysis reveals the advantage of using highly expressive models. Finally, our experiments suggest that the song order is not crucial to accurately predict next-song recommendations.

Keywords Music recommender systems · Music playlist continuation · Sequential recommendation · Collaborative filtering · Recurrent neural networks

Andreu Vall, andreu.vall@jku.at
Massimo Quadrana, mquadrana@pandora.com
Markus Schedl, markus.schedl@jku.at
Gerhard Widmer, gerhard.widmer@jku.at

Institute of Computational Perception, Johannes Kepler University Linz, Linz, Austria
Austrian Research Institute for Artificial Intelligence, Vienna, Austria
Pandora Media Inc., Oakland, CA, USA

1 Introduction

Automated music playlist continuation is a specific task in music recommender systems where the user sequentially receives song recommendations, producing a listening experience similar to traditional radio broadcasting. Sequential recommendation scenarios are in fact very natural in the music domain. This is possibly explained by the short time required to listen to a song, which results in listening sessions typically including not one, but several songs.

According to interviews with practitioners and postings to a dedicated playlist-sharing website, Cunningham et al. [8] identified the choice of songs and the song order as important aspects of the playlist curation process. As we review in Sect. 2, some approaches to automated music playlist continuation take into account the current and previous songs in the playlist and the order of the songs in the playlist to recommend the next song. However, to the best of our knowledge, previous works do not explicitly analyze the impact of exploiting this information for next-song recommendations. We refer to the current and previous songs in a playlist as the “song context” available to the recommender system when it predicts the next song. This terminology is borrowed from language models and should not be confused with the incorporation of user’s contextual information into the recommender system.

In this work, we compare four well established and widely used playlist models: a popularity-based model, a song-based collaborative filtering (CF) model, a playlist-based CF model and a model based on recurrent neural networks (RNNs). These playlist models are of increasing complexity and, by design, are able to exploit the song context and the song order to different extents. By analyzing and comparing their performance on different playlist continuation off-line experiments, we derive insights regarding the impact that the song context, the song order and the bias toward popular music have on next-song recommendations. For the evaluation of the off-line experiments, we propose to use metrics derived from complete recommendation lists, instead of from the top K positions of recommendation lists. This provides a more complete view on the performance of the playlist models.

The remainder of this paper is organized as follows. Section 2 reviews the related work on automated music playlist continuation. Section 3 introduces the guidelines for the off-line experiments conducted throughout this work. We describe the recommendation task that the playlist models must fulfill and define the metrics employed to assess their performance on the task. Section 4 describes the four playlist models considered. Section 5 presents the datasets of hand-curated music playlists on which we conduct the off-line experiments. Section 6 elaborates on the results of the off-line experiments and is divided into three parts, which discuss the impact of the song context, the popularity bias and the song order on next-song recommendations, respectively. Conclusions are drawn in Sect. 7.

2 Related work

A well-researched approach to automated music playlist continuation relies on the song content. Pairwise song similarities are computed on the basis of features extracted from the audio signal (possibly enriched with social tags and metadata) and used to enforce content-wise smooth transitions [10,18,21,22,25]. Recommendations based on content similarity are expected to yield coherent playlists. However, pure content-based recommendations cannot capture complex relations and, in fact, it does not hold in general that the songs in a playlist should all sound similar [19].

Playlist continuation has also been regarded as a form of collaborative filtering (CF), making the analogy that playlists are equivalent to user listening histories on the basis of which songs should be recommended. Playlist-based nearest-neighbors CF models and factorization-based CF models exploit the full song context when making next-song recommendations [1,4,11]. Song-based nearest-neighbors CF models [29] are not common in the playlist continuation literature. However, Hidasi et al. [12] show in the domains of e-commerce and video streaming that an item-based CF model that predicts the next item on the basis of only the current item can effectively deal with short histories. In general, CF models disregard the song order, but it is worth noting that the model presented by Aizenberg et al.

The Latent Markov Embedding introduced by Chen et al. [6] models playlists as Markov chains. It projects songs into a Euclidean space such that the distance between two projected songs represents their transition probability. The importance of the direction of song transitions is evaluated by testing a model on actual playlists and on playlists with reversed transitions, yielding comparable performance in both cases. McFee and Lanckriet [23] also treat playlists as Markov chains, modeled as random walks on song hypergraphs, where the edges are derived from multimodal song features, and the weights are learned from hand-curated music playlists. The importance of modeling song transitions is assessed by learning the hypergraph weights again but treating the playlists as a collection of song singletons. When song transitions are ignored, the performance degrades. These works examine the importance of accounting for song transitions and their order, but the Markovian assumption implies that only adjacent songs are considered.

Hariri et al. [11] represent songs by latent topics extracted from song-level social tags. Sequential pattern mining is performed at the topic level, so that given seed songs, a next topic can be predicted. Re-ranking the results of a CF model with the predicted latent topics is found to outperform the plain CF model. This approach considers the ordering but only at the topic level, which is more abstract than the song level.

Hidasi et al. [12] propose for e-commerce and video streaming an approach to sequential recommendation based on the combination of RNNs with ranking-aware loss functions. This approach has gained attention and has been further improved and extended [13,31]. Jannach and Ludewig [14] have applied it to the task of automated music playlist continuation in a study that compares the performance of RNN models and session-based nearest-neighbors models for sequential recommendation. Among other analyses, Jannach and Ludewig question whether the computational complexity of RNN models is justified. Recommendation models based on RNNs consider the full item context in sequences and are also aware of their order.

For a comprehensive survey on automated music playlist continuation, we point the interested reader to Bonnin and Jannach [4] and Ricci et al. [28, chap. 13].

We conducted preliminary studies preceding this work analyzing the importance of the song order and the song context in next-song recommendations [32,33]. This paper further extends these works by incorporating a detailed discussion of the proposed evaluation methodology, an additional playlist model (namely a playlist-based CF model), an analysis of the impact of the popularity bias of music collections in next-song recommendations and more conclusive experiments to determine the importance of the song order. We also provide the full configurations and training details
for the playlist models.

[1] accounts for neighboring songs, and the model introduced by Rendle et al. [27] for on-line shopping is aware of sequential behavior.

3 Evaluation methodology

We propose an evaluation methodology based on the ability of playlist models to retrieve withheld playlist continuations. We conduct the next-item recommendation experiment proposed by Hidasi et al. [12] and then propose approaches to interpreting the obtained results and to comparing the performance of different playlist models.

We are aware that off-line evaluation approaches are approximations of the actual recommendation task, and they may not be able to fully estimate the final user satisfaction. However, the aim of this work is to understand the importance of the song context, the song order and the bias toward popular music on next-song recommendations. In this sense, the proposed off-line evaluation methodology serves this purpose well, because it allows the systematic comparison of different playlist models under controlled conditions.

3.1 Next-song recommendation experiment

A collection of music playlists is split into training and test playlists. A trained playlist model is evaluated by repeating the following procedure over all the test playlists, which, for clarity, we describe alongside the example depicted in Fig. 1. We consider a test playlist (e.g., $p = (s_3, s_5, s_2)$). In the first step, we show the model the first song in the playlist ($s_3$). The model ranks all the songs in the dataset according to their likelihood to be the second song in the playlist. We keep track of the rank attained by the actual second song in the playlist ($s_5$ attains rank 3). We also keep track of the fact that this is a prediction for a song in second position. In the second step, we show the model the first and the second actual songs in the playlist ($s_3$, $s_5$). The model ranks all the songs in the dataset according to their likelihood to be the third song in the playlist. We keep track of the rank attained by the actual third song in the playlist ($s_2$ attains rank 1), etc. In this way, we progress until the end of the playlist, always keeping track of the rank attained by the actual next song in the playlist and the position in the playlist for which the prediction is made.

Fig. 1 Illustration of the evaluation methodology. The playlist model is evaluated on the test playlist $p = (s_3, s_5, s_2)$. It progresses through $p$ and ranks all the songs in the dataset according to their likelihood to be the next song. The actual second song, $s_5$, attains rank 3. The actual third song, $s_2$, attains rank 1

We index the ordered list of next-song candidates from 1 (most likely) to N (least likely), where N is the number of unique songs in the dataset. A good playlist model is expected to rank the actual next song in top positions (rank values close to 1). On the other hand, a poor model would rank the actual next song in bottom positions (large rank values). A random model would, on average, rank the actual next song in positions around N/2.

3.2 Assessing the quality of the recommendations

Previous research in automated music playlist continuation has summarized the distribution of attained ranks using metrics derived from the top K positions of the ordered lists of next-song candidates. For example, the recall at K (also named “hit rate” at K) is defined as the proportion of times that the actual next songs in the test playlists attain a rank lower than K [4,11,14,15]. The rationale behind fixing the length K, typically to a small value, is that, in practice, only the top K results are of interest to the end user of the recommender system. We claim that this approach has two important limitations: (1) values of K that are reasonable for on-line systems (with actual end users) are not necessarily reasonable for off-line evaluation (without end users); (2) arbitrarily fixing a value of K provides partial and potentially misleading information.

We first discuss the first limitation. A playlist may be extended by a number of potentially relevant songs, but off-line experiments only accept the exact match to the actual next song. The rank attained by the actual next song can be overly pessimistic, because the ordered list of next-song candidates can actually contain relevant results in better positions [22,24]. Therefore, the results of off-line evaluation approaches need to be understood as approximations of the expected performance of the playlist models. They cannot be interpreted in absolute terms but as a means to compare the relative performance of different models. In particular, values of K meaningful for on-line systems should not be literally transferred to off-line experiments.

We now address the second limitation. Even though the playlist models and the datasets will be presented in Sects. 4 and 5, respectively, we advance here some results for the sake of illustration. Figure 2 shows the recall curves of several playlist models for values of K ranging from 1 to the maximum number of song candidates. If we chose to focus on a fixed value of K, this would correspond to only observing a one-point cut of these recall curves, which is very partial information. Furthermore, as we have discussed, choosing a specific value of K can become arbitrary in off-line experiments, where the user feedback is missing. Finally, the information provided by a fixed value of K is potentially misleading because the recall curves of different playlist models cross each other at different values of K. That is, the best performing playlist model would depend on the chosen value of K.

[Figure 2: panels (a) and (b)]
Fig. 2 Recall curves for values of K ranging from 1 to the maximum number of songs in each playlist dataset. The circles indicate the length K where each playlist model achieves a recall at K of 50% and correspond to the median rank achieved by each model. The results on the top 10 positions are detailed in the boxes, where dots are superimposed only as a reminder of the discrete nature of the displayed values (the lines just connect the different recall values)

For these reasons, we propose to assess playlist models by examining their whole lists of ordered next-song candidates, as opposed to focusing only on an arbitrary number of top K positions in the lists. This provides a more complete view of the performance of the playlist models. Even though the complete recall curves displayed in Fig. 2 are informative, we instead propose to directly compare the whole distribution of ranks attained by each playlist model. We report the distribution of attained ranks by means of boxplots that represent the minimum, first quartile, median, third quartile, and maximum rank values (see, e.g., Fig. 3). Alternatively, we report only the median rank value if this facilitates the interpretation of the results (Fig. 5).

Table 1 Summary of the playlist models

Playlist model   Context length   Order awareness
Popularity       0                ✗
Song-CF          1                ✗
Playlist-CF      n                ✗
RNN              n                ✓

The context length is the number of songs considered by the model to predict the next song (n means all the songs shown to the model). Order awareness indicates if the model regards the order of songs in playlists

where $P_{tr}$ is the set of training playlists, $P_{tr}(s)$ is the subset of training playlists that contain the song $s$, and $|\cdot|$ denotes the number of playlists in each set. Given a test playlist, the next-song candidates are ranked by their popularity, disregarding the previous songs and their order. Despite its simplicity, the popularity-based model is a competitive playlist model [4,6].
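As a rough illustration of the procedure in Sect. 3.1 and of the rank summaries discussed in Sect. 3.2, the following Python sketch walks through a test playlist, records the rank attained by each actual next song, and then summarizes the ranks. It is a minimal sketch under stated assumptions, not the authors' code: the helper names (`popularity_scores`, `attained_ranks`, `recall_at_k`), the toy playlists and the tie-breaking rule are all hypothetical, the popularity model is used only as a stand-in for any playlist model, and recall at K is computed here as the share of ranks that do not exceed K.

```python
from collections import Counter
from statistics import median

def popularity_scores(train_playlists):
    """Relative frequency of each song in the training playlists (unigram model)."""
    counts = Counter(song for pl in train_playlists for song in set(pl))
    n_playlists = len(train_playlists)
    return {song: c / n_playlists for song, c in counts.items()}

def attained_ranks(score_fn, test_playlist, songs):
    """Yield (position, rank) for every next-song prediction in one test playlist.

    score_fn(context) returns a dict song -> score, where context is the list
    of songs shown to the model so far. Ranks start at 1 (most likely).
    """
    for i in range(1, len(test_playlist)):
        scores = score_fn(test_playlist[:i])
        # Assumption: ties are broken by song id so the ranking is deterministic.
        ranking = sorted(songs, key=lambda s: (-scores.get(s, 0.0), s))
        yield i + 1, ranking.index(test_playlist[i]) + 1

def recall_at_k(ranks, k):
    """Share of predictions whose actual next song is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Toy data, purely illustrative.
train = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
test = ["c", "a", "b"]
songs = sorted({s for pl in train for s in pl})

pop = popularity_scores(train)
results = list(attained_ranks(lambda context: pop, test, songs))
ranks = [rank for _, rank in results]
print(results, recall_at_k(ranks, 1), median(ranks))
```

Keeping the position next to each rank makes it possible to split the rank distribution by playlist position, and reporting the median (or the full distribution) avoids committing to a single value of K.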
4 Playlist models

We describe the four playlist models considered in our experiments. By design, the models are of increasing complexity and are able to exploit the song context and the song order to different extents (Table 1). Hyperparameter tuning, if necessary, is performed on validation playlists withheld from the training playlists.

4.1 Song popularity (“Popularity”)

This is a unigram model that computes the popularity of a song $s$ according to its relative frequency in the training playlists, i.e.,

$$\mathrm{pop}(s) = \frac{|P_{tr}(s)|}{|P_{tr}|}, \qquad (1)$$

4.2 Song-based collaborative filtering (“Song-CF”)

This is a CF model based on song-to-song similarities. A song $s$ is represented by a binary vector $p_s$ that indicates the training playlists to which it belongs. The similarity of a pair of songs $s$, $t$ is computed as the cosine between $p_s$ and $p_t$, i.e.,

$$\mathrm{sim}(s, t) = \cos(p_s, p_t) = \frac{p_s \cdot p_t}{\|p_s\| \, \|p_t\|}.$$

Two songs are similar if they co-occur in training playlists, regardless of the positions they occupy in the playlists. We follow Hidasi et al. [12] and implement the song-based CF model such that next-song candidates are ranked according to their similarity only to the current song in the playlist, ignoring previous songs. This approach is relatively simple, but Hidasi et al. show its competitive performance for sequential recommendation on short sessions.

4.3 Playlist-based collaborative filtering (“Playlist-CF”)

This is a CF model based on playlist-to-playlist similarities. A playlist $p$ is represented by a binary vector $s_p$ indicating the songs that it includes. The similarity of a pair of playlists $p$, $q$ is computed as the cosine between $s_p$ and $s_q$, i.e.,

$$\mathrm{sim}(p, q) = \cos(s_p, s_q) = \frac{s_p \cdot s_q}{\|s_p\| \, \|s_q\|}. \qquad (2)$$

The score assigned to a song $s$ as a candidate to extend a test playlist $p$ is computed as

$$\mathrm{score}(s, p) = \sum_{q \in P_{tr}(s)} \mathrm{sim}(p, q), \qquad (3)$$

where $P_{tr}(s)$ is the subset of training playlists that contain the song $s$. This model considers a song to be a suitable continuation for playlist $p$ if it has occurred in training playlists that are similar to $p$. The similarity of a playlist pair (Eq. 2) and the score assigned to a candidate song to extend a playlist (Eq. 3) depend on the full playlist $p$, i.e., on the full song context, but they disregard the song order.

5 Datasets

We evaluate the four playlist models on two datasets of hand-curated music playlists derived from the on-line playlist-sharing platforms “Art of the Mix”² and “8tracks.”³ Both platforms allow music aficionados to publish their playlists on-line. Moreover, the Art of the Mix platform hosted forums and blogs for discussion about playlist curation, as well as social functionalities such as favoriting, or providing direct feedback to a user. The 8tracks platform also provides social functionalities, such as following users, liking, or commenting on specific playlists. Previous works in the automated music playlist continuation literature have chosen to work with collections derived from the Art of the Mix and the 8tracks databases because of their presumably careful curation process [4,11,15,22,23]. As an illustration of the users’ engagement, we refer the interested reader to the study presented by Cunningham et al. [8], that analyzes posts to the Art of the Mix forums requesting advice on, for example, the choice of songs, or song ordering rules.

The “AotM-2011” dataset [23] is a publicly available playlist collection derived from the Art of the Mix database. Each playlist is represented by song titles and artist names, linked to the corresponding identifiers of the Million Song Dataset (MSD) [3], where available. The “8tracks” dataset is a private playlists collection derived from 8tracks.
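The two collaborative filtering scores of Sects. 4.2 and 4.3 can be sketched with plain Python sets, because the cosine between two binary vectors equals the size of their intersection divided by the square root of the product of their sizes. This is a minimal sketch, not the authors' implementation: the function names and the toy playlists are hypothetical, all training playlists are used as neighbors, and songs already in the context are not filtered out of the candidate list.

```python
from math import sqrt

def cosine(a, b):
    """Cosine between two binary vectors given as sets of their non-zero indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def song_cf_scores(train, current_song):
    """Song-CF: similarity of every song to the current song only."""
    membership = {}  # song -> ids of the training playlists containing it (vector p_s)
    for playlist_id, playlist in enumerate(train):
        for song in playlist:
            membership.setdefault(song, set()).add(playlist_id)
    current = membership.get(current_song, set())
    return {song: cosine(current, ids) for song, ids in membership.items()}

def playlist_cf_scores(train, context):
    """Playlist-CF: for each song, sum sim(p, q) over training playlists q containing it."""
    p = set(context)  # the song order within the context is discarded
    scores = {}
    for q in train:
        similarity = cosine(p, set(q))  # playlist-to-playlist cosine (Eq. 2)
        for song in set(q):
            scores[song] = scores.get(song, 0.0) + similarity  # running sum of Eq. 3
    return scores

# Toy data, purely illustrative.
train = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
print(song_cf_scores(train, "a"))
print(playlist_cf_scores(train, ["a", "b"]))
```

Note that `song_cf_scores` looks only at the current song, whereas `playlist_cf_scores` uses every song shown so far, which mirrors the context lengths of 1 and n listed in Table 1.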
Playlist-based CF has proven to be a competitive playlist model [4,11,14,15]. It usually has an additional parameter defining the number of most similar training playlists on which Eq. 3 is calculated. We use all the training playlists because we find that this yields the best performance in our experiments (Appendix A.3).

4.4 Recurrent neural networks (“RNN”)

Recurrent neural networks are a class of neural network models particularly suited to learn from sequential data. They have a hidden state that accounts for the input at each time step while recurrently incorporating information from previous hidden states. We point the interested reader to Lipton et al. [20] for a review of RNN models.

We adopt the approach and implementation proposed by Hidasi et al. [12], where an RNN model with one layer of gated recurrent units (GRU) [7] is combined with a loss function designed to optimize the ranking of next-item recommendations. The model hyperparameters and architecture are detailed in Appendix A.4.

Given a test playlist, the RNN model considers the full song context and the song order and outputs a vector of song scores used to rank the next-song candidates.

Each playlist is represented by song titles and artist names. Since we find multiple spellings for the same song–artist pairs, we use fuzzy string matching to resolve the song titles and artist names against the MSD, adapting the code released by Jansson et al. [16] for a very similar task.

We use the MSD as a common name space to correctly identify song–artist pairs. In both datasets, the songs that could not be resolved against the MSD are discarded, with one of two possible approaches. The first approach consists in simply removing the non-matched songs. The original playlists are preserved but with skips within them, which we ignore. The second approach consists in breaking up the original playlists into segments of consecutive matched songs, yielding shorter playlists without skips. We show results obtained on playlists derived from the first approach, but experiments on playlists derived from the second approach yielded equivalent conclusions.

We keep only the playlists with at least 3 unique artists and with a maximum of 2 songs per artist. This is to discard artist- or album-themed playlists, which may correspond to bookmarking favorite artists, or saving full albums as playlists. While these are also valid criteria, we prefer to exclude them in this work. We also keep only the playlists with at least 5 songs to ensure a minimum playlist length. Songs occurring in fewer than 10 playlists are removed to ensure that the models have sufficient observations for each song.

We randomly assign 80% of the playlists to training and the remaining 20% to test. As in any recommendation task blind to item content, the songs that occur only in test playlists need to be removed because they cannot be modeled at training time. This affects the final playlist length and song frequency of the playlist datasets.

The filtered AotM-2011 dataset has 17,178 playlists with 7032 unique songs by 2208 artists. The filtered 8tracks dataset has 76,759 playlists with 15,649 unique songs by 4290 artists. Table 2 reports the distribution of playlist lengths, unique artists per playlist and song frequency in the datasets.

Footnotes:
1 https://github.com/hidasib/GRU4Rec
2 http://www.artofthemix.org
3 https://8tracks.com
4 Publishing playlists and interacting with individual users are still active services on the Art of the Mix, but the forums and blogs seem to be discontinued.
5 https://labrosa.ee.columbia.edu/millionsong

Table 2 Descriptive statistics of the filtered AotM-2011 and 8tracks playlist datasets. We report the distribution of playlist lengths, number of artists per playlist and song frequency in the datasets (i.e., the number of playlists in which each song occurs)

Dataset     Statistic              min   1q   med   3q   max
AotM-2011   Playlist length        5     6    7     8    34
            Artists per playlist   3     5    7     8    34
            Song frequency         1     8    12    20   249
8tracks     Playlist length        5     5    6     7    46
            Artists per playlist   3     5    6     7    41
            Song frequency         1     9    15    30   2320

6 Results

We assess the ability of the four considered playlist models, Popularity, Song-CF, Playlist-CF and RNN, to recover withheld playlist continuations as described in Sect. 3. By comparing the performance of the different playlist models on the same experiment, or the performance of the same model on different experiments, we reason about the importance of considering the song context and the song order for next-song recommendations. Furthermore, we study the impact of the song popularity on the performance of the different models. As a reference, all the results include the performance of a dummy model that ranks next-song candidates at random (we call this model “Random”).

6.1 Song context

Recall that Popularity predicts the next song disregarding the current and previous songs, i.e., it has no context. Song-CF predicts the next song on the basis of the current song but disregards the previous ones, i.e., it has a context of 1 song. Playlist-CF and RNN predict the next song on the basis of the full playlist, i.e., they have full song context.

Footnote: The position in the playlist for which the next-song prediction is made must not be confused with the song context length of the playlist model. For example, making a next-song prediction for a song in position 5, the playlist-based CF model has a context of 4 songs, while the song-based CF still has a context of 1 song (Table 1).

Figure 3 reports the rank distribution of each playlist model. They are split by the position in the playlist for which the next-song prediction is made. We consider only predictions up to position 8, which represent roughly 90% of all the next-song predictions made in the AotM-2011 and the 8tracks datasets. From position 9 onward, the number of predictions quickly decreases and the results become less reliable.

[Figure 3: panels (a) AotM-2011 dataset and (b) 8tracks dataset; x-axis: position; y-axis: rank]
Fig. 3 Song context experiments. Distribution of ranks attained by the actual next songs in the test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. Each panel corresponds to a playlist model. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. The boxplots report the distribution of attained ranks. Outliers are indicated with small horizontal marks. The number of next-song predictions made at every position is annotated in the boxplots

The results in Fig. 3 show that Popularity and Song-CF do not systematically improve their predictions as they progress through the playlists. This is the expected result because Popularity has no context, and Song-CF has a constant context of 1 song. Their rank distributions remain overall stable with fluctuations easily explained by the fact that at each position the models deal with different songs. On the other hand, Playlist-CF and RNN are aware of the full song context. The results in Fig. 3 show that the performance of Playlist-CF clearly improves as it progresses through the playlists, and the performance of RNN improves slightly but steadily. This indicates that Playlist-CF and RNN benefit from increasingly longer song contexts.

In terms of absolute model performance, Song-CF is the least competitive model, slightly better but not clearly different from the random reference. Popularity and RNN show the most competitive overall performances. Playlist-CF has difficulties when the song context is short, but it consistently improves as it gains more context, until it eventually outperforms Popularity.

Summary of main observations:

– Playlist-CF and RNN, aware of the full song context, improve their performance as the song context grows.
– Despite its simplicity, Popularity compares to RNN and, except for long contexts, outperforms Playlist-CF.
– Song-CF exhibits a poor performance.

6.2 Popularity bias

The previous results pose an apparent contradiction: Popularity, unaware of the song context, performs comparably to RNN and, overall, slightly better than Playlist-CF, both aware of the full song context. Is it then important or not to exploit the song context? Furthermore, as discussed by Jannach and Ludewig [14], do marginal performance gains of RNN over Playlist-CF and Popularity justify its higher computational complexity?

To shed light on these two questions we deem it important to analyze the possible impact of the playlist datasets being biased toward popular songs, which is in fact a bias ubiquitous in the music consumption domain [5]. Within our study, we identify the popularity of a song with its frequency in each of the datasets, that is, with the number of playlists in which it occurs in each dataset. Table 2 and Fig. 4 show the song frequency distribution of the AotM-2011 and the 8tracks datasets. The AotM-2011 and the 8tracks datasets present a clear popularity bias, with a vast majority of songs occurring in few playlists and a few songs occurring in many playlists.

We consider again in Fig. 5 the performance of the four playlist models, but this time we distinguish whether the actual next songs in the test playlists were popular or not. We precisely define the popularity of a song as its relative frequency in the training playlists, as given by Eq. (1). The left panels report the median rank obtained when all the next-song predictions are considered. The central panels report the median rank obtained when the actual next songs belong to the 10% most popular songs in the datasets. The right panels report the median rank obtained when the actual next songs belong to the 90% least popular songs in the datasets (which we refer to as the “long tail”). In this particular case we report only the median rank instead of the whole rank distribution to obtain a more compact figure that facilitates the comparison of the playlist models across the different song-popularity levels.

The results in Fig. 5 show that Popularity performs outstandingly well on the most popular songs, but it makes poor predictions for songs in the long tail. This is the natural consequence of its very design (Sect. 4.1). Playlist-CF performs reasonably well on the most popular songs, and it shows a quick improvement as the song context grows. Its performance on long-tail songs is poorer, but it shows a slight improvement as it gains song context, until, given a context of at least 5 songs, it outperforms Popularity. The good performance of Playlist-CF on popular songs is not surprising because the scoring Eq. (3) favors songs occurring in many training playlists. However, the rather poor performance on long-tail songs is less expected, especially if we remember that our implementation of Playlist-CF considers all the training playlists as neighbors (Sect. 4.3), which should help to counteract the large amount of non-popular songs in the playlist datasets. Song-CF also performs better on popular songs than on long-tail songs, especially in the 8tracks dataset, where the popularity bias is stronger (Table 2, Fig. 4). RNN is competitive, and most importantly, in contrast to the other playlist models, its performance is largely unaffected by the popularity of the actual next songs in the test playlists.

Focusing on the performance of the playlist models on all next-song predictions (left panels in Fig. 5), Popularity seems comparable to the more sophisticated RNN. Given enough song context, Playlist-CF also seems to compete with RNN. However, as we have just seen, the overall strong performance of Popularity and Playlist-CF is the result of aggregating the accurate predictions made for a few popular songs (central panels in Fig. 5) with the rather poor predictions made for a vast majority of non-popular songs (right panels in Fig. 5). On the contrary, the performance of RNN is not affected by the song popularity. This observation must be taken into consideration to judge whether the higher computational complexity of the RNN model is justified, also considering the particular use case and target users of each recommender system. For example, the robustness of RNN to the popularity bias would be crucial to assist users interested in discovering long-tail music.

[Figure 4: panels (a) and (b)]
Fig. 4 Unique songs in the AotM-2011 and the 8tracks datasets, sorted by frequency, i.e., by the number of playlists in which they occur. The colored dots correspond to songs located at specific percentile positions, with their absolute and percentile frequencies annotated (the latter in parentheses). Furthermore, examples of frequent and infrequent songs in the datasets are provided, with their absolute frequency annotated in parentheses

[Figure 5: panels (a) and (b)]
Fig. 5 Popularity bias experiments. Median rank attained by the actual next songs in the test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. Left: all songs are considered. Center: only
the 10% most popular songs in the dataset are considered. Right: only the 90% least popular (long-tail) songs in the dataset are considered. The number of next-song predictions made at every position is annotated

Summary of main observations:

– RNN exhibits a competitive performance which is not affected by the popularity of the actual next songs.
– Popularity, Song-CF and Playlist-CF exhibit a considerable performance gap depending on the popularity of the actual next songs.
– Despite its overall poor performance on non-popular songs, Playlist-CF can exploit the song context to eventually outperform Popularity.

6.3 Song order

RNN is the most complex of the four playlist models considered, and it is the only one aware of the song order. Furthermore, we have shown its good performance and robustness in dealing with infrequent music. We now investigate the importance of considering the song order by comparing the performance of RNN when it deals with original playlists, and when it deals with playlists where the song order has been manipulated. The rationale behind this experiment is the following: if the playlist datasets exhibit a consistent song order that RNN exploits to predict next-song recommendations, then we should observe a performance degradation when the song order is deliberately broken.

We devise three song order manipulation experiments. Firstly, we evaluate RNN on shuffled test playlists. This can be regarded as a weak check, because RNN could still have learned patterns based on the song order at training time. Secondly, we train another instance of the RNN model but using shuffled training playlists. We name it “RNN_sh.” We re-tune its hyperparameters to make sure that the performance is not compromised as a consequence of modifying the training playlists, but eventually we keep the same configuration because others do not yield consistent improvements. Then, we evaluate RNN_sh on original test playlists. This is a strong check, because we now make sure that RNN_sh cannot exploit the song order at training time. For completeness, we also evaluate RNN_sh on shuffled test playlists.

[Figure 6: panels (a) AotM-2011 dataset and (b) 8tracks dataset; x-axis: position; y-axis: rank]
Fig. 6 Song order experiments. Distribution of ranks attained by the actual next songs in the test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. The panels report the results of RNN and RNN_sh evaluated on original and shuffled test playlists. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. The boxplots report the distribution of attained ranks. Outliers are indicated with small horizontal marks. The number of next-song predictions made at every position is annotated in the boxplots

Figure 6 reports the rank distribution for each song order randomization experiment. As a reference, we also include the performance of RNN evaluated on original test playlists. The rank distributions are split by the position in the playlist for which the next-song prediction is made. As before, we consider only predictions up to position 8, which represent roughly 90% of all the next-song predictions made in the AotM-2011 and the 8tracks datasets (Sect. 6.1). Surprisingly, the rank distributions are comparable across all song order randomization experiments, regardless of whether the song order is original, broken at test time, broken at training time, or broken both at training and at test time. This result provides an indication that the song order may not be an essential feature for next-song recommendations, and it would agree with similar findings derived from user experiments [17]. Alternatively, even though RNN models are the state of the art in many sequential tasks, this result could be explained by the incapability of this specific RNN model to properly exploit the song order.

To further investigate this question, we create synthetic playlist datasets where the song order is controlled. We start by creating a playlist dataset where the song order within playlists is strictly ruled by an arbitrary but fixed universal song order. This dataset, which we name “One order,” will let us determine whether the considered RNN model (Sect. 4.4) is actually able to capture order information. In any case, we already presume that such a definite song order behavior will not occur in real situations. We hypothesize two possible sources of variation that may better reflect how natural playlists are organized: firstly, instead of a universal song order, there may exist several song orders corresponding to, for example, different underlying music taste profiles; secondly, one or several orders may exist, but they may be followed in a non-strict manner. We create three additional synthetic datasets according to these variations. We create a playlist dataset where the song order within playlists is strictly ruled by one of five arbitrary but fixed song orders, with the same number of playlists following each of the five orders. We refer to this dataset as “Five orders.” We further create noisy versions of “One order” and “Five orders” such that the song order within the playlists is followed but in a non-strict manner. To achieve this, we copy the original datasets but replace a randomly chosen 30% of the songs of each playlist by unordered, randomly sampled songs from outside the playlist. We name the resulting datasets “One order—30% noise” and “Five orders—30% noise.”

Each of the synthetic datasets has 15,000 playlists with 7000 unique songs, and each playlist has a length of exactly 10 songs. These specific values are chosen so that the synthetic datasets have similar characteristics to a natural collection like AotM-2011. The concept of artist does not exist, and the song orders are defined arbitrarily. Using more than five song orders, or a noise factor higher than 30%, yielded very challenging datasets that were not as illustrative as the created ones. The synthetic datasets are split into training and test playlists as described in Sect. 5. We then use them to conduct the song order randomization experiments described at the beginning of this section. Figure 7 reports the rank distribution for each song order randomization experiment, for each synthetic playlist dataset.

[Figure 7: y-axis: rank]
Fig. 7 Randomized song order experiments with synthetic datasets. Distribution of ranks attained by the actual next songs in the synthetic test playlists (closer to 1 is better) for the AotM-2011 and the 8tracks datasets. The panels report the results of RNN and RNN_sh evaluated on original and shuffled test playlists. The x-axis indicates the position in the playlist for which a prediction is made. The y-axis indicates the attained ranks, and its scale relates to the number of songs in each dataset. The boxplots report the distribution of attained ranks. Outliers are indicated with small horizontal marks. The number of next-song predictions made at every position is always exactly 3000

We first analyze the results obtained on “One order” (Fig. 7a), which should let us determine whether the considered RNN model is actually able to capture order information.

dict songs in the proximity of the current song, meaning songs that are few positions before or after the current song within the universal song order. In other words, training the RNN model on shuffled playlists works as a regularization that favors learning song proximity rather than strict song order.
We have found evidence that the playlists in the AotM- The performance of RNN on original test playlists is perfect, 2011 and the 8tracks datasets are not ruled by a strict, with all ranks equal to 1. This shows that the considered universal song order. In fact, since we train dedicated RNN model would be able to capture a universal song order instances of the RNN model for each dataset, we know that if there were one. Consequently, we can conclude that the the playlists are also not ruled by a strict, dataset-specific playlists in the AotM-2011 and the 8tracks datasets are not song order. However, this does not imply that the playlists strictly ordered; otherwise, RNN would have been able to are, on the other end, completely unordered. There may perfectly extend them. Again in Fig. 7a, precisely because exist intermediate song order rules, which in real situations RNN learned the song order strictly, its performance on shuf- may further be affected by different sources of uncertainty. fled test playlists is comparatively very poor. We now move We examine the results obtained on the remaining synthetic on to RNN . The performance of RNN is not perfect but datasets (Fig. 7b–d) as an approximation to the more com- sh sh very good on both original and shuffled test playlists. This plex song order patterns that may underlie the playlists of the suggests that RNN follows a different strategy than RNN. AotM-2011 and the 8tracks datasets. sh Since RNN is trained on shuffled playlists, the strict song Figure 7b, c shows an overall, noticeable performance sh order is not anymore enforced. Instead, RNN learns to pre- degradation compared to Fig. 7a. 
Still, RNN evaluated on sh 123 International Journal of Multimedia Information Retrieval (2019) 8:101–113 111 original test playlists performs almost perfectly for “Five 7 Conclusion orders” and competitively for “One order—30% noise.” The performance degrades strongly when RNN is evaluated on We have explicitly investigated the impact of the song order, shuffled test playlists, which suggests that it had indeed cap- the song context and the popularity bias in music playlists for tured song order patterns. RNN again shows that training the task of predicting next-song recommendations. We have sh on shuffled playlists provides a regularization effect, because conducted dedicated off-line experiments on two datasets its performance on original and shuffled test playlists is of hand-curated music playlists comparing the following comparable. Figure 7d reports the results on the most chal- playlist models: a popularity-based model, a song-based CF lenging of the synthetic datasets and exhibits a generalized, model, a playlist-based CF model, and an RNN model. These clear performance degradation. While we cannot derive fur- models are well established and widely used and exploit the ther conclusions regarding the nature of the playlists in the song context and the song order to different extents. Our AotM-2011 and the 8tracks datasets, the comparison between results indicate that the playlist-based CF model and the RNN Figs. 6 and 7 suggests that real playlists may indeed be sub- model, which can consider the full song context, do benefit ject to complex, noisy song order rules. from increasingly longer song contexts. However, we observe We know that the playlists in the AotM-2011 and the that a longer song context does not necessarily translate into 8tracks datasets do not follow simple song order patterns, outperforming the simpler popularity-based model, which is and thus they could be either completely unordered or ruled unaware of the song context. 
This is explained by the popular- by complex, noisy song order rules. In both cases, the song ity bias in the datasets, i.e., the coexistence of few, popular order experiments on the natural datasets show that RNN songs with many, non-popular songs. Failing to take into performs equivalently to RNN (Fig. 6), which could be the account the popularity bias masks important performance sh result of RNN adopting the same strategy of RNN , that is, differences: the popularity-based model, the song-based CF sh relying on song proximity patterns rather than on complex or model and the playlist-based CF model exhibit considerable inexistent song order patterns. Since the AotM-2011 and the differences in performance depending on the popularity of 8tracks datasets do not have a strict song order, the concept of the actual next songs in the test playlists. On the contrary, the proximity could be understood as fitness, meaning that RNN more complex RNN model has a stable performance regard- and RNN could be predicting next songs that fit well the less of the song popularity. This effect must be taken into sh playlist being extended. This could also explain why RNN account in the design of playlist models for specific use cases and RNN improve their performance as the song context and target users. The RNN model is the only of the consid- sh grows, namely because a longer song context better specifies ered playlist models aware of the song order. We have found the playlist under consideration. A longer song context would that its performance on original and shuffled playlists is com- not necessarily translate into performance improvements if parable, suggesting either that the song order is not crucial RNN could rely on clear song order patterns, in which case for next-song recommendations, or that the RNN model is knowing a single song would suffice to accurately predict the unable to fully exploit it. 
We have further investigated this next one (see the performance of RNN evaluated on original question by evaluating the RNN model on synthetic datasets playlists in Fig. 7a, c). with controlled song orders. We have found that the RNN model is able to capture a universal song order if there is one. This implies that the natural playlists datasets considered do not follow a strict song order, although they might be ruled by complex, noisy song order rules. Finally, and regarding Summary of main observations: the evaluation methodology, we have proposed an approach – RNN achieves comparable performance on original and to assess the quality of the recommendations that observes on shuffled test playlists. It also compares to RNN , sh the complete recommendation lists instead of focusing on the which is completely unaware of the song order. top K recommendations. Doing so provides a more complete – Experiments on strictly ordered synthetic datasets show view on the performance of the playlist models. that RNN can learn song order patterns. Acknowledgements We thank Bruce Ferwerda, Rainer Kelz, Rocío – From the previous, we conclude that the AoM-2011 and del Río Lorenzo, and David Sears for their valuable feedback. This the 8tracks datasets are not strictly ordered. research has received funding from the European Research Council – Further experiments on synthetic datasets suggest that (ERC) under the European Union’s Horizon 2020 research and innova- the AotM-2011 and the 8tracks datasets might be ruled tion programme under grant agreement No 670035 (Con Espressione). Open access funding provided by Johannes Kepler University Linz. by complex, noisy song order rules. – When RNN relies on song fitness patterns rather than on Open Access This article is distributed under the terms of the Creative song order patterns, it benefits from longer song contexts, Commons Attribution 4.0 International License (http://creativecomm which better identify the playlist being extended. 
ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, 123 112 International Journal of Multimedia Information Retrieval (2019) 8:101–113 and reproduction in any medium, provided you give appropriate credit final configurations use it. We tune the number of units, the to the original author(s) and the source, provide a link to the Creative learning rate, the batch size, the amount of momentum, the Commons license, and indicate if changes were made. L2-regularization weight and the dropout probability on a withheld validation set, by running 100 random search exper- iments [2] for each of the loss functions mentioned above. A Model configurations The final model configuration is chosen according to the val- idation recall at 100 (i.e., the proportion of times that the A.1 Song popularity actual next song is included within the top 100 ranked can- didates), which we consider a proxy of the model’s ability to This model computes the popularity of all the unique songs rank the actual next songs in top positions. The number of in the dataset, that is, 7032 songs for the AotM-2011 dataset training epochs is chosen on the basis of the validation loss. and 15,649 songs for the 8tracks dataset. For the AotM-2011 dataset, the final model uses the TOP-1 loss and it has 200 hidden units. It is trained on mini-batches A.2 Song-based collaborative filtering of 16 playlists, with a learning rate of 0.01, a momentum coefficient of 0.5 and an L2-regularization weight of 0.1. This model computes pairwise similarities for all the unique For the 8tracks dataset, the final model uses the TOP-1 loss songs in the dataset, that is, 7032 songs for the AotM-2011 and it has 200 hidden units. It is trained on mini-batches dataset and 15,649 songs for the 8tracks dataset. of 64 playlists, with a learning rate of 0.025, a momentum coefficient of 0.3 and an L2-regularization weight of 0.02. 
A.3 Playlist-based collaborative filtering For both datasets, the hyperparameters and architecture of the RNN models trained on shuffled and reversed playlists This model computes the similarity of each test playlist to all were re-tuned, but since other configurations did not yield the training playlists in the dataset, that is, 13,744 playlists for clearly better results, we decided to use the same settings for the AotM-2011 dataset and 61,416 playlists for the 8tracks consistency. dataset. We also experimented using 100, 500 and 1000 train- ing playlists but did not achieve better results. B Additional experiments A.4 Recurrent neural networks We conduct, for completeness, a related set of experiments We experiment with different loss functions, namely cate- consisting in reversing the song order instead of random- gorical cross-entropy, Bayesian pairwise ranking (BPR) [26] izing it. A similar experiment was proposed by Chen et and TOP-1 [12]. The RNN is optimized using AdaGrad [9] al. [6] to investigate the importance of the “directional- with momentum and L2-regularization. We also experiment ity” in next-song recommendations. Chen et al. found only with dropout [30] in the recurrent layer, but none of the small performance differences evaluating the Latent Markov Fig. 8 Reversed song order experiments. Distribution of ranks attained which a prediction is made. The y-axis indicates the attained ranks, by the actual next songs in the test playlists (closer to 1 is better) for the and its scale relates to the number of songs in each dataset. The box- AotM-2011 and the 8tracks datasets. The panels include the predictions plots report the distribution of attained ranks. Outliers are indicated with of the RNN on the original playlists and on the different reversed song small horizontal marks. The number of next-song predictions made at order experiments. 
The x-axis indicates the position in the playlist for every position is annotated in the boxplots 123 International Journal of Multimedia Information Retrieval (2019) 8:101–113 113 Embedding model on original and reversed playlists. We 15. Jannach D, Lerche L, Kamehkhosh I (2015) Beyond “hitting the hits”: generating coherent music playlist continuations with the replicate the different settings from our previous experi- right tracks. In: Proceedings of RecSys, pp 187–194 ments: we first train the RNN model on original playlists 16. Jansson A, Raffel C, Weyde T (2015) This is my jam data dump. and evaluate it on reversed playlists. Then, we train the In: Proceedings of ISMIR RNN model on reversed playlists and evaluate it on origi- 17. Kamehkhosh I, Jannach D, Bonnin G (2018) How automated rec- ommendations affect the playlist creation behavior of users. In: nal playlists. Finally, we train and evaluate the RNN model Joint proceedings of IUI workshops, Tokyo, Japan on reversed playlists. Figure 8 reports the rank distributions 18. Knees P, Pohle T, Schedl M, Widmer G (2006) Combining audio- under each reversed song order experiment. As expected, the based similarity with web-based data to accelerate automatic music results are comparable to those reported in Fig. 6. That is, the playlist generation. In: Proceedings of international workshop on multimedia IR, pp 147–154 distribution of ranks is comparable for all the reversed song 19. Lee JH, Bare B, Meek G (2011) How similar is too similar? order experiments. Exploring users’ perceptions of similarity in playlist evaluation. In: Proceedings of ISMIR, pp 109–114 20. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. arXiv preprint References arXiv:1506.00019 21. Logan B (2002) Content-based playlist generation: exploratory 1. Aizenberg N, Koren Y, Somekh O (2012) Build your own music experiments. 
In: Proceedings of ISMIR recommender by modeling internet radio streams. In: Proceedings 22. McFee B, Lanckriet GR (2011) The natural language of playlists. of WWW, pp 1–10 In: Proceedings of ISMIR, pp 537–542 2. Bergstra J, Bengio Y (2012) Random search for hyper-parameter 23. McFee B, Lanckriet GR (2012) Hypergraph models of playlist optimization. J Mach Learn Res 13(1):281–305 dialects. In: Proceedings of ISMIR, pp 343–348 3. Bertin-Mahieux T, Ellis DP, Whitman B, Lamere P (2011) The 24. Platt JC, Burges CJ, Swenson S, Weare C, Zheng A (2002) Learn- million song dataset. In: Proceedings of ISMIR, pp 591–596 ing a Gaussian process prior for automatically generating music 4. Bonnin G, Jannach D (2014) Automated generation of music playlists. In: Proceedings of NIPS, pp 1425–1432 playlists: survey and experiments. ACM Comput Surv 47(2):1–35 25. Pohle T, Pampalk E, Widmer G (2005) Generating similarity-based 5. Celma O (2010) Music recommendation and discovery. Springer, playlists using traveling salesman algorithms. In: Proceedings of Berlin DAFx, pp 220–225 6. Chen S, Moore JL, Turnbull D, Joachims T (2012) Playlist pre- 26. Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L (2009) diction via metric embedding. In: Proceedings of SIGKDD, pp BPR: Bayesian personalized ranking from implicit feedback. In: 714–722 Proceedings of UAI, pp 452–461 7. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares 27. Rendle S, Freudenthaler C, Schmidt-Thieme L (2010) Factorizing F, Schwenk H, Bengio Y (2014) Learning phrase representations personalized Markov chains for next-basket recommendation. In: using RNN encoder–decoder for statistical machine translation. Proceedings of WWW, pp 811–820 arXiv preprint arXiv:1406.1078 28. Ricci F, Rokach L, Shapira B (2015) Recommender systems hand- 8. Cunningham SJ, Bainbridge D, Falconer A (2006) “More of an art book, 2nd edn. Springer, New York than a science”: supporting the creation of playlists and mixes. In: 29. 
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based col- Proceedings of ISMIR laborative filtering recommendation algorithms. In: Proceedings of 9. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods WWW, pp 285–295 for online learning and stochastic optimization. J Mach Learn Res 30. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov 12:2121–2159 R (2014) Dropout: a simple way to prevent neural networks from 10. Flexer A, Schnitzer D, Gasser M, Widmer G (2008) Playlist gen- overfitting. J Mach Learn Res 15(1):1929–1958 eration using start and end songs. In: Proceedings of ISMIR, pp 31. Tan YK, Xu X, Liu Y (2016) Improved recurrent neural net- 173–178 works for session-based recommendations. In: Proceedings of 11. Hariri N, Mobasher B, Burke R (2012) Context-aware music DLRS@RecSys, pp 17–22 recommendation based on latent topic sequential patterns. In: Pro- 32. Vall A, Schedl M, Widmer G, Quadrana M, Cremonesi P (2017) ceedings of RecSys, pp 131–138 The importance of song context in music playlists. In: RecSys 2017 12. Hidasi B, Karatzoglou A, Baltrunas L, Tikk D (2016) Session- poster proceedings, Como, Italy based recommendations with recurrent neural networks. In: Pro- 33. Vall A, Quadrana M, Schedl M, Widmer G (2018) The importance ceedings of ICLR of song context and song order in automated music playlist gener- 13. Hidasi B, Quadrana M, Karatzoglou A, Tikk D (2016) Parallel ation. In: Proceedings of ICMPC-ESCOM, Graz, Austria recurrent neural network architectures for feature-rich session- based recommendations. In: Proceedings of RecSys, pp 241–248 14. Jannach D, Ludewig M (2017) When recurrent neural networks Publisher’s Note Springer Nature remains neutral with regard to juris- meet the neighborhood for session-based recommendation. In: Pro- dictional claims in published maps and institutional affiliations. ceedings of RecSys, pp 306–310
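Supplementary sketch: synthetic playlists and song order manipulation (Sect. 6.3). The text describes the synthetic datasets only at a high level, so the following Python sketch is one possible reading and not the authors' implementation. It assumes that a playlist "strictly ruled" by a song order is a contiguous run of songs taken from that order, with wrap-around at the end; this is an assumption made here for illustration. All function names (`make_orders`, `make_playlists`, `add_noise`, `shuffled`) are hypothetical.

```python
import random


def make_orders(n_songs, n_orders, rng):
    """Return n_orders arbitrary but fixed song orders (permutations of song ids)."""
    orders = []
    for _ in range(n_orders):
        order = list(range(n_songs))
        rng.shuffle(order)
        orders.append(order)
    return orders


def make_playlists(orders, n_playlists, length, rng):
    """Build playlists that strictly follow one of the given song orders.

    Assumption: a playlist is a contiguous run of `length` songs starting at a
    random offset of its order, wrapping around at the end of the order.
    Playlists are assigned to orders in round-robin fashion, so every order
    rules the same number of playlists when n_playlists is a multiple of
    len(orders).
    """
    playlists = []
    for i in range(n_playlists):
        order = orders[i % len(orders)]
        start = rng.randrange(len(order))
        playlists.append([order[(start + j) % len(order)] for j in range(length)])
    return playlists


def add_noise(playlist, n_songs, noise, rng):
    """Replace a random fraction `noise` of the songs by songs from outside the playlist."""
    noisy = list(playlist)
    n_replace = round(noise * len(playlist))
    positions = rng.sample(range(len(playlist)), n_replace)
    used = set(playlist)
    for pos in positions:
        song = rng.randrange(n_songs)
        while song in used:  # resample until the song is new to this playlist
            song = rng.randrange(n_songs)
        noisy[pos] = song
        used.add(song)
    return noisy


def shuffled(playlist, rng):
    """Return a copy of the playlist with its song order randomized."""
    return rng.sample(playlist, len(playlist))


# Small-scale usage; the paper uses 15,000 playlists, 7000 songs and length 10.
rng = random.Random(0)
one_order = make_playlists(make_orders(100, 1, rng), 30, 10, rng)
five_orders = make_playlists(make_orders(100, 5, rng), 30, 10, rng)
one_order_noisy = [add_noise(p, 100, 0.3, rng) for p in one_order]
shuffled_test = [shuffled(p, rng) for p in one_order]
```

Under this reading, training on `one_order` and testing on `shuffled_test` corresponds to the weak check, whereas training on shuffled playlists and testing on the original ones corresponds to the strong check.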

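Supplementary sketch: rank-based evaluation. Figures 6 to 8 report the rank attained by the actual next song at each playlist position, and Appendix A.4 selects configurations by the validation recall at 100. The following Python sketch shows how such quantities could be computed from model scores. It is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, scores are assumed to be given as a mapping from song id to score, and ties are resolved optimistically.

```python
from collections import defaultdict


def rank_of_target(scores, target):
    """Rank (1 is best) of the actual next song among all candidate songs.

    `scores` maps song id -> model score (higher is better). Ties are resolved
    optimistically, which is an assumption of this sketch.
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores.values() if s > target_score)


def recall_at_k(ranks, k):
    """Proportion of predictions whose actual next song is within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def ranks_by_position(predictions):
    """Group attained ranks by the playlist position being predicted.

    `predictions` is an iterable of (position, scores, target) triples; the
    result maps each position to the list of attained ranks, which is the
    input needed to draw one boxplot per position.
    """
    grouped = defaultdict(list)
    for position, scores, target in predictions:
        grouped[position].append(rank_of_target(scores, target))
    return dict(grouped)
```

Looking at the full rank distribution, rather than only at a top-K cutoff, matches the evaluation approach summarized in the conclusion.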
Journal

International Journal of Multimedia Information Retrieval, Springer Journals

Published: Apr 3, 2019

References