Pythonでレコメンドシステムを作る(コンテンツベースフィルタリング)

今回はコンテンツベースフィルタリングで、ジョークをお薦めするシステムを作ります。

コンテンツベースフィルタリングとは

コンテンツベースフィルタリングは、アイテムそのものの特徴を利用して、推薦したいユーザがこれまで高評価したアイテムと類似するアイテムを推薦する方法です。
例えば、過去に「ハリー・ポッターと賢者の石」と「指輪物語 1」を高評価しているユーザは、おそらく「小説」「ファンタジー」を嗜好しているため、「ナルニア国物語 1」などを薦める、といった感じです。

ユーザの評価値のみを使って推薦する協調フィルタリングと比較すると、以下の違いがあります。

アイテムのコンテンツを入力する処理が必要
アイテムのコンテンツから特徴ベクトルを抽出する処理が必要
多様性・新規性に欠ける(過去に評価したアイテムと類似するため)
システム全体のユーザ数・評価数が少ない状況でも、予測精度は下がらない
まだどのユーザからも評価されていないアイテムの評価値も予測できる

コンテンツベースフィルタリングを実装する

今回はJester Datasetを使ってユーザにジョークをお薦めするシステムを実装してみます。

コンテンツベースフィルタリングでは、大きく3ステップで推薦します。

アイテムの特徴ベクトルを抽出する
アイテム間の特徴ベクトルの類似度を算出する
過去に高く評価したアイテムと類似度が高いアイテムが上位になるようにランキング化する

Jester Dataset

今回は、英語のジョーク集の評価を集めたJester Collaborative Filtering Datasetを使います。

これは150種類のジョークに対して50692人が-10〜+10で評価しています。

Jester Datasets

本家サイトのは評価値がxlsファイルだったり、ジョーク本文をhtmlからパースする必要があったりするため、非常に扱いにくいものになっています。
そこで、csv/tsvファイルに加工されてkaggleで提供されているデータセットがありますので、そこから jesterfinal151cols.csv (評価値) と jester_items.tsv (ジョーク本文)をダウンロードして、これから作るPythonファイルと同じディレクトリに配置してください。

www.kaggle.com

ただし、そのkaggleのjester_items.tsvは150番目のジョークが抜けているので、本家から150番目のジョークをダウンロードして以下の1行を追加します。

150: In an interview with David Letterman, Carter passed along an anecdote of a translation problem in Japan. Carter was speaking at a business lunch in Tokyo, where he decided to open his speech with a brief joke. He told the joke, then waited for the translator to announce the Japanese version. Even though the story was quite short, Carter was surprised by how quickly the interpreter was able to re-tell it. Even more impressive was the reaction from the crowd. Carter thought the story was cute, but not outright hilarious, yet the crowd broke right up. Carter was very flattered. After the speech, Carter wanted to meet the translator to ask him how he told the joke. Perhaps there is better way to tell the joke? When Carter asked how the joke had been told in Japanese, the translator responded, "I told them, 'President Carter has told a very funny joke. Please laugh now.'"

特徴ベクトルの抽出

まずはjester_items.tsvからジョーク本文をDataFrameに読み込みます。

import pandas as pd
pd.set_option("display.max_colwidth", 1000)


jokes = pd.read_csv('jester_items.tsv', sep='\t', names=['id', 'joke'])
jokes.drop('id', axis=1, inplace=True)
print('Joke {0}: {1}'.format(0, jokes.ix[0, 'joke']))
# Joke 0: A man visits the doctor. The doctor says, "I have bad news for you. You have cancer and Alzheimer's disease". The man replies, "Well, thank God I don't have cancer!"

f:id:ohke:20170927092221p:plain

次にTF-IDFで特徴ベクトルを抽出します。
以前の投稿も参考にしてください。

Porterでステミングすることで、時制や単複の違いを吸収しています
語の出現頻度による足切りは行いません(min_df=1)

scikit-learnでスパムメッセージを分類する(TfidfVectorizer + PorterStemmer + BernoulliNB) - け日記

from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


class PorterTokenizer(object):
    def __init__(self):
        self.porter = PorterStemmer()
        
    def __call__(self, doc):
        return [self.porter.stem(w) for w in word_tokenize(doc)]
    

# Porterでステミングして、TF-IDFで特徴ベクトルを構築する
vectorizer = TfidfVectorizer(norm='l2', min_df=1, tokenizer=PorterTokenizer())
vectorizer.fit(jokes['joke'])

print(vectorizer.transform(jokes.ix[0]))
#   (0, 1877)  0.136130990259
#   (0, 1809)  0.143046021434
#   (0, 1779)  0.21261619802
#   ...
print(vectorizer.vocabulary_)
# {'a': 98, 'man': 1049, 'visit': 1779, 'the': 1672, 'doctor': 555, '.': 26, ...

類似度の算出

得られた特徴ベクトル同士のコサイン類似度を計算します。こちらも以前の投稿で同様の計算をしていますね。

scikit-learnでスパムメッセージを分類する(TfidfVectorizer + PorterStemmer + BernoulliNB) - け日記

実装上の留意点が2点あります。

類似度行列は対称行列になります
- similarities[0][2]とsimilarities[2][0]はいずれ0番目と2番目のジョークの類似度を表しており、同じ値です
類似度行列の対角成分はnp.nanと置いています
vectorizer.transformで得られる特徴ベクトルはscipy.sparse.csr.csr_matrix(疎行列用のデータ型)となっており扱いにくいので、toarrayでndarrayに変換しています

得られた類似度行列の最大値は0.978で、112番目と132番目のジョークでした。 2つを比較してみると、", walking by,"が入っているかどうかの違いだけで、同じ内容となっていました。

import numpy as np


def get_similarities(words_vector):
    """
    BoWベクトル同士のコサイン類似度行列を計算する
    """
    similarities = np.zeros((words_vector.shape[0], words_vector.shape[0]))
    
    for i1, v1 in enumerate(words_vector):
        for i2, v2 in enumerate(words_vector):
            if i1 == i2:
                similarities[i1][i2] = np.nan
            else:
                similarities[i1][i2] = np.dot(v1, v2.T) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    return similarities


similarities = get_similarities(vectorizer.transform(jokes['joke']).toarray())

print('Max similarity: {0:.3f}'.format(np.nanmax(similarities), np.nanargmax(similarities)))
# Max similarity: 0.978
print('Joke {0}:\n{1}'.format(np.nanargmax(similarities) // similarities.shape[0], 
                              jokes.ix[np.nanargmax(similarities) // similarities.shape[0], 'joke']))
print('Joke {0}:\n{1}'.format(np.nanargmax(similarities) % similarities.shape[0], 
                              jokes.ix[np.nanargmax(similarities) % similarities.shape[0], 'joke']))
# Max similarity: 0.978
# Joke 112:
# The new employee stood before the paper shredder looking confused. "Need some help?" a secretary asked. "Yes," he replied. "How does this thing work?" "Simple," she said, taking the fat report from his hand and feeding it into the shredder. "Thanks, but where do the copies come out?"
# Joke 132:
# The new employee stood before the paper shredder looking confused. "Need some help?" a secretary, walking by, asked. "Yes," he replied, "how does this thing work?" "Simple," she said, taking the fat report from his hand and feeding it into the shredder. "Thanks, but where do the copies come out?"

評価値の予測

評価値をロードして、任意のユーザとアイテムの評価値を予測します。

まずjesterfinal151cols.csvファイルをndarrayにロードします。

評価値は-10.0〜+10.0の連続値となっており、未評価は99です
csvファイルに値が入っていない列があるため、filling_valuesで未評価(99)で補完しています
- numpy.loadtxtではこうした補完ができずエラーになってしまうので、genfromtxtを使っています

次に、評価値を予測する関数predict_ratingを実装します。
ユーザuのアイテムaに対する評価値を、以下の式で予測しています。 $sim(a,b)$ はアイテムaとbの類似度、Iは評価済みアイテム集合、r_bはアイテムbの評価値です。
他のユーザの評価値を使わずに、対象ユーザの評価値とアイテムの特徴ベクトルの類似度から計算していることに着目してください(協調フィルタリングと異なる点ですね)。

0番目のユーザの0番目のアイテムに対する評価値を予測すると、3.391となりました。

# 評価値のロード
ratings = np.genfromtxt('jesterfinal151cols.csv', delimiter=',', filling_values=(99))
ratings = np.delete(ratings, 0, axis=1)

print('Rating 0:\n{}'.format(ratings[0]))
# Rating 0:
# [  9.90000000e+01   9.90000000e+01   9.90000000e+01   9.90000000e+01
#   2.18750000e-01   9.90000000e+01  -9.28125000e+00  -9.28125000e+00
# ...


def predict_rating(rating, item_index, similarities):
    """
    評価値を予測する
    """
    numerator = 0.0
    enumerator = 0.0
    
    for i, r in enumerate(rating):
        if r != 99:
            numerator += similarities[item_index][i] * r
            enumerator += similarities[item_index][i]
            
    return numerator / enumerator if enumerator != 0 else np.nan


# 評価値の予測
print('Predicted rating: {:.3f}'.format(predict_rating(ratings[0], 0, similarities)))
# Predicted rating: 3.391