manga
Go to the manga explore page to try out the recommendations. Note that this page may take some time to load.
See the manga embedding plots page for a visualization of the recommendations through the embedding space.
See the manga feed page for a feed of manga from MangaDex that have been re-ranked.
notes
table of contents
- overview
- embedding overview
- latent semantic indexing (LSI)
- word2vec embeddings
- network embeddings
- approximate nearest neighbors
- evaluation
- personalization
- changelog
overview
We’ve reached the point where we can finally build a reasonable model for making content recommendations. As noted in the groups notes, there are two main classes of recommendation models. The first is collaborative filtering, which relies on implicit user interactions (e.g., favoriting a series). The second is content-based filtering, which relies on properties of the items themselves (e.g., tags, descriptions, etc.). User interaction data is not available through the public API, so we focus on content-based models, with the potential for personalization in the future.
We can construct three interesting models with tag data alone, which we treat as baselines for more complex models. All three account for the distributional semantics of tags, but in slightly different ways. The first is a latent semantic indexing (LSI) model, which uses the singular value decomposition (SVD) of a scaled document-term matrix to project the manga into a low-rank space. We recommend the most similar manga based on the cosine similarity of the left singular vectors found by the SVD. The second builds on the word2vec model, a neural network that learns to represent words in a vector space. We represent each manga by the average of its tag vectors, then use the cosine similarity between manga vectors to make recommendations. The final model makes recommendations based on neighbors in the manga-tag bipartite network. We count all the possible paths between two manga through shared tags to create a manga-manga network, and recommend the most similar manga based on the number of shared tags. To keep the models consistent with each other, we embed the network onto a manifold using UMAP, applied to both the adjacency and Laplacian matrices.
One immediate extension we can make to these models is to use frequent itemsets as words instead of just tags. In other words, we consider all bi-grams, tri-grams, etc. of tags as words. Another extension is to incorporate the description of each manga into the vector-space models using transformer models. One such method is to take the BERT or GPT-Neo embeddings of the description and append them to the tag embeddings. Finally, we can consider adding in popularity or rating statistics to the models. These statistics would likely act as a bias term to re-rank recommendations, and would likely require hyper-parameter tuning to be effective.
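As a rough illustration of the frequent-itemset idea, the sketch below expands a manga's tag set into single tags plus pairs and triplets; the `+` joiner and the `max_size` parameter are arbitrary choices for the example, not part of the actual pipeline.

```python
from itertools import combinations

def tag_itemsets(tags, max_size=3):
    """Expand a tag set into 1-, 2-, and 3-tag "words" for the embedding models."""
    tags = sorted(tags)  # tags are unordered, so fix a canonical order
    items = []
    for n in range(1, max_size + 1):
        items.extend("+".join(combo) for combo in combinations(tags, n))
    return items

print(tag_itemsets(["Romance", "Comedy", "School Life"]))
# ['Comedy', 'Romance', 'School Life', 'Comedy+Romance', ...]
```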
embedding overview
We will talk about embeddings throughout the rest of this page. An embedding is a low-dimensional geometric representation of a high-dimensional space that preserves the relative distances between points. These embeddings are metric spaces, defined by a distance function such as the Euclidean distance or the cosine distance. The Euclidean distance is the most common metric, and is what you would use to measure the distance between two points in 2D or 3D space. It is also known as the $L^2$ norm, and is defined as:

$$d(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert_2 = \sqrt{\sum_i (x_i - y_i)^2}$$
We might instead be interested in the angle between two points. This is common with very sparse vectors, where each column of the vector might represent a (normalized) count feature. The cosine distance then reflects the number of common features between two vectors as an angle. The cosine similarity is the dot product between two points divided by the product of their norms, and the cosine distance is one minus that quantity:

$$d_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert_2 \, \lVert \mathbf{y} \rVert_2}$$
We build embeddings because the low-dimensional vector space is much easier to work with and visualize. Our recommendation models can simply compare the distance between two points in the embedding space to determine how similar they are. We make use of efficient algorithms to find the nearest neighbors of a given point, and can even use the embedding to visualize the data. One example that we go into in depth later is embedding a network (or graph) into a smaller vector space. There are 64k manga in the dataset; when represented as an adjacency matrix of (64464, 64464) 32-bit floats, it takes up 15.5 GiB of memory. We can preserve distances by embedding this into a matrix of (64464, 256) elements, which takes up 62 MiB of memory. This is several orders of magnitude smaller, while preserving nearest neighbor relationships from the original space.
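The arithmetic behind those figures is straightforward; a quick back-of-the-envelope check, using the 32-bit float size assumed above:

```python
# memory footprint of the dense adjacency matrix vs. a 256-d embedding
n_manga, n_dims = 64464, 256
bytes_per_float = 4  # 32-bit floats

adjacency_gib = n_manga * n_manga * bytes_per_float / 2**30  # ~15.5 GiB
embedding_mib = n_manga * n_dims * bytes_per_float / 2**20   # ~63 MiB
print(f"{adjacency_gib:.1f} GiB vs {embedding_mib:.0f} MiB")
```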
distribution of tags
Our embeddings will focus on tag data. Tags are available for all manga, and are the primary way that users might search for manga. All of the embeddings exploit the distributional semantics of tags, i.e., the meaning of a tag is determined by the tags that appear near it, and the meaning of a manga is determined by the tags that appear on it.
Here we show a histogram of how often a given tag appears in the dataset.
The histogram represents the distribution of tag counts. We note that a few tags appear very frequently, while most tags appear very infrequently. If we take the log of the tag counts, we can see that the distribution looks more like a normal distribution. The raw aggregated data can be found in the following table:
genre
name | count |
---|---|
Romance | 25827 |
Comedy | 22401 |
Drama | 19622 |
Slice of Life | 13781 |
Fantasy | 12470 |
Boys’ Love | 11896 |
Action | 11559 |
Adventure | 7232 |
Girls’ Love | 4989 |
Psychological | 4780 |
Tragedy | 4103 |
Mystery | 4103 |
Historical | 3744 |
Sci-Fi | 3053 |
Horror | 2802 |
Isekai | 2112 |
Thriller | 1892 |
Sports | 1600 |
Crime | 1156 |
Philosophical | 841 |
Mecha | 810 |
Wuxia | 365 |
Superhero | 359 |
Medical | 342 |
Magical Girls | 308 |
theme
name | count |
---|---|
School Life | 13235 |
Supernatural | 9179 |
Magic | 2860 |
Martial Arts | 2453 |
Harem | 2389 |
Monsters | 2262 |
Demons | 1789 |
Crossdressing | 1762 |
Survival | 1730 |
Reincarnation | 1708 |
Loli | 1670 |
Office Workers | 1459 |
Animals | 1444 |
Military | 1361 |
Incest | 1238 |
Shota | 1203 |
Video Games | 1138 |
Delinquents | 1081 |
Monster Girls | 965 |
Genderswap | 933 |
Time Travel | 839 |
Ghosts | 755 |
Cooking | 755 |
Police | 712 |
Vampires | 622 |
Music | 608 |
Mafia | 586 |
Post-Apocalyptic | 546 |
Aliens | 538 |
Reverse Harem | 459 |
Villainess | 415 |
Gyaru | 362 |
Samurai | 343 |
Zombies | 303 |
Virtual Reality | 226 |
Ninja | 215 |
Traditional Games | 133 |
format
name | count |
---|---|
Oneshot | 14874 |
Doujinshi | 11985 |
Web Comic | 10328 |
Full Color | 9369 |
Long Strip | 9084 |
Adaptation | 4225 |
Anthology | 1664 |
4-Koma | 1277 |
Official Colored | 1211 |
Award Winning | 692 |
User Created | 531 |
Fan Colored | 92 |
content
name | count |
---|---|
Sexual Violence | 2103 |
Gore | 1783 |
We also take a look at how many tags each manga has.
Like before, we note that the distribution is heavy-tailed, and looks more normal when we take the log of the counts.
latent semantic indexing (LSI)
Latent semantic indexing (LSI) is a traditional information retrieval technique used to find similar documents based on the frequency of words in the documents. It indexes into a latent (low-rank) space formed by the singular value decomposition (SVD) of a scaled document-term matrix. The LSI recommendation model is theoretically sound, and makes a good baseline for comparison.
document-term matrix and tf-idf
We generate a document-term matrix where each row is a manga, and each column is a tag count. We normalize the tag counts with term frequency-inverse document frequency (tf-idf) to account for the fact that some tags appear far more frequently than others. This boosts the relevance of tags that occur rarely but form a coherent cluster in the data. The term frequency is the relative frequency of term $t$ in document $d$:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

The inverse document frequency is a measure of how much information a term provides. If a term appears in more documents, then it is less informative and is therefore down-weighted.

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Together, this gives us the tf-idf score:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
This is a common weighting scheme that is used in information retrieval systems like Lucene and Elasticsearch.
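As a rough sketch of what this looks like in code (using scikit-learn rather than whatever backend the pipeline actually uses, and a small hypothetical `manga_tags` list), the tag lists can be fed to `TfidfVectorizer` with an identity analyzer so that multi-word tags stay intact:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy data: each manga is represented by its list of tags
manga_tags = [
    ["Romance", "Comedy", "School Life"],
    ["Action", "Fantasy", "Isekai", "Reincarnation"],
    ["Romance", "Drama", "School Life"],
]

# the identity analyzer keeps tags like "School Life" as single terms
vectorizer = TfidfVectorizer(analyzer=lambda tags: tags)
X = vectorizer.fit_transform(manga_tags)  # sparse (n_manga, n_tags) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```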
indexing and querying with SVD
We could use the tf-idf vector representation of terms to find similar manga. However, this representation doesn’t take into account the relationships between terms. We can use the SVD to find a lower-dimensional representation (latent space) of the data that preserves the relationships between terms. This is a type of soft-clustering, where each manga is now represented as a weighted sum of membership in various tag clusters.
SVD decomposes a matrix $X$ into a matrix of left singular vectors $U$, a diagonal matrix of singular values $\Sigma$, and a matrix of right singular vectors $V$:

$$X = U \Sigma V^T$$

This is a linear transformation that finds a projection preserving the largest variance in the data. To make a query into the latent space, we first transform a set of tags into a tf-idf query vector $q$. We then map this query vector into the latent space by multiplying it by the truncated matrix of left singular vectors.
Then, we just find the nearest neighbors to the query by cosine distance in the latent space.
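A minimal sketch of the indexing and querying steps, assuming the tf-idf matrix `X` and the vectorizer from the previous sketch, with scikit-learn's `TruncatedSVD` standing in for the actual SVD backend (the number of components is illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# project the tf-idf document-term matrix into a low-rank latent space
svd = TruncatedSVD(n_components=2, random_state=42)
manga_latent = svd.fit_transform(X)  # (n_manga, k)

# build a tf-idf query vector from a set of tags and map it into the latent space
query = vectorizer.transform([["Romance", "School Life"]])
query_latent = svd.transform(query)

# recommend the manga closest to the query by cosine similarity
scores = cosine_similarity(query_latent, manga_latent).ravel()
print(np.argsort(-scores)[:10])
```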
word2vec embeddings
The word2vec model is a neural network that learns to represent words in a vector space. We use the word2vec model to represent tags in a vector space. We represent each manga as an average of its tag vectors, and use the cosine similarity between manga vectors to make recommendations.
There are two methods for training word2vec models: continuous bag of words (CBOW) and skip-gram.
The skip-gram model predicts the surrounding context words given a target word, and is sensitive to the positions of words within a sentence.
This is not ideal for our use case, because tags are not ordered.
The PySpark implementation of word2vec only supports the skip-gram model, so we use gensim to train a CBOW word2vec model on the tag data.
The set of tags on a manga is treated as a sentence in the model.
We set the embedding size to 16, justified as roughly the square root of the number of tags rounded to the nearest power of 2.
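A minimal sketch of the training step with gensim, assuming `manga_tags` is the full list of per-manga tag lists; the window size and `min_count` are illustrative (a large window makes every tag on a manga count as context):

```python
import numpy as np
from gensim.models import Word2Vec

# each manga's tag list is treated as a "sentence"
model = Word2Vec(
    sentences=manga_tags,
    vector_size=16,  # embedding size discussed above
    window=50,       # large window so all co-occurring tags are context
    min_count=1,
    sg=0,            # 0 = CBOW, 1 = skip-gram
)

def manga_vector(tags):
    """Represent a manga as the average of its tag vectors."""
    vectors = [model.wv[tag] for tag in tags if tag in model.wv]
    return np.mean(vectors, axis=0)
```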
We visualize the tag embeddings using UMAP.
The dimensions themselves are meaningless, but the relative distances between tags are meaningful. We see meaningful clusters of tags, such as “isekai”, “villainess”, “reincarnation”, and “fantasy”.
network embeddings
We can also look at recommendations through a network representation.
construction
We represent our network as matrices. We generate a bipartite adjacency matrix between manga and tags, where each row is a manga, and each column is a tag. We project this matrix into a single mode, to get a manga-manga network where two manga are connected by the dot product of their tags.
There are two representations that interest us: the adjacency matrix and the Laplacian matrix. The adjacency matrix has a row and column for each manga, where two manga are connected by the number of shared tags. The Laplacian matrix is the degree matrix minus the adjacency matrix.
The Laplacian matrix is related to the connectivity of the graph, and is often used for clustering.
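A minimal sketch of the construction, assuming `B` is a sparse (n_manga, n_tags) incidence matrix of tag counts (filled here with random placeholder data):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import laplacian

# placeholder bipartite incidence matrix: rows are manga, columns are tags
B = sparse.random(1000, 76, density=0.05, format="csr", random_state=42)

# one-mode projection: A[i, j] is the dot product of the tag vectors of manga i and j
A = (B @ B.T).tolil()
A.setdiag(0)  # drop self-loops
A = A.tocsr()

# Laplacian matrix: degree matrix minus adjacency matrix
L = laplacian(A)
```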
removing indirect effects
We explore two techniques to remove indirect effects from the network, as described by this nature article (scihub link). These techniques were designed for a gene regulatory network inference task, but can be applied to our network. Our network construction is purely indirect, since it relies on paths through tags to connect manga. As we introduce derivative tags representing pairs and triplets of tags, we introduce even more indirect effects. Therefore, it might be interesting to determine the effects of removing indirect effects from the network.
Neither of these methods are tractable on the full 64k manga dataset, but it’d be interesting to try this on a smaller subset of the data in the future.
network deconvolution
Network deconvolution is a general method for removing indirect effects from a network. The observed network is modeled as the transitive closure over the direct network, which has the following closed form:

$$G_{obs} = G_{dir} + G_{dir}^2 + G_{dir}^3 + \dots = G_{dir}(I - G_{dir})^{-1}$$

The network deconvolution is then:

$$G_{dir} = G_{obs}(I + G_{obs})^{-1}$$

Concretely, we perform the following operation, which rescales the eigenvalues found through the eigen decomposition of the adjacency matrix:

$$\lambda_i^{dir} = \frac{\lambda_i^{obs}}{1 + \lambda_i^{obs}}$$
import numpy as np
import scipy.linalg

def network_deconvolution(G):
    """Deconvolve a network matrix using the eigen decomposition.

    https://www.nature.com/articles/nbt.2635
    """
    # eigen decomposition of the symmetric adjacency matrix
    lam, v = scipy.linalg.eigh(G)
    # rescale the eigenvalues: lambda_dir = lambda / (1 + lambda)
    lam_dir = lam / (1 + lam)
    # reconstruct the deconvolved matrix
    G_dir = v @ np.diag(lam_dir) @ v.T
    # remove diagonal, threshold positive values, and rescale in-place
    np.fill_diagonal(G_dir, 0)
    np.maximum(G_dir, 0, out=G_dir)
    G_dir /= np.max(G_dir)
    # NOTE: check reconstruction
    # G_dir = v @ np.diag(lam) @ v.T
    return G_dir
We find that the naive implementation using scipy's eigen decomposition of the adjacency matrix is very slow. We left the deconvolution running for 8 hours before aborting the program due to intractability. We note that there is an alternative reparameterization that uses gradient descent, but this is out of scope for this experiment.
We can also use the SVD of the adjacency matrix to find the deconvolution. Instead of rescaling the eigenvalues from the eigen decomposition, we compute the matrix inverse through the SVD. Both methods have the same big-O complexity of $O(n^3)$.
Truncating the results of the eigen decomposition or SVD leads to unstable results on small toy examples; the effects of approximating the matrix are uncertain in this case.
global silencing of indirect correlations
Global silencing of indirect correlations is another method for removing indirect effects from a network. The closed form of the silenced network, where indirect effects are removed, is given by:

$$S = \left(G - I + \mathcal{D}\left((G - I)\,G\right)\right) G^{-1}$$

where $\mathcal{D}(\cdot)$ is an operator that sets everything but the diagonal to zero.
This is a closed form solution that can be computed relatively quickly given the inverse of the adjacency matrix. We can use the pseudo-inverse of the adjacency matrix by computing the truncated SVD of the adjacency matrix.
import numpy as np
import scipy.linalg

def network_silencing(G):
    """Remove indirect effects using a silencing method.

    https://www.nature.com/articles/nbt.2601
    """
    # svd of G, used to build a pseudo-inverse
    u, s, vh = scipy.linalg.svd(G)
    # diagonal perturbation term D((G - I) @ G)
    perturb = np.diag((G - np.eye(G.shape[0])) @ G)
    G_dir = (
        (G - np.eye(G.shape[0]) + np.diag(perturb))
        # multiply with the pseudo-inverse of the symmetric matrix G
        @ (u @ np.linalg.pinv(np.diag(s)) @ vh)
    )
    # remove diagonal, threshold positive values, and rescale in-place
    np.fill_diagonal(G_dir, 0)
    np.maximum(G_dir, 0, out=G_dir)
    G_dir /= np.max(G_dir)
    # NOTE: check reconstruction
    # G_dir = u @ np.diag(s) @ vh
    return G_dir
comparisons on toy networks
First we take a look at a ring network, where weights between successive nodes are 1, and weights between non-successive nodes are inversely weighted by the distance between the nodes (plus some random noise to break symmetry).
We see that network deconvolution keeps edges between nodes that share neighbors. Network silencing instead removes weak edges in the original network.
The semantics of which method is better depends on the construction of the network.
We take a look at another network, which is a bipartite network that we project onto one mode.
We treat the two subsets of nodes as manga and tags, where manga are labeled 0-4 and tags are labeled 5-7.
There are node_id % len(tags) edges from each manga to each tag, which are weighted by the originating id of the node.
When we project this, we get a network where each manga is connected to another manga through the number of tags that it shares.
For example, nodes 2 and 4 share two neighbors, so they are connected by an edge that is the dot product of their edge weights.
We then run network deconvolution and silencing on the projected network. We note that the deconvolution method removes edges going through the most common tags, although the behavior is difficult to intuit. The silencing method retains the same structure as the projected network, where nodes are connected by edges representing the number of paths between them through tags.
We also note that in the context of a fully connected network with equal weights (a clique), the deconvolution method removes all edges while the silencing method retains all edges.
embedding
The spectral embedding model is a method for dimensionality reduction based on network connectivity. We use this to reduce the dimensionality of the network, and to ensure that the different modeling backends are exposed through the same API.
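A minimal sketch, using scikit-learn's `SpectralEmbedding` with a precomputed affinity (the adjacency matrix `A` from the construction sketch above) as a stand-in for whatever embedding backend is actually used; the number of components is illustrative:

```python
from sklearn.manifold import SpectralEmbedding

# embed the manga-manga network into a small vector space using its connectivity
embedder = SpectralEmbedding(n_components=16, affinity="precomputed")
manga_embedding = embedder.fit_transform(A.toarray())  # (n_manga, 16)
```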
approximate nearest neighbors
We make recommendations by finding the nearest neighbors of each manga in vector space.
nn-descent
NN-Descent is a fast approximate nearest neighbor algorithm that uses a graph to find the nearest neighbors of a point.
pynndescent is an implementation of the algorithm that is suitable for our purposes. It supports a variety of distance metrics, including the cosine distance.
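A minimal sketch of building and querying an index with pynndescent, assuming `manga_embedding` is one of the embeddings above:

```python
from pynndescent import NNDescent

# build an approximate nearest neighbor index over the embedding
index = NNDescent(manga_embedding, metric="cosine", n_neighbors=30)

# query the 10 closest manga for the first few rows of the embedding
neighbors, distances = index.query(manga_embedding[:5], k=10)
print(neighbors.shape, distances.shape)  # (5, 10) (5, 10)
```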
locality-sensitive hashing
Locality-sensitive hashing (LSH) is a fast approximate nearest neighbor algorithm that uses a hash function to find the nearest neighbors of a point. LSH is a potential alternative to nn-descent that’s implemented directly in PySpark. However, we are limited to MinHash and Random Projection methods, which only work for Jaccard and Euclidean distances.
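For reference, a hedged sketch of what the PySpark route might look like with `MinHashLSH` over binary tag vectors; the DataFrame `df` with a sparse `features` column and the 76-tag vocabulary size are assumptions for the example:

```python
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

# df is assumed to have a sparse binary "features" column of tag indicators
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)

# approximate nearest neighbors by Jaccard distance for a query tag vector
query = Vectors.sparse(76, [0, 3, 10], [1.0, 1.0, 1.0])
model.approxNearestNeighbors(df, query, numNearestNeighbors=10).show()
```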
evaluation
jaccard index
The Jaccard index is a measure of similarity between two sets. It is defined as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
null model
The null model is a baseline for evaluating the performance of a recommendation model. We create a null model by selecting random recommendations for each manga. The null model takes in no information about the data. We then compare the performance through a statistical significance test. We consider using the jaccard R package to perform the tests.
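A minimal sketch of the comparison, treating top-k recommendation lists as sets of manga ids; the significance test itself is left to the jaccard package, and the example lists are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def jaccard_index(a, b):
    """Jaccard similarity between two sets of recommendations."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def null_recommendations(n_manga, k=10):
    """Null model: k manga chosen uniformly at random."""
    return rng.choice(n_manga, size=k, replace=False)

# hypothetical comparison of a model's top-10 list against the null model
model_top10 = [1, 5, 9, 12, 40, 41, 87, 90, 101, 120]
print(jaccard_index(model_top10, null_recommendations(64464)))
```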
personalization
Personalization is one future direction we could take in developing a recommendation model. We might achieve this by incorporating a weighted average of a query vector with a user's personalization vector. For example, suppose a user is known to like manga X, Y, and Z. We can represent this preference as the weighted average of their tag vectors to form a personalization vector. Then, when we query the database for manga similar to a query manga, we also average in the user's preferences.
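A minimal sketch of the blending step, where `alpha` is a hypothetical weighting parameter and the vectors come from any of the embeddings above:

```python
import numpy as np

def personalized_query(query_vec, liked_vecs, alpha=0.7):
    """Blend a query vector with a user's personalization vector.

    The personalization vector is the average of the embeddings of
    manga the user is known to like; alpha controls the blend.
    """
    personalization = np.mean(liked_vecs, axis=0)
    blended = alpha * query_vec + (1 - alpha) * personalization
    return blended / np.linalg.norm(blended)
```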
changelog
- 2022-12-20: Initial version of the manga recommendations page.
- 2022-12-27: Added reference to network analysis, section on embedding plots
- 2022-12-29: Filling in more of the details for embeddings and some models.
- 2022-12-30: network cleanup work, final embedding plots