Go to the manga explore page to try out the recommendations. Note that this page may take some time to load.

See the manga embedding plots page for a visualization of the recommendations through the embedding space.

See the manga feed page for a feed of manga from MangaDex that have been re-ranked.


table of contents


We’ve reached the point where we can finally build a reasonable model for making content recommendations. As noted in the groups notes, there are two main recommendation models. The first is collaborative filtering models that rely on implicit user interactions (e.g., favoriting a series). The second class of models is content-based models that rely on properties of the data (e.g., tags, descriptions, etc.). User interaction data is not available through the public API. Instead, we focus on content-based models with the potential for personalization in the future.

We can construct three interesting models with tag data alone. We treat these as baselines for more complex models. These all account for the distributional semantics of tags, but in slightly different ways. The first model is a latent semantic indexing (LSI) model, which uses the singular values of a scaled document-term matrix to project the manga into a low-rank space. We recommend the most similar manga based on the cosine similarity of the left singular vector found by the singular value decomposition (SVD). The second builds on the word2vec model, a neural network that learns to represent words in a vector space. We represent each manga by the average of its tag vectors. Then we use the cosine similarity between manga vectors to make recommendations. The final model makes recommendations based on neighbors in the manga-tag bipartite network. We count all the possible paths between two manga through shared tags to create a manga-manga network. We then recommend the most similar manga based on the number of shared tags. To ensure consistency among models, we embed the network onto a manifold using UMAP with justification on both the adjacency and laplacian matrices.

One immediate extension we can make to these models is to use frequent itemsets as words instead of just tags. In other words, we consider all bi-grams, tri-grams, etc. of tags as words. Another extension is to incorporate the description of each manga into the vector-space models using transformer models. One such method is to take the BERT or GPT-Neo embeddings of the description and append them to the tag embeddings. Finally, we can consider adding in popularity or rating statistics to the models. These statistics would likely act as a bias term to re-rank recommendations, and would likely require hyper-parameter tuning to be effective.

embedding overview

We will talk about embeddings throughout the rest of this page. An embedding is a low-dimensional geometric representation of a high-dimensional space which preserves the relative distances between points. These embeddings are metric spaces defined by their embedding e.g. the Euclidean distance or cosine distance. The Euclidean distance is the most common metric, and is what you would use to measure the distance between two points in a 2D or 3D space. It is also known as the L2L^2 norm, and is defined as:

d(p,q)=i=1n(piqi)2=pqd(p, q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2} = \|p - q\|

We might instead be interested in the angle between two points. This is common with very sparse vectors, where each column of the vector might represent a (normalized) count feature. The cosine distance would then reflect the number of common features between two vectors as an angle. The cosine distance is defined as the dot product between two points divided by the product of their norms:

d(p,q)=1pqpq=piqipi2qi2d(p, q) = 1 - \frac{p \cdot q}{\|p\| \|q\|} = \frac{\sum{p_i q_i}}{\sqrt{\sum{p_i^2}} \sqrt{\sum{q_i^2}}}

We build embeddings because the low-dimensional vector space is much easier to work with and visualize. Our recommendation models can simply compare the distance between two points in the embedding space to determine how similar they are. We make use of efficient algorithms to find the nearest neighbors to a given point, and can even use the embedding to visualize the data. One example that we go into depth later is embedding a network (or graph) into a smaller vector. There are 64k manga in the dataset; when represented as an adjacency matrix of (64464, 64464) elements of 32-bit floats it takes up 15.5 GiB of memory. We can preserve distances by embedding this into a vector of (64464, 256) elements, which takes up 62 MiB of memory. This is several orders of magnitude smaller, while preserving nearest neighbor relationships from the original space.

distribution of tags

Our embeddings will focus on tag data. These are available for all manga, and is the primary way that users might search for manga. All of the embeddings exploit distributional semantics of tags i.e. the meaning of a tag is determined by the tags that appear near it, and the meaning of a manga is determined by the tags that appear on it.

Here we show a histogram often a given tag appears in the dataset.

tag count histogram

The histogram represents the distribution of tag counts. We note that a few tags appear very frequently, while most tags appear very infrequently. If we take the log of the tag counts, we can see that the distribution looks more like a normal distribution. The raw aggregated data can be found in the following table:


name count
Romance 25827
Comedy 22401
Drama 19622
Slice of Life 13781
Fantasy 12470
Boys’ Love 11896
Action 11559
Adventure 7232
Girls’ Love 4989
Psychological 4780
Tragedy 4103
Mystery 4103
Historical 3744
Sci-Fi 3053
Horror 2802
Isekai 2112
Thriller 1892
Sports 1600
Crime 1156
Philosophical 841
Mecha 810
Wuxia 365
Superhero 359
Medical 342
Magical Girls 308


name count
School Life 13235
Supernatural 9179
Magic 2860
Martial Arts 2453
Harem 2389
Monsters 2262
Demons 1789
Crossdressing 1762
Survival 1730
Reincarnation 1708
Loli 1670
Office Workers 1459
Animals 1444
Military 1361
Incest 1238
Shota 1203
Video Games 1138
Delinquents 1081
Monster Girls 965
Genderswap 933
Time Travel 839
Ghosts 755
Cooking 755
Police 712
Vampires 622
Music 608
Mafia 586
Post-Apocalyptic 546
Aliens 538
Reverse Harem 459
Villainess 415
Gyaru 362
Samurai 343
Zombies 303
Virtual Reality 226
Ninja 215
Traditional Games 133


name count
Oneshot 14874
Doujinshi 11985
Web Comic 10328
Full Color 9369
Long Strip 9084
Adaptation 4225
Anthology 1664
4-Koma 1277
Official Colored 1211
Award Winning 692
User Created 531
Fan Colored 92


name count
Sexual Violence 2103
Gore 1783

We also take a look at how many tags each manga has.

tag count per manga histogram

Like before, we note that the distribution is heavy-tailed, and looks more normal when we take the log of the counts.

latent semantic indexing (LSI)

Latent semantic indexing (LSI) a traditional information retrieval technique used to find similar documents based on the frequency of words in the documents. It indexes into a latent (low-rank, probabilistic) space formed by the singular value decomposition (SVD) of a scaled document-term matrix. The LSI recommendation model is theoretically sound, and makes a good baseline for comparison.

document-term matrix and tf-idf

We generate a term-document matrix CC where each row is a manga, and each column is a tag count. We normalize the tag counts by term frequency-inverse document frequency (tf-idf) to account for the fact that some tags appear more frequently than others. This helps the relevance of tags that occur rarely, but form a coherent cluster in the data. The term frequency is the relative frequency of a term tt in a document dd:

tf(t,d)=ft,dtdft,dtf(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}

The inverse document frequency is a measure of how much information a term provides. If a term appears in more documents, then it is less informative and is therefore down-weighted.

idf(t,D)=logD{dD:td}idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}

Together, this gives us the tf-idf score:

tfidf(t,d,D)=tf(t,d)idf(t,D)tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)

This is a common weighting scheme that is used in information retrieval systems like Lucene and Elasticsearch.

indexing and querying with SVD

We could use the tf-idf vector representation of terms to find similar manga. However, this representation doesn’t take into account the relationships between terms. We can use the SVD to find a lower-dimensional representation (latent space) of the data that preserves the relationships between terms. This is a type of soft-clustering, where each manga is now represented as a weighted sum of membership in various tag clusters.

SVD decomposes a matrix MM into a left singular vector, a diagonal matrix of singular values, and a right singular vector.

M=UΣVTM = U \Sigma V^T

This is a linear transformation that finds a projection that preserves the largest variance in the data. To make a query into the latent space, we first transform a set of tags into a tf-idf query vector qq. We then map this query vector into the latent space by multiplying it by the left singular vector UU.

qk=Σk1UkTq\vec{q_k} = \Sigma_k^{-1} U_k^T \vec{q}

Then, we just find the nearest neighbors by finding the closest points by cosine distance.

word2vec embeddings

The word2vec model is a neural network that learns to represent words in a vector space. We use the word2vec model to represent tags in a vector space. We represent each manga as an average of its tag vectors, and use the cosine similarity between manga vectors to make recommendations.

There are two methods for training word2vec models: continuous bag of words (CBOW) and skip-gram. The skip-gram model is order-sensitive, and learns the probability of a word given its context. This is not ideal for our use case, because tags are not ordered. The PySpark implementation of word2vec relies on the skip-gram model, so we use gensim to train a CBOW word2vec model on the tag data. The set of tags on a manga is treated as a sentence in the model. We set the embedding size to a vector of size 16, which is justified by the square root of the number of tags rounded to the nearest power of 2. We visualize the tag embeddings using UMAP.

tag word2vec umap
visualization method:

The dimensions are meaningless, but the relative distances between tags are meaningful. We see meaningful clusters of tags, such as “isekai”, “villianess”, “reincarnation”, and “fantasy”.

network embeddings

We can also look at recommendations through a network representation.


We represent our network as matrices. We generate a bipartite adjacency matrix BB between manga and tags, where each row is a manga, and each column is a tag. We project this matrix into a single mode, to get a manga-manga network where two manga are connected by the dot product of their tags.


There are two representations that interest us: the adjacency matrix and the Laplacian matrix. The adjacency matrix has a row and column for each manga, where two manga are connected by the number of shared tags. The Laplacian matrix is the degree matrix minus the adjacency matrix.

L=DA L = D - A

The laplacian matrix is related to connectivity of the graph, and is often used for clustering.

removing indirect effects

We explore two techniques to remove indirect effects from the network, as described by this nature article (scihub link). These techniques were designed for a gene regulatory network inference task, but can be applied to our network. Our network construction is purely indirect, since it relies on paths through tags to connect manga. As we introduce derivative tags representing pairs and triplets of tags, we introduce even more indirect effects. Therefore, it might be interesting to determine the effects of removing indirect effects from the network.

Neither of these methods are tractable on the full 64k manga dataset, but it’d be interesting to try this on a smaller subset of the data in the future.

network deconvolution

Network deconvolution is a general method for removing indirect effects from a network. We use the following closed form equation to represent the transitive closure over the network.

Gobs=Gdir(IG_dir)1 G_{\text{obs}} = G_{\text{dir}} (I - G\_{\text{dir}})^{-1}

The network deconvolution is represented as follows:

Gdir=Gobs(I+G_obs)1 G_{\text{dir}} = G_{\text{obs}} (I + G\_{\text{obs}})^{-1}

Concretely, we perform the following operation, which rescales the eigenvalues found through the eigen decomposition of the adjacency matrix.

def network_deconvolution(G):
    """Deconvolve a network matrix using the eigen decomposition.
    # eigen decomposition
    lam, v = scipy.linalg.eigh(G)
    # rescale the eigenvalues
    # also add small value to avoid division by zero
    lam_dir = lam/(1+lam)
    # reconstruct the deconvolved matrix
    G_dir = v @ np.diag(lam_dir) @ v.T

    # remove diagonal, threshold positive values, and rescale in-place
    np.fill_diagonal(G_dir, 0)
    np.maximum(G_dir, 0, out=G_dir)
    G_dir /= np.max(G_dir)

    # NOTE: check reconstruction
    # G_dir = v @ np.diag(lam) @ v.T
    return G_dir

We find that finding the naive implementation using scipy’s eigen decomposition of the adjacency matrix is very slow. We left the deconvolution running for 8 hours and aborted the program due to the intractability. We note that there is an alternative reparameterization that uses gradient descent, but this is out of scale for this experiment.

We can also SVD of the adjacency matrix to find the deconvolution. Instead of rescaling the eigenvectors of the eigen-decomposition, we just find the inverse of the matrix through SVD by inverting the Σ\Sigma matrix. Both methods have the same big-O complexity of O(n3)O(n^3).

Truncating the results of the eigen decomposition or SVD leads to unstable results on small toy examples; the effects of approximating the matrix are uncertain in this case.

global silencing of indirect correlations

Global silencing of indirect correlations is another method for removing indirect effects from a network. The closed form of the silenced network where indirect effects are removed is given by:

S=(GI+D((GI)G))G1 S = (G - I + \mathcal{D}((G-I)G))G^{-1}

where D\mathcal{D} is an operator that sets everything but the diagonal to zero.

This is a closed form solution that can be computed relatively quickly given the inverse of the adjacency matrix. We can use the psuedo-inverse of the adjacency matrix by computing the truncated SVD of the adjacency matrix.

def network_silencing(G):
    """Remove indirect effects using a silencing method.
    # svd of G
    u, s, vh = scipy.linalg.svd(G)

    peturb = np.diag((G - np.eye(G.shape[0])) @ G)
    G_dir = (
        (G - np.eye(G.shape[0]) + np.diag(peturb))
        # multiply with pseudo inverse
        @ (u @ np.linalg.pinv(np.diag(s)) @ vh)

    # remove diagonal, threshold positive values, and rescale in-place
    np.fill_diagonal(G_dir, 0)
    np.maximum(G_dir, 0, out=G_dir)
    G_dir /= np.max(G_dir)

    # NOTE: check reconstruction
    # G_dir = u @ np.diag(s) @ vh
    return G_dir
comparisons on toy networks

First we take a look at the network on ring, where weights between successive nodes are 1, and weights between non-successive nodes are inversely weighted by the distance between the nodes (plus some random noise to break symmetry).

clique original

clique deconvolved

clique silenced

We see that network deconvolution will keep edges with nodes that it shares neighbors with. The network silencing instead removes weak edges in the original network.

The semantics of which method is better depends on the construction of the network. We take a look at another network, which is a bipartite network that we project onto one mode. We treat the two subsets of nodes and manga and tags, where manga are labeled 0-4 and tags labeled 5-7. There are node_id % len(tags) edges from each manga to each tag, which are weighted by the originating id of the node. When we project this, we get a network where each manga is connected to another manga through the number of tags that it shares. For example, nodes 2 and 4 share two neighbors, so they are connected by an edge that is the dot product of their edge weights.

bipartite original

bipartite projected

bipartite deconvolved

bipartite silenced

We then run network deconvolution and silencing on the projected network. We note note that the deconvolution method removes edges going through the most common tags, although the behavior is difficult to intuit. The silencing method retains the same structure as the projected network where nodes are connected by edges representing the number of paths between them through tags.

We also note in the context of a fully connected network with equal weights (a clique), the deconvolution method removes all edges while the silencing method retains all edges.


The spectral embedding model is a method for dimensionality reduction based on network connectivity. We use this to reduce the dimensionality of the network, and to ensure that modeling backends are represented using the same API.

approximate nearest neighbors

We make recommendations by finding the nearest neighbors of each manga in vector space.


NN-Descent is a fast approximate nearest neighbor algorithm that uses a graph to find the nearest neighbors of a point. pynndescent is an implementation of the algorithm that is suitable for our purposes. It supports a variety of distance metrics including cosine distances.

locality-sensitive hashing

Locality-sensitive hashing (LSH) is a fast approximate nearest neighbor algorithm that uses a hash function to find the nearest neighbors of a point. LSH is a potential alternative to nn-descent that’s implemented directly in PySpark. However, we are limited to MinHash and Random Projection methods, which only work for Jaccard and Euclidean distances.


jaccard index

The Jaccard index is a measure of similarity between two sets. It is defined as follows:

J(A,B)=ABAB J(A, B) = \frac{|A \cap B|}{|A \cup B|}

null model

The null model is a baseline for evaluating the performance of a recommendation model. We create a null model by selecting random recommendations for each manga. The null model takes in no information about the data. We then compare the performance through a statistical significance test. We consider using the jaccard R package to perform the tests.


Personalization is one future direction we could take in developing a recommendation model. We might achieve this by incorporating a weighted average of a query vector with a user’s personalization vector. For example, a user is known to like manga X, Y, and Z. We can represent this preference using the weighted average of tag vectors to form a personalization vector. Then when we query the database for similar manga to a query manga, we also average in user preferences.