We’ll start with a sample of 9 million observations, small enough to fit into memory for some quick calculations of cardinality and distributions.

Fortunately, most users never visit most of the domains, so the user/item matrix is sparsely populated, and there are tools at one’s disposal for dealing with large, sparse matrices. And nobody said that users and domains must be the rows and columns of a gigantic matrix, but it turns out that some valuable algorithms work exceptionally well when it’s possible to operate on a user/item matrix in memory.
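As a sketch of that sparse representation, here is how a user/item matrix might be built with SciPy’s `csr_matrix`. The index arrays below are hypothetical; in practice they would come from integer-encoding `user_id` and `pub_domain`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical index-encoded impressions: row i = user, column j = pub.
user_idx = np.array([0, 0, 1, 2, 2, 2])
pub_idx = np.array([0, 1, 0, 0, 2, 2])
vals = np.ones(len(user_idx))

# Duplicate (user, pub) pairs are summed, so cells hold impression counts.
A = csr_matrix((vals, (user_idx, pub_idx)), shape=(3, 3))
print(A.toarray())
```

Only the nonzero cells are stored, which is what makes a matrix of millions of users by thousands of domains tractable in memory.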

The 9 million observations referenced above represent roughly 0.1% of the data. Ultimately, one needs to process roughly 10 billion impressions, and that’s just one week’s worth of data. Loading the 9 million impressions consumed about 53% of the memory on an Amazon Web Services (AWS) instance with 32 GB of RAM.

To see how the data is distributed over the categorical variables, we start by computing the cardinality of pub_domain and user_id.
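With the sample in a pandas DataFrame, cardinality is a one-liner per column. The tiny frame below is a stand-in for the 9-million-row sample; the column names match those in the text:

```python
import pandas as pd

# Hypothetical miniature of the impression sample.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2", "u1"],
    "pub_domain": ["a.com", "a.com", "b.com", "c.com", "a.com", "a.com"],
})

# Cardinality: the number of distinct values in each categorical column.
n_users = df["user_id"].nunique()
n_pubs = df["pub_domain"].nunique()
print(n_users, n_pubs)
```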

The figure above shows that many domains have a small number of impressions, and a few have large numbers of impressions. In order to show the distribution graphically, we plotted the base 10 log rather than the raw frequencies (the x-axis can be thought of as 10^0, 10^1, 10^2, …).

(The histogram of impression data shows that the distribution of the number of impressions over publisher domains is heavily skewed.)
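A log-scaled frequency histogram of this kind can be computed as below. The counts here are hypothetical, chosen only to mimic the heavy skew described in the text:

```python
import numpy as np

# Hypothetical impressions-per-domain counts (heavily skewed, as in the text).
counts = np.array([1, 1, 2, 3, 5, 8, 40, 900, 12000])

# Histogram the base-10 log of the frequencies rather than the raw counts;
# the x-axis then reads as 10^0, 10^1, 10^2, ...
log_counts = np.log10(counts)
hist, edges = np.histogram(log_counts, bins=5)
print(hist)  # most of the mass sits in the lowest bins: many small domains
```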

Clicks are relatively rare: only 0.12% of impressions, or a rate of 0.0012. This is a respectable overall click-through rate, but it means large datasets are needed in order to have enough target examples to build the model. This isn’t unusual. We’re often trying to predict relatively rare phenomena. The capacity to process huge datasets by using big-data technologies has made it possible to apply machine learning to whole new classes of problems.
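The arithmetic behind “rare target, big data” is simple to check. The simulation below is a sketch, assuming clicks are independent Bernoulli draws at the stated rate:

```python
import numpy as np

# Simulate 1M impressions with a hypothetical 0.12% click probability.
rng = np.random.default_rng(0)
clicks = rng.random(1_000_000) < 0.0012

ctr = clicks.mean()
n_positives = int(clicks.sum())
print(f"CTR: {ctr:.4%}, positives: {n_positives}")
# Even a million impressions yields only on the order of a thousand clicks,
# which is why modeling a rare target demands so much raw data.
```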

Similarly, impression frequency by user_id is highly skewed. An average user has 2.46 impressions, but the median is 1, so a few heavy hitters pull the mean higher.
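The mean-versus-median gap is the classic signature of a skewed distribution. A small synthetic example (the counts are hypothetical, not the real per-user data):

```python
import numpy as np

# Hypothetical impressions-per-user counts: most users appear once,
# while a few heavy hitters pull the mean well above the median.
per_user = np.array([1] * 90 + [2] * 5 + [10, 25, 40, 60, 100])

print(np.mean(per_user), np.median(per_user))
```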

Singular value decomposition

If we look at each user as a feature of the publications they’ve interacted with, we have approximately 3.6 million features per publication, roughly 150 billion values for the exploratory sample of data. Obviously, we’d like to work with fewer features, and fortunately PCA makes that possible.

As it turns out, PCA has several algorithms, one of which is singular value decomposition, or SVD. SVD can be explained and interpreted mathematically in various ways, and mathematicians will recognize that our explanation here leaves out some of the beauty of the underlying linear algebra. Fortunately, SVD has an excellent implementation in the scikit-learn Python library. But this time, let’s do just a little bit of the matrix algebra, where dimensions are important. If A[n x p] denotes an n-by-p matrix, A can be multiplied by another matrix whose dimensions are p by q (for example, B[p x q]), and the result will have dimensions of n by q (say, C[n x q]). It turns out that any matrix can be factored into three components, called the left singular vectors, the right singular vectors, and the singular values.

In this example, n is the number of users, each of which is represented by a row in matrix A, and p is the number of pubs, each of which is represented by a column:
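In the notation of the paragraph above, the factorization can be written as:

```latex
A_{[n \times p]} = U_{[n \times n]} \, \Sigma_{[n \times p]} \, V^{T}_{[p \times p]}
```

where the columns of U are the left singular vectors, the rows of V^T are the right singular vectors, and Σ holds the singular values on its diagonal. In the truncated form used later, with only the k largest singular values kept, U is n by k, Σ is k by k, and V^T is k by p.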

What makes this interesting is that the singular values tell us something about the importance of the features represented by the left and right singular vectors (the vectors are the rows of U and VT). In particular, the singular values show the extent to which the corresponding feature vectors are independent. Consider the implication of interdependent or covariant features. Or to make it easier, imagine that two features, A and B, are identical. After feature A has been considered by the model, feature B has nothing to contribute; it contains no new information. As builders of predictive models, we want features that are independent, with each one at least a weak predictor of the target. If there are many weak predictors, so long as their predictions are better than random, in combination they gain strength. But this phenomenon, the ensemble effect, works only when features are independent.

Let’s run SVD on our advertising data and have a look at the resulting singular values.

When running SVD, we used the k parameter (the maximum number of singular values) to limit the calculation to the 1,550 largest singular values. The figure below shows their magnitudes: about 1,425 are nonzero, and beyond the 450 most independent feature vectors, the rest are highly covariant. This isn’t surprising. Although there are over 3 million users, most of them interact with very few pubs. Consider that of these, 136,000 were observed exactly once (on ebay.com, by the way). So if each user vector is a feature of the pub, ebay.com has 136,000 features that are identical.
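A truncated SVD of this kind can be computed with SciPy’s `svds`, which works directly on a sparse matrix. The matrix below is a small random stand-in for the real user/pub matrix, and k is scaled down from the 1,550 used in the text:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the sparse user/pub matrix.
A = sparse_random(500, 200, density=0.01, random_state=0, format="csr")

k = 50
U, s, Vt = svds(A, k=k)  # the k largest singular values
s = s[::-1]              # svds returns them in ascending order

# Count the values that are effectively nonzero relative to the largest;
# the drop-off in s is what reveals how covariant the features are.
effective = int(np.sum(s > 1e-10 * s[0]))
print(effective, "of", k, "singular values are effectively nonzero")
```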

Our SVD reduced more than 3 million features to around 7,000, roughly a 400:1 reduction. Knowing this, we have a much better sense of the resources that will be needed.