So far, we've discussed the cardinalities and distributions that characterize the data. Here, we'll assess the task at hand in terms of the computational workload relative to the resources at our disposal. To estimate resource requirements, let's start with some measurements. First, consider the resources available. So far, we've been using a single m4.2xlarge Amazon EC2 instance. Let's decode that quickly. EC2 is Amazon's Elastic Compute Cloud. Each instance is a virtual server with dedicated CPU, random access memory (RAM), and disk or solid-state storage. The m4.2xlarge designation means a server with eight cores and 32 GB of memory. Disk space is provisioned separately; our single instance has 1 terabyte of Elastic Block Store (EBS). EBS is virtualized storage, set up so that the instance appears to have a dedicated 1 TB disk volume. We've set up our instance to run Linux. Depending on our needs, we can easily upgrade the single instance to add cores or memory, or provision more instances.
Next, let's look at the workload. The raw data resides in transaction files on Amazon's Simple Storage Service (S3), which is designed to store large quantities of data inexpensively, but access is much slower than a local disk. Each file contains around 1 million records. We can read approximately 30,000 records per second from S3, so processing 10 billion records one at a time would take about 92 hours. Downloading from S3 can be sped up by around 75% by running multiple downloads in parallel (on a single instance), which gets us down to about 23 hours.
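The parallel-download pattern is straightforward to sketch. The snippet below is a minimal illustration only: `fetch_file` and the key names are hypothetical stand-ins for a real S3 client call (such as boto3's `get_object`), and the point is simply that a thread pool overlaps the network waits.

```python
# Sketch: overlapping downloads with a thread pool. Threads suit this
# workload because it's I/O-bound: while one thread waits on the network,
# the others make progress.
from concurrent.futures import ThreadPoolExecutor

def fetch_file(key):
    # Hypothetical placeholder for a real S3 read; a real version would
    # stream the object and parse its ~1 million records.
    return f"records from {key}"

keys = [f"transactions/part-{i:05d}" for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map preserves input order, so results line up with keys.
    results = list(pool.map(fetch_file, keys))
```

With genuinely network-bound reads, eight concurrent downloads can plausibly deliver the kind of 75% wall-clock reduction mentioned above.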
But speed isn't the only problem. Based on an earlier observation that 10 million records loaded into memory consume 5.3% of our 32 GB of memory (about 1.7 GB), loading the entire 10-billion-record dataset would take roughly 1.7 terabytes of memory. Amazon doesn't offer an instance with that much RAM.
Fortunately, we don't need all the data in memory. Furthermore, the requirement isn't just a function of the size of the data but of its shape, by which we mean the cardinality of its primary keys. There are 10 billion records, but only about 10 million users and around 300,000 pubs, which means the user/pub matrix has about 3 trillion entries. But when the sparse matrix is populated, only about 0.01% of the cells contain values, so 3 trillion is reduced to about 300 million. Assuming one 64-bit floating-point number per value, the user/pub matrix fits in about 2.5 GB of our 32 GB of memory.
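The arithmetic behind that estimate is worth making explicit. This back-of-envelope check covers the stored values only; a real sparse format (CSR, for example) also stores index arrays, which is roughly why the in-memory figure lands a bit above the raw 2.4 GB.

```python
# Back-of-envelope sizing of the sparse user/pub matrix from the text.
users = 10_000_000
pubs = 300_000
dense_cells = users * pubs        # ~3 trillion potential entries
nonzero = dense_cells // 10_000   # only ~0.01% of cells hold a value
bytes_per_value = 8               # one 64-bit float per value
gigabytes = nonzero * bytes_per_value / 1e9
print(gigabytes)                  # 2.4
```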
To cut processing time, we need to do things in parallel. The figure below illustrates using worker nodes (additional EC2 instances, in this case) to ingest the raw data in parallel.
The worker nodes do more than read the data from S3. Each one independently builds a sparse matrix of users and items. When all the workers have finished, these partial matrices are combined on the compute node.
Earlier on, some big-data technologies were described: Hadoop, MapReduce, and Apache Spark. The processes described here are a highly simplified version of what happens in a MapReduce job. A large task is broken into small units, each of which is dispatched (mapped) to a worker. As workers complete their subtasks, the results are combined (reduced), and that result is returned to the requestor. Hadoop optimizes this process in several ways. First, rather than having the workers retrieve data over a network, each worker node stores part of the data locally. Hadoop optimizes the assignment of tasks so that whenever possible, each node works on data that’s already on a local volume. Spark goes one step further by having the worker nodes load the data into memory so they don’t need to do any I/O operations in order to process the tasks they’re assigned.
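The map/reduce pattern itself fits in a few lines. This toy sketch (invented for illustration, not part of the chapter's workflow) dispatches chunks to workers and reduces their partial results, standing in for what Hadoop does at scale.

```python
# Toy map/reduce: subtasks are "mapped" to workers and the partial
# results are "reduced" into one answer.
from functools import reduce
from multiprocessing.dummy import Pool  # thread-backed pool, fine for a demo

def count_records(chunk):
    # Map step: each worker summarizes its own slice of the data.
    return len(chunk)

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
with Pool(3) as pool:
    partials = pool.map(count_records, chunks)

# Reduce step: combine the workers' partial results.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 9
```

Hadoop's data-locality optimization and Spark's in-memory caching change where and how the map step runs, but not this basic shape.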
Although this example problem is large enough to require a little parallel processing, it's probably not worth the effort to implement one of these frameworks. We need to run the entire workflow only once per day, and we could easily add a few more instances and get the whole process down to an hour or less. But it's easy to imagine an application that needs to run a variety of processes more frequently, where having the worker nodes retain the raw data in memory across many processing cycles would boost performance by orders of magnitude.
The goal of the model is to predict CTR for each pub. We started with user interactions as features and used SVD to reduce the feature space. From here, there are several approaches to making predictions. The first will be a k-nearest neighbours (KNN) model, a simple but surprisingly effective recommender model.
We'll also train a random forest regressor. Random forests are a form of decision-tree-based learning: many random samples of the data and random subsets of the feature set are selected, and a decision tree is constructed for each selection.
The figure below shows simplified user/item and dissimilarity matrices. The diagonal of the dissimilarity matrix is all zeros because each pub's user vector (a column in the user/item matrix) is identical to itself, and therefore at zero distance from itself. The distance between pub3, pub4, and pub7 is zero, as we'd expect, because their respective columns in the user/item matrix are identical. Also note that pub1's distance to pub5 is the same as pub5's distance to pub1; in other words, dissimilarity is symmetric. Interestingly, some recommender algorithms don't define distance symmetrically: item A may be like item B, yet item B isn't like item A.
We compute the similarity (actually, dissimilarity, or distance) between each pair of pubs using one of several available measures; here we choose the most common, Euclidean distance.
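As a sketch of this step, the snippet below computes pairwise Euclidean distances between the columns of a tiny made-up user/item matrix using scikit-learn's `pairwise_distances`. The matrix values are invented for illustration; the real matrix would be the sparse user/pub matrix.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Rows are users, columns are pubs. Transpose so each row passed to
# pairwise_distances is one pub's user vector.
user_item = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)

dist = pairwise_distances(user_item.T, metric="euclidean")
# dist[i, j] is the distance between pub i and pub j: the diagonal is
# (near-)zero and the matrix is symmetric, matching the figure's properties.
```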
After we've computed pairwise distances, the next step is to compute the predicted CTR for each pub. In KNN, the predicted target value is calculated by averaging the target values of the k nearest neighbours, on the presumption that each observation will be most similar to its nearest neighbours. Several important questions arise at this juncture. First, what value should we choose for k? How many neighbours should be considered? Also, it's common to give greater weight to the closest neighbours, usually by weighting the calculation of the mean target value by 1/distance or 1/distance².
The listing below shows a calculation of predicted values for a range of possible values of k using scikit-learn's NearestNeighbors. Here we try three weighting formulas, each for 20 values of k. Figure 10.6 shows that the best predictors are one or two nearest neighbours, and that averaging over a larger range offers no real improvement. This is probably because our data is sparse, so even the nearest neighbours are often fairly distant. Note that the variation over the values of k is also small. In any case, the normalized RMSE for our test-set predictions is in the range of 5%. Not bad!
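A hypothetical reconstruction of that listing follows. Synthetic features and CTR-like targets stand in for the real SVD-reduced data, and `KNeighborsRegressor` (which wraps the same neighbour search as `NearestNeighbors` in a regression interface) is used so the weighted averaging comes for free; the `inv_sq` helper implementing 1/distance² weighting is our own addition.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))       # stand-in for SVD-reduced features
y = 1 / (1 + np.exp(-X[:, 0]))       # synthetic CTR-like target in (0, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def inv_sq(dist):
    # Weight neighbours by 1/distance**2, guarding against zero distance.
    return 1.0 / np.maximum(dist, 1e-9) ** 2

# Sweep 20 values of k for three weighting schemes: uniform mean,
# 1/distance, and 1/distance**2.
rmse = {}
for weights in ["uniform", "distance", inv_sq]:
    name = getattr(weights, "__name__", weights)
    for k in range(1, 21):
        model = KNeighborsRegressor(n_neighbors=k, weights=weights)
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rmse[(name, k)] = float(np.sqrt(np.mean((pred - y_te) ** 2)))
```

Plotting `rmse` against k for each scheme would reproduce the kind of comparison Figure 10.6 shows.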
In the training phase of random forests, the data is sampled repeatedly, with replacement, in a process called bagging (bootstrap aggregating). For each sample, a decision tree is constructed using a randomly selected subset of the features. To make predictions on unseen data, each decision tree is evaluated independently, and the results are averaged (for regression) or each tree "votes" (for classification). For many applications, random forests may be outperformed by other algorithms such as boosted trees or support vector machines, but random forests have the advantages that they're easy to apply, their results are easy to interpret and understand, and the training of many trees is easily parallelized (see figure below).
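A minimal sketch of these ideas, on synthetic data, using scikit-learn's `RandomForestRegressor`: the constructor arguments map directly onto the description above, and `n_jobs=-1` exercises the easy parallelism of tree training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))                        # made-up features
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=400)  # made-up target

forest = RandomForestRegressor(
    n_estimators=100,     # number of bootstrap samples / trees (bagging)
    max_features="sqrt",  # random feature subset considered at each split
    n_jobs=-1,            # train trees in parallel across cores
    random_state=0,
).fit(X, y)

# For regression, the forest's prediction is the average of its trees'
# independent predictions.
avg_of_trees = np.stack(
    [tree.predict(X[:5]) for tree in forest.estimators_]
).mean(axis=0)
print(np.allclose(avg_of_trees, forest.predict(X[:5])))
```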
The optimized random forest regression provides a useful prediction of CTR, but it's not as good as the KNN prediction. Our next step might be to explore ways to combine these, and possibly other, models. Methods that combine models in this way are called ensemble methods. Random forests are themselves an ensemble method, as bagging is a way of generating multiple models. To combine entirely different models such as the two in this example, we might employ stacking, or stacked generalization, in which the predictions of multiple models become features that are combined by training and predicting with yet another ML model, often logistic regression.
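One way to sketch stacking the two models from this chapter is scikit-learn's `StackingRegressor`, which feeds each base model's out-of-fold predictions to a final estimator. The data here is synthetic, and a linear meta-model stands in for logistic regression since our target (CTR) is continuous.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                                 # made-up features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300) # made-up target

stack = StackingRegressor(
    estimators=[
        # The two base models from this chapter's example.
        ("knn", KNeighborsRegressor(n_neighbors=2, weights="distance")),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    # The meta-model combines the base models' predictions as features.
    final_estimator=LinearRegression(),
)
stack.fit(X, y)
preds = stack.predict(X[:3])
```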
Other real-world considerations
We've discussed real-world issues that come with big data: high dimensionality, computing resources, storage, and network data-transfer constraints. As mentioned briefly, the entire process may be replicated for several species of digital ads: mobile, video, and native. Real-time bidding and user-level personalization bring an entirely different set of concerns. The data at our disposal may vary widely from one program to the next, and models that work perfectly in one situation may fail entirely in another.
In our example, we had a large historical dataset to start with. But our recommender-like approach has an issue known as the cold-start problem: when a new user or a new product enters the system with no history to rely on, we have no basis for building associations. For our purposes, a few unknowns don't matter, but when a new campaign starts from scratch, we have no history at all to work with. Models built on other, similar campaigns may or may not be effective.
In the real world, there’s a great advantage to having a variety of tools and models that can be employed. The larger and more complex the environment, the greater the benefit of having such a suite of feature-building, data-reduction, training, prediction, and assessment tools well organized and built into a coherent automated workflow.
Advertising is a great example of a business in which externalities may diminish the effectiveness of predictive models. As technology and business practices change, behaviours change. The growth of mobile devices has changed the digital landscape dramatically. Real-time bidding completely changes the level at which optimization is applied. New forms of fraud, ad blockers, new browsers, and new web technology all change the dynamics being modeled. In the real world, models are built, tested, deployed, measured, rebuilt, retested, redeployed, and measured again.
Digital advertising is a multibillion-dollar business, and for the brands that rely on it, optimizations that reduce wasted expenditure, even a little, can have a significant return on investment. Each wasted impression we eliminate saves money, but when it's replaced with one that gains a customer, the benefit far exceeds the cost savings, and more than justifies the effort to overcome the many challenges of this dynamic business.