Nick Greenquist

Building a Two-Tower Deep Learning Movie Recommender System in Pytorch (from scratch)

2024-02-04T12:05:14+00:00

Introduction

Recommender systems play a crucial role in content-serving websites such as TikTok, Amazon, Netflix, YouTube, etc, by effectively showing users relevant content. They also make these companies billions of dollars: 35% of Amazon.com’s revenue is generated by its recommendation engine.

What is a recommender system?

In my own words: a recommender system’s job is to match users to content. They score/rank which content to show based on some pre-determined metric to optimize (i.e. what content will the user find most relevant or engaging). Recommendation Systems exist because many platforms have thousands (or millions) of ‘things’ they could potentially show their users, and screen real estate is extremely valuable, so they help the platforms show only what they think the user will like. Without them, it would also be impossible for the user to sift through millions of options.

Use Cases

Recommender systems are used to decide:

What movie to watch next on Netflix
What product to buy next on Amazon
What song to listen to next on Spotify
What video to watch next on YouTube
What post to see next in the Instagram feed
What vacation to book next on Booking.com

Sneak Peek of this Post

Just to give an idea of what we are going to build here, it’s going to be a movie recommendation system that can give us recommendations for users like:

TLDR: Please follow this link to go straight to the Colab notebook with the PyTorch code discussed in this post.

Why this Post

There are already hundreds (probably thousands) of posts/tutorials/follow-alongs on ‘how to build a movie recommender system’. However, most of them are really really not useful. 90% of them just read in the dataset, perform cosine similarity on some vectors made out of the users and movies, show off some recommendations that make little sense, and call it a day. 10% will actually use Machine Learning, and of these, 90% will just do some variation of user-to-item matrix factorization and stop after they spit out some loss metrics (i.e. they won’t even show how it can be used).

Don’t get me wrong, MF is a great approach to recommendation systems (I helped create a book recommendation system using it, and even helped write an entire CUDA library to parallelize it), but it’s not ‘state of the art’ anymore. The main drawback of basic MF is that it does not incorporate rich user/item features in how it learns to predict ratings/interactions.

For even more information on MF, please look at these blog posts for a technical explanation of Matrix Factorization and an application of Matrix Factorization for book recommendation.

Deep Recommender Systems

The main approach I wanted to focus on is the ‘Two Tower’ Deep Learning Architecture. The idea is this: you want to recommend items (i.e. movies, products, songs, etc.) to users. You have features for users (what they have already watched/bought/clicked/etc., demographic information, etc.) and features for items (the genre of your movie/song, the artists/actors, the year, etc.). Can we shove all these features into a Neural Net and get good recommendations (hint: yes)?

There are some great tutorials on how to build modern day model architectures (including Two Tower models) to train a recommendation system model. Here is a list of a few of them:

BUT: they all share a serious drawback: NONE show you how to ‘actually’ use these things. It’s all ‘left to the reader as an exercise.’

The Core Issue

All of the above examples (and others I could find) use the traditional approach of embedding the unique user ids (or hashes of something unique like a username) and movie ids (like an Amazon product id) to train the model. However, this means that you can ONLY perform inference for a user or item that has been trained. If you want to run inference on a user id that you did not train on, you won’t have an embedding for them and are out of luck. If a new user wanted to use this model, your only solutions are:

Retrain the entire model
Try to partial fit your new user into the model with some training steps
Find a user that is close to the new user and use their embedding
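To make the limitation concrete, here is a toy sketch (the table, the ids, and the helper name are made up for illustration and are not from the notebook). An id-keyed embedding table simply has no row for a user it never saw during training:

```python
# Hypothetical id-keyed embedding table, as it might look after training.
user_id_to_embedding = {
    101: [0.12, -0.40, 0.33],
    102: [-0.25, 0.08, 0.91],
}

def lookup_user_embedding(user_id):
    """Return the trained embedding, or None if the user was never trained on."""
    return user_id_to_embedding.get(user_id)

known = lookup_user_embedding(101)    # a user seen during training
unknown = lookup_user_embedding(999)  # a brand new user: nothing to look up

print(known)
print(unknown)
```

Every workaround in the list above is a way of manufacturing that missing row after the fact.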

The Goal

Instead, I wanted to set out to build a model that can generalize to any user, as long as you provide even a few examples of items they already enjoyed (i.e. when you sign up to Nick-flix Movie Streaming, you click on a few movies you like). The model should embed features of the user, not the unique user themselves. This is different than most other tutorials in that we will NOT map each user id to an embedding. Instead, for our simple example, we will treat each user as a feature vector made up of only two pieces of information:

Their Watch History: list of movies they have liked/disliked
Their Genre Preferences: the average rating for each possible genre

This way, after the model is trained, it can be used to get recommendations for any user as long as we have even a few movies they like (and perhaps some genres they prefer or don’t prefer).

Benefits

There are a few benefits to this approach:

Do not need to retrain model as often
Model is more generalizable as it cannot just memorize labels for specific users
User level cold start is much less of an issue

A Note about Item Cold Start

Removing the user id from the model helps generalize to more users, and also reduces user cold start. But what about item cold start (new item, so no item id in the model)? To get around this, it’s also possible to remove the item id from the model input and only use item features as inputs.

This post does a great job explaining how having NO ids helps with cold start: Solving the Cold-Start Problem using Two-Tower Neural Networks for NVIDIA’s E-Mail Recommender Systems

As with all things, there are tradeoffs. By removing the item id, you will lose a lot of rich information about how users interact with unique items. However, you massively reduce cold start.

Whether it is better to keep or remove ids depends on the domain: Amazon has millions of products and probably thousands added each day that need to be recommended. They could benefit from removing unique item ids from their models. Netflix, on the other hand, might only have a few dozen movies added per month to their catalog. They might want to keep the movie id and retrain their model more frequently.

Another possibility is to train a normal model with id embeddings, and then a second model with just features. You can then mix and match the results of both.
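One simple way to ‘mix and match’ is a weighted blend of the two models’ scores, falling back to the feature-only model when the id-based model has never seen the item. The sketch below is only an illustration of that idea: the function name, the `alpha` weight, and the score dictionaries are all hypothetical.

```python
def blend_scores(id_model_scores, feature_model_scores, alpha=0.5):
    """Blend per-item scores from two models.

    id_model_scores: dict item -> score from the id-embedding model
                     (no entry for cold-start items).
    feature_model_scores: dict item -> score from the feature-only model.
    alpha: weight given to the id-embedding model when it has a score.
    """
    blended = {}
    for item, feature_score in feature_model_scores.items():
        if item in id_model_scores:
            blended[item] = alpha * id_model_scores[item] + (1 - alpha) * feature_score
        else:
            # cold-start item: only the feature model can score it.
            blended[item] = feature_score
    return blended

id_scores = {"movie_a": 4.0, "movie_b": 2.0}
feature_scores = {"movie_a": 3.0, "movie_b": 3.0, "movie_new": 3.2}

blended = blend_scores(id_scores, feature_scores, alpha=0.5)
ranked = sorted(blended, key=blended.get, reverse=True)
print(ranked)
```

In practice the blend weight would need to be tuned on validation data; 0.5 here is an arbitrary placeholder.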

Building a Movie Recommender

Alright, let’s start building our recommendation system!

The Dataset

In order to build a movie recommendation system, we are going to use the MovieLens Dataset, which is provided by GroupLens.

In particular, we are going to use two datasets, one small and one large:

MovieLens Small - 100,000 ratings, 9,000 movies, 600 users.
MovieLens Latest - 33,000,000 ratings, 86,000 movies, 330,975 users.

The small dataset does not lead to good results, but it is better to use while building and testing our code.

We can easily read this data into Pandas Dataframes like this:

import pandas as pd

df_ratings = pd.read_csv('ratings.csv')
df_movies = pd.read_csv('movies.csv')

The data consists of the two Pandas Dataframes we just read in:


Ratings Table


Movies Table

Data Preprocessing

First, we need to clean the data as any ‘nan’ values can completely ruin our training (I spent many hours debugging my model, adding gradient clipping, batch norm, etc. Turns out there is a single nan value in MovieLens). We also convert movie ids to ints so they behave better as lookup keys.

# clean the ratings data
df_ratings = df_ratings.dropna()
df_ratings['movieId'] = df_ratings['movieId'].astype(int, copy=False)

Next, let’s shrink down how many movies we care about.

NOTE: This is just for memory reasons. Google Colab only gives you 12GB of RAM, so I can’t create embeddings and feature vectors for all 50k+ movies.

# let's only work with movies with enough ratings.
min_ratings_per_movie = 1000

# get the number of ratings per movie
df_movies_to_num_ratings = df_ratings.groupby('movieId', as_index=False)['rating'].count()
print("total movies in corpus: ", len(df_movies_to_num_ratings))

df_movies_to_num_ratings = df_movies_to_num_ratings.sort_values(by=['rating'], ascending=False)
df_movies_to_num_ratings = df_movies_to_num_ratings[df_movies_to_num_ratings['rating'] > min_ratings_per_movie]
print("movies with enough ratings: ", len(df_movies_to_num_ratings))

# get list of the top movies by number of ratings.
top_movies = df_movies_to_num_ratings.movieId.tolist()

# OUTPUT
# total movies in corpus:  58136
# movies with enough ratings:  2071

Movie Feature Preprocessing

Let’s start processing some important info we need for our movies.

First, let’s create a simple map to hold how many ratings each movie has.

# keep a map of movieId to number of ratings.
movieId_to_num_ratings = {}
movieId_list = df_movies_to_num_ratings.movieId.tolist()
rating_list = df_movies_to_num_ratings.rating.tolist()
for i in range(len(movieId_list)):
  movieId_to_num_ratings[movieId_list[i]] = rating_list[i]

Next, we reduce our Ratings Dataframe to get rid of any movies we don’t care about.

# reduce our df_ratings Dataframe to only rows that are top movie (to speed up later cells).
df_ratings_final = df_ratings[df_ratings.movieId.isin(top_movies)]

Next, let’s make a helpful map from each movie id to its title.

# map movieId to title
movieId_to_title = {}
title_to_movieId = {}

movieId_list = df_movies.movieId.tolist()
title_list = df_movies.title.tolist()

for i in range(len(movieId_list)):
  movieId = movieId_list[i]
  title = title_list[i]

  movieId_to_title[movieId] = title
  title_to_movieId[title] = movieId

Let’s take a look at our top movies

for movieId in top_movies[0:10]:
  print(movieId, movieId_to_title[movieId], movieId_to_num_ratings[movieId])

movieId	title	num_ratings
318	Shawshank Redemption, The (1994)	36414
356	Forrest Gump (1994)	33846
296	Pulp Fiction (1994)	32440
2571	Matrix, The (1999)	31830
593	Silence of the Lambs, The (1991)	30452

Next, let’s map each movie to a set() of its genres. These will be used as item features for our model (i.e. a movie will be represented as a vector of its genre information).

# map movieId to list of genres for that movie
genres = set()
movieId_to_genres = {}

movieId_list = df_movies.movieId.tolist()
genre_list = df_movies.genres.tolist()

for i in range(len(movieId_list)):
  movieId = movieId_list[i]
  if movieId not in top_movies:
    continue

  movieId_to_genres[movieId] = set()

  for genre in genre_list[i].split('|'):
    genres.add(genre)
    movieId_to_genres[movieId].add(genre)

Let’s print out an example of a movie’s genres:

print(movieId_to_genres[title_to_movieId['Matrix, The (1999)']])
# OUTPUT: {'Action', 'Sci-Fi', 'Thriller'}

Next, let’s get the average rating of every movie. This is helpful later as we want to make sure our model isn’t just recommending the most popular movies.

# for every movie, get the avg rating
df_movies_to_avg_rating = df_ratings_final.groupby('movieId', as_index=False)['rating'].mean()

movieId_to_avg_rating = {}

movieId_list = df_movies_to_avg_rating.movieId.tolist()
rating_list = df_movies_to_avg_rating.rating.tolist()
for i in range(len(movieId_list)):
  movieId_to_avg_rating[movieId_list[i]] = rating_list[i]

Movie Feature Vocab

This is a pretty important part of the data prep. Setting up a feature vocab is needed to correctly map movie ids and movie features (i.e. only genres in our case) to the correct indices in input feature vectors.

Below, we map each unique movie id we have in top_movies to a unique index i. This will allow us to look up this movie’s embedding efficiently.

# build ITEM movieId embedding mapping
item_emb_movieId_to_i = {s:i for i,s in enumerate(top_movies)}
item_emb_i_to_movieId = {i:s for s,i in item_emb_movieId_to_i.items()}

Below, we map each unique genre to an index i that will be used to set each movie’s genres in vector form. For example, if we had 3 genres, ‘Action’, ‘Horror’, and ‘Comedy’, we could map ‘Action’ to index 0, ‘Horror’ to index 1, and ‘Comedy’ to index 2. Therefore, the genre vector representation of an ‘Action Comedy’ movie would be [1, 0, 1].

# build ITEM genre feature context
genre_to_i = {s:i for i,s in enumerate(genres)}
i_to_genre = {i:s for s,i in genre_to_i.items()}
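Here is the three-genre example from the text as a small runnable sketch. The fixed `genre_order` list and the helper name are assumptions made for illustration; the notebook builds its index from a set instead.

```python
# Fixed genre order, matching the example in the text (an illustrative assumption).
genre_order = ["Action", "Horror", "Comedy"]
genre_to_index = {genre: i for i, genre in enumerate(genre_order)}

def genres_to_vector(movie_genres):
    """Return a multi-hot vector with 1.0 at the index of each genre the movie has."""
    vector = [0.0] * len(genre_to_index)
    for genre in movie_genres:
        vector[genre_to_index[genre]] = 1.0
    return vector

print(genres_to_vector({"Action", "Comedy"}))
# → [1.0, 0.0, 1.0]
```

One design note: the iteration order of a Python set of strings is not guaranteed to be the same from one run to the next, so if you plan to save a trained model and reload it in a different session, building the index from a sorted or otherwise fixed list keeps each genre at the same position.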

User Feature Preprocessing

Every user will have a feature context that will mostly be their watch history. Instead of using every movie in the corpus, we can use a smaller subset. This also helps with memory issues.

num_movies_for_user_context = 250
user_context_movies = top_movies[:num_movies_for_user_context]

Next, let’s simplify our Ratings dataframe so we can much more efficiently iterate over it and create training examples.

# aggregate dataframe down into one row per user and list of their movies and ratings.
df_ratings_aggregated = df_ratings_final.groupby('userId').agg({'movieId': lambda x: list(x), 'rating': lambda y: list(y)}).reset_index()

The above code looks complicated, but actually all it is doing is finding all the rows where a userId equals a unique value, and collapsing all the values in the movieId and rating column into a list. We do this because dataframes are extremely inefficient to iterate over. We need to do this aggregation to get all the ratings for a user, so the more we can do inplace in the Dataframe, the better.

The Dataframe now looks like this (and has one row per user):


Aggregated Ratings Table
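If the groupby/agg call feels opaque, the same collapse can be written in plain Python. This is only a sketch on made-up rows (the tuples below stand in for the ratings table), not a replacement for the Pandas version:

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows standing in for the ratings table.
rows = [
    (1, 318, 5.0),
    (2, 356, 3.0),
    (1, 296, 4.5),
    (2, 318, 4.0),
]

user_to_movies = defaultdict(list)
user_to_ratings = defaultdict(list)
for user_id, movie_id, rating in rows:
    # collect every movie and rating under its user, preserving row order.
    user_to_movies[user_id].append(movie_id)
    user_to_ratings[user_id].append(rating)

for user_id in sorted(user_to_movies):
    print(user_id, user_to_movies[user_id], user_to_ratings[user_id])
```

Each user ends up with two parallel lists, which is the same shape as one row of the aggregated Dataframe.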

User Feature Vocab

Our user vocab will be slightly different than our movie vocab. This is because we will actually ignore the userId and not use it to look up some predefined embedding. Instead, we will map every user to a feature vector that is made up of two parts:

The movies the user has already watched
The avg rating per genre for this user (think of this as the user’s preferences)

This means it is entirely possible that two unique users have the same feature vector representation.

# build the USER context
user_context_size = len(user_context_movies) + len(genres)

user_context_movieId_to_i = {s:i for i,s in enumerate(list(user_context_movies))}
user_context_i_to_movieId = {i:s for s,i in user_context_movieId_to_i.items()}

user_context_genre_to_i = {s:i+len(user_context_movies) for i,s in enumerate(list(genres))}
user_context_i_to_genre = {i:s for s,i in user_context_genre_to_i.items()}

The full feature vector for a user will be one vector which is a concatenation of their watch history and their genre preferences. Let’s assume we have 3 movies and 3 genres.

Movies: Movie1, Movie2, Movie3
Genres: Action, Horror, and Comedy

For a user that liked Movie1 and disliked Movie3, and likes Action and Comedy but hates Horror, their feature vector would look like:

['movie1', 'movie2', 'movie3', 'action', 'horror', 'comedy']
[     1.0,      0.0,     -1.0,      1.0,     -1.0,      1.0]
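The toy vector above can be reproduced with a few lines of Python. The names below (`build_user_vector`, `movie_to_index`, `genre_to_slot`) are hypothetical and only mirror the layout used by the real vocab: movie slots first, genre slots after them.

```python
movies = ["movie1", "movie2", "movie3"]
genre_names = ["action", "horror", "comedy"]

# index layout: movie slots first, then genre slots offset by the number of movies.
movie_to_index = {m: i for i, m in enumerate(movies)}
genre_to_slot = {g: i + len(movies) for i, g in enumerate(genre_names)}

def build_user_vector(movie_signals, genre_signals):
    """Fill a zero vector with the user's movie and genre signals."""
    vector = [0.0] * (len(movies) + len(genre_names))
    for movie, value in movie_signals.items():
        vector[movie_to_index[movie]] = value
    for genre, value in genre_signals.items():
        vector[genre_to_slot[genre]] = value
    return vector

user_vector = build_user_vector(
    {"movie1": 1.0, "movie3": -1.0},                  # liked movie1, disliked movie3
    {"action": 1.0, "horror": -1.0, "comedy": 1.0},   # genre preferences
)
print(user_vector)
# → [1.0, 0.0, -1.0, 1.0, -1.0, 1.0]
```

Movies the user never rated (movie2 here) simply stay at 0.0.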

Generating Training Examples

Next, we simulate real world training examples by masking out some of the user’s watched movies from their context, and using them as labels. We do not want the ‘movie to predict’ in their watch history, as we are trying to simulate the following: given the user’s other watched movies, what would they rate this new movie?

NOTE: this is not the same as a train/test split. This is just simulating what training examples would look like on a movie platform.

WARNING: In the real world, as a user watches movies organically, you’d build up their watch history naturally and train a model using their older watch history to predict their most recent watches. If we wanted to be more correct, we could sort our ratings by timestamp and use older watches to predict newer watches, but it’s not necessary for the sake of this tutorial.

import random

percent_ratings_as_watch_history = 0.8

user_to_movie_to_rating_WATCH_HISTORY = {}
user_to_movie_to_rating_LABEL = {}

# loop over each column as this is much, much faster than going row by row.
user_list = df_ratings_aggregated['userId'].tolist()
movieId_list_list = df_ratings_aggregated['movieId'].tolist()
rating_list_list = df_ratings_aggregated['rating'].tolist()

for i in range(len(user_list)):
  userId = user_list[i]
  movieId_list = movieId_list_list[i]
  rating_list = rating_list_list[i]

  num_rated_movies = len(movieId_list)

  # ignore users with too few ratings.
  if num_rated_movies <= 5: continue

  # set up training example maps.
  user_to_movie_to_rating_WATCH_HISTORY[userId] = {}
  user_to_movie_to_rating_LABEL[userId] = {}

  # shuffle the user's movies that they have watched
  rated_movies = list(zip(movieId_list, rating_list))
  random.shuffle(rated_movies)

  # put some movies into user's watch history (features) and leave others as labels to predict.
  for movieId,rating in rated_movies[:int(num_rated_movies * percent_ratings_as_watch_history)]:
    user_to_movie_to_rating_WATCH_HISTORY[userId][movieId] = rating
  for movieId,rating in rated_movies[int(num_rated_movies * percent_ratings_as_watch_history):]:
    user_to_movie_to_rating_LABEL[userId][movieId] = rating
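If you did want the time-ordered variant mentioned in the warning above, the shuffle can be swapped for a sort by timestamp (the MovieLens ratings file includes a timestamp column). This is a minimal sketch on made-up data; the function name and the tuple layout are assumptions, not the notebook’s code.

```python
def split_by_time(rated_movies, history_fraction=0.8):
    """Split one user's ratings into (watch history, labels) by time.

    rated_movies: list of (movieId, rating, timestamp) tuples for one user.
    The oldest ratings become the watch history; the newest become labels.
    """
    ordered = sorted(rated_movies, key=lambda row: row[2])
    cutoff = int(len(ordered) * history_fraction)
    return ordered[:cutoff], ordered[cutoff:]

# toy ratings for a single user (timestamps are arbitrary integers).
toy = [(318, 5.0, 300), (356, 3.0, 100), (296, 4.5, 500), (2571, 4.0, 200), (593, 2.0, 400)]
history, labels = split_by_time(toy)

print([movie_id for movie_id, _, _ in history])
print([movie_id for movie_id, _, _ in labels])
```

With this split the model only ever predicts ratings that came after everything in the user’s context, which is closer to how it would be used in production.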

Set up Feature Vectors

First, we need each user’s average rating. This is so we can de-bias each rating. If the user’s rating is above their average for a movie, we will treat that as a positive value in their vector. Opposite for ratings below their average. This helps the model learn likes and dislikes.

user_to_avg_rating = {}

# NOTE: only use ratings from their synthetic watch history.
for user in user_to_movie_to_rating_WATCH_HISTORY.keys():
  user_to_avg_rating[user] = 0
  for movieId in user_to_movie_to_rating_WATCH_HISTORY[user].keys():
    user_to_avg_rating[user] += user_to_movie_to_rating_WATCH_HISTORY[user][movieId]

  user_to_avg_rating[user] /= len(user_to_movie_to_rating_WATCH_HISTORY[user].keys())
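As a toy illustration of the de-biasing idea (the ratings below are made up): subtracting the user’s own average turns raw star ratings into signed ‘liked it more/less than usual’ values.

```python
# one user's ratings from their (synthetic) watch history.
watch_history = {"movie1": 5.0, "movie2": 3.0, "movie3": 1.0}

# the user's average rating.
user_avg = sum(watch_history.values()) / len(watch_history)

# de-biased rating: positive means above this user's average, negative means below.
debiased = {movie: rating - user_avg for movie, rating in watch_history.items()}

print(user_avg)
print(debiased)
# → 3.0
# → {'movie1': 2.0, 'movie2': 0.0, 'movie3': -2.0}
```

A 3-star rating from this user is neutral, while the same 3 stars from a user who averages 2 would count as a positive signal.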

Next, let’s get each user’s preference for each genre. We will compute the user’s average rating for each genre and de-bias it.

# for every user, get the avg rating for every genre
user_to_genre_to_stat = {}

# NOTE: only use ratings from their synthetic watch history.
for user in user_to_movie_to_rating_WATCH_HISTORY.keys():
  user_to_genre_to_stat[user] = {}
  for movieId in user_to_movie_to_rating_WATCH_HISTORY[user].keys():
    for genre in movieId_to_genres[movieId]:
      if genre not in user_to_genre_to_stat[user]:
        user_to_genre_to_stat[user][genre] = {
            'NUM_RATINGS': 0,
            'SUM_RATINGS': 0,
        }

      user_to_genre_to_stat[user][genre]['NUM_RATINGS'] += 1
      user_to_genre_to_stat[user][genre]['SUM_RATINGS'] += user_to_movie_to_rating_WATCH_HISTORY[user][movieId]

for user in user_to_genre_to_stat.keys():
  for genre in user_to_genre_to_stat[user].keys():
    num_ratings = user_to_genre_to_stat[user][genre]['NUM_RATINGS']
    sum_ratings = user_to_genre_to_stat[user][genre]['SUM_RATINGS']
    user_to_genre_to_stat[user][genre]['AVG_RATING'] = sum_ratings / num_ratings

Finally, we can build a feature ‘context’ vector for every user using their watch history and genre preferences.

# for every user, create the training example user context vector
# 0:num_user_context_movies -> user's watch history
# num_user_context_movies:num_user_context_movies+num_genres -> user's genre affinity
user_to_context = {}
for user in user_to_movie_to_rating_WATCH_HISTORY.keys():
  context = [0.0] * user_context_size

  for movieId in user_to_movie_to_rating_WATCH_HISTORY[user].keys():
    if movieId in user_context_movies:
      # note, we debias the rating so if the rating is under the user's avg rating,
      # it will hopefully count as negative strength for predicting similar movies.
      # vice-versa for a rating above the user's average.
      context[user_context_movieId_to_i[movieId]] = float(user_to_movie_to_rating_WATCH_HISTORY[user][movieId] - user_to_avg_rating[user])

  for genre in user_to_genre_to_stat[user].keys():
    context[user_context_genre_to_i[genre]] = float(user_to_genre_to_stat[user][genre]['AVG_RATING'] - user_to_avg_rating[user])

  user_to_context[user] = context

We also need to set up the feature vector for each movie. This is much simpler since it’s just a binary mask vector for every genre the movie has.

# for every movie, create a training example feature context vector lookup
# it will contain the movie's genres.
movieId_to_context = {}
for movieId in top_movies:
  context = [0.0] * len(genres)

  for genre in movieId_to_genres[movieId]:
    context[genre_to_i[genre]] = float(1.0)

  movieId_to_context[movieId] = context

Designing our Model Architecture

Before we build the final dataset, it would be helpful to know why it will be the way it is. Unlike most datasets you might have seen with just an X and Y matrix to hold inputs and labels, we are building a Two Tower model (technically it has 3 inputs).

Here is our model and I’ll explain it in detail.

'''
user_features ---------------> u_W1
                                    \
                                     \
                                      --> dot_product(user, item) --> prediction
                                     /
movie_features  -> i_W1             /
                        \          /
                         --> stack
                        /
movie_embedding -> e_W1
'''

Our model will have 3 inputs that each feed into a non-linear layer.

The user’s context feature vector
The movie’s context feature vector
The movie’s id embedding vector

The final output is a prediction for the user’s rating for the movie. This prediction is based on the user’s features, the movie’s features, and the learned embedding for the unique movie id.

To get the prediction, we concatenate the hidden embeddings from both movie inputs, and compute the dot product of the ‘combined movie embedding vector’ and the ‘user embedding vector’.

We will train this model by simply computing the loss between the user’s actual rating for this movie and the predicted rating.

Backpropagation works normally even for this ‘Two Tower’ model.
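Written out as equations, the diagram above looks roughly like this. The symbols are my own shorthand rather than anything from the notebook: $x_u$ is the user’s context vector, $x_m$ is the movie’s genre vector, $E[m]$ is the row of the movie id embedding table for movie $m$, $y$ is the de-biased rating used as the label, and $B$ is the batch size. The weight names match the diagram, and tanh is the non-linearity used in the training code later in this post.

```latex
\begin{aligned}
u &= \tanh\left(x_u \, u\_W1 + u\_b1\right) && \text{(user tower)} \\
f &= \tanh\left(x_m \, i\_W1 + i\_b1\right) && \text{(movie feature tower)} \\
e &= \tanh\left(E[m] \, e\_W1 + e\_b1\right) && \text{(movie id embedding tower)} \\
\hat{y} &= u \cdot [\, f \,;\, e \,] && \text{(dot product with the stacked item vector)} \\
\mathcal{L} &= \frac{1}{B} \sum_{b=1}^{B} \left(\hat{y}_b - y_b\right)^2 && \text{(mean squared error over a batch)}
\end{aligned}
```

Note that the dot product only works if the user vector $u$ has the same length as the stacked item vector $[f ; e]$, which is why the user tower’s output size must equal the sum of the two item output sizes.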

Building our Dataset

Now that you understand the inputs and output of our model, let’s actually build the Dataset. It consists of 4 parts:

The user context feature vectors, held in matrix X
The target movie’s id, held in vector target_movieId
The target movie’s context feature vectors, held in matrix target_movieId_context
The target movie’s actual rating, held in vector Y

Each part of the Dataset will be converted to a Pytorch Tensor so it can be used in Pytorch functions to feed and train the model.

import torch

# Build the final Dataset
def build_dataset(users):
  # the user context (i.e. the watch history and genre affinities)
  X = []

  # the movieID for the movie we will predict rating for.
  # used to lookup the movie embedding to feed into the NN item tower.
  target_movieId = []

  # the feature context of the movie we will predict the rating for.
  # will also feed into its own embedding and will be stacked with the embedding above.
  target_movieId_context = []

  # the actual (de-biased) rating we want to predict, used as the label
  Y = []

  # create training examples, one for each movie the user has that we want as a label.
  for user in users:
    for movieId in user_to_movie_to_rating_LABEL[user].keys():
      X.append(user_to_context[user])

      target_movieId.append(item_emb_movieId_to_i[movieId])

      target_movieId_context.append(movieId_to_context[movieId])

      # remember to debias the user rating so we can learn to predict if user
      # like/dislike a movie based on their features and the movie features.
      Y.append(float(user_to_movie_to_rating_LABEL[user][movieId] - user_to_avg_rating[user]))

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  target_movieId = torch.tensor(target_movieId)
  target_movieId_context = torch.tensor(target_movieId_context)

  return X,Y,target_movieId,target_movieId_context

Train/Validation Split

Before we call the build_dataset function, let’s split up some users into Train and some users into Validation.

WARNING: It would be more correct to shuffle the Dataset and include some training examples into our training set and others into validation set. But for the simplicity of this example, I will simply use some users for training and some for validation.

# use users with enough ratings to predict to be useful for model learning.
final_users = []

for user in user_to_movie_to_rating_LABEL.keys():
  num_ratings = len(user_to_movie_to_rating_LABEL[user])

  if num_ratings >= 2 and num_ratings < 500:
    final_users.append(user)

# split users into train and validation users
percent_users_train = 0.8

random.shuffle(final_users)

train_users = final_users[:int(len(final_users) * percent_users_train)]
validation_users = final_users[int(len(final_users) * percent_users_train):]

Finally, let’s get our training and validation Datasets.

X_train, Y_train, target_movieId_train, target_movieId_context_train = build_dataset(train_users)
X_val, Y_val, target_movieId_val, target_movieId_context_val = build_dataset(validation_users)

Building our Model

Below, we will actually build from scratch the entire model (i.e. all the weights and biases). Notice the input dimensions of each of the parts. Each weight matrix links up to one of our inputs. i_W1 will match the dimensions of the movie feature vector, which is the number of genres. e_W1 is any size we want since we are creating an ITEM_EMBEDDING_LOOKUP. If this concept is confusing, please see Andrej Karpathy’s amazing series: Neural Networks: Zero to Hero.

NOTE: using an embedding lookup table is no different than simply creating a one-hot encoded vector of size len(top_movies) and just multiplying it by some matrix. However, that way is extremely inefficient and I was unable to train any decently sized model due to RAM constraints.
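The equivalence in the note above is easy to check on a tiny example. This is a standalone sketch with made-up sizes (5 movies, embedding size 3), not the notebook’s actual table:

```python
import torch

torch.manual_seed(0)
num_movies, emb_size = 5, 3
embedding_table = torch.randn(num_movies, emb_size)

# indices of the movies in a small batch.
movie_indices = torch.tensor([2, 0, 4])

# Option 1: direct row lookup (what the embedding lookup table does).
looked_up = embedding_table[movie_indices]

# Option 2: one-hot vectors multiplied by the same matrix.
one_hot = torch.nn.functional.one_hot(movie_indices, num_classes=num_movies).float()
multiplied = one_hot @ embedding_table

print(torch.allclose(looked_up, multiplied))
# → True
```

The one-hot version has to materialize a (batch size x number of movies) matrix just to select rows, which is where the extra memory goes.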

Few other small points:

I scale the weights down a little to prevent early training iterations having crazy loss. If this was an actual production model, I’d probably apply BatchNorm on it.
I’m using MSELoss which means we will get the average ‘Squared Error’ loss on all examples. This just means we square the difference of the real rating vs the predicted rating.
We set a batch size of 64.
I create two lists to hold our loss for our full training set and validation set. We will plot this later.

g = torch.Generator().manual_seed(42)

# ITEM movie feature tower
item_feature_embedding_size = 25
i_W1 = torch.randn((len(genres), item_feature_embedding_size), generator=g)
i_b1 = torch.randn(item_feature_embedding_size, generator=g)

# ITEM movie embedding tower
item_movieId_embedding_size = 25
ITEM_EMBEDDING_LOOKUP = torch.rand((len(top_movies), item_movieId_embedding_size), generator=g)
e_W1 = torch.randn((item_movieId_embedding_size, item_movieId_embedding_size), generator=g)
e_b1 = torch.randn(item_movieId_embedding_size, generator=g)

# USER feature tower
user_feature_embedding_size = 50 # must be the concat dimension of both item embeddings.
u_W1 = torch.randn((user_context_size, user_feature_embedding_size), generator=g)
u_b1 = torch.randn(user_feature_embedding_size, generator=g)

# create a list of all our TRAINABLE params
parameters = [
    i_W1, i_b1,
    ITEM_EMBEDDING_LOOKUP, e_W1, e_b1,
    u_W1, u_b1,
]

# scale down the initial weight values.
weight_scale = 0.1
for p in parameters:
  p *= weight_scale

# set all parameters to require gradients
for p in parameters:
  p.requires_grad = True

# print number of trainable params in our NN
print(sum(p.nelement() for p in parameters))

# set the loss function we want to use.
# we use MSE Loss because we are predicting the rating per label movie.
loss = torch.nn.MSELoss()

# set how big we want each minibatch to be
minibatch_size = 64

# create list to hold our loss per training iterations
loss_train = []
loss_val = []

Training our Model

Below is the actual code to train our model. I write the training loop out manually, rather than using a Torch Module or optimizer, so we can control and study every single part and really understand each step.

Some notes:

Every 1000 iterations, we will compute our loss on the full validation set.
If we are doing a validation run, we will not backprop (won’t train)
If we are doing a full validation run, we will use our validation Dataset pieces.
torch.einsum is how we will do batched dot products of our user and movie embeddings to get the final prediction.
We will gradually decrease our learning rate. We could use an optimizer, but I wanted to avoid Torch libraries.
We record our avg loss during training and for each full validation run.

import numpy as np

log_every = 1000

for i in range(0, 50_000):

  # every so often, compute the loss on the entire validation set (no training on these runs)
  is_full_val_run = False
  if i % log_every == 0:
    is_full_val_run = True

  # select training example inputs we use for this run, and minibatch indices.
  X = X_train
  Y = Y_train
  target_movieId_context = target_movieId_context_train
  target_movieId = target_movieId_train
  if is_full_val_run:
    X = X_val
    Y = Y_val
    target_movieId_context = target_movieId_context_val
    target_movieId = target_movieId_val

  # construct a minibatch
  ix = torch.randint(0, X.shape[0], (minibatch_size,))
  if is_full_val_run:
    ix = torch.arange(X.shape[0])  # use every validation example exactly once

  # ---------- FORWARD PASS ----------

  # forward the USER tower.
  user_contexts = X[ix]
  user_embedding = torch.tanh(user_contexts @ u_W1 + u_b1)

  # forward the ITEM movie feature tower
  movie_contexts = target_movieId_context[ix]
  item_feature_embedding = torch.tanh(movie_contexts @ i_W1 + i_b1)

  # lookup the ITEM movieId embedding and pass through non-linear layer.
  # NOTE: this is just a shortcut to multiplying a one-hot vector with the masked movieID with a weight matrix.
  item_embedding_hidden = torch.tanh(ITEM_EMBEDDING_LOOKUP[target_movieId[ix]] @ e_W1 + e_b1)

  # concat/stack the two ITEM embeddings together
  item_embedding_combined = torch.cat((item_feature_embedding.view(item_feature_embedding.size(0), -1),
                                       item_embedding_hidden.view(item_embedding_hidden.size(0), -1)), dim=1)

  # the final prediction is the dot product of the user embedding and the combined item embedding.
  # NOTE: because we have a batch of these, we will use torch.einsum to do this efficiently.
  preds = torch.einsum('ij, ij -> i', user_embedding, item_embedding_combined)

  # compute the loss of our predicted ratings
  output = loss(preds, Y[ix])

  # backpropagation and update weights (except on validation runs)
  if not is_full_val_run:
    for p in parameters:
      p.grad = None

    output.backward()

    # update weights using gradients * learning_rate
    lr = 0.1
    if i >= 10_000: lr = 0.05
    if i >= 20_000: lr = 0.01
    if i >= 30_000: lr = 0.005
    for p in parameters:
      p.data += (lr * -p.grad)

  # every so often, log the MSE loss on full val set (see above)
  if is_full_val_run:
    loss_val.append(output.item())

    if i >= log_every:
      avg_train_loss_last_batches = np.mean(loss_train[i-log_every:i])
    else:
      avg_train_loss_last_batches = output.item()
    print("[TRAIN] i: ", i, " | ", "loss: ", avg_train_loss_last_batches)
    print("[VAL] i: ", i, " | ", "loss: ", output.item())
    print()
  else:
    loss_train.append(output.item())

As the model trains, we should see something like this being printed:

[TRAIN] i:  0  |  loss:  0.9906623363494873
[VAL] i:  0  |  loss:  1.0060031414031982

[TRAIN] i:  1000  |  loss:  0.8815735578536987
[VAL] i:  1000  |  loss:  0.896892786026001

[TRAIN] i:  2000  |  loss:  0.8524652123451233
[VAL] i:  2000  |  loss:  0.8721503019332886
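If the torch.einsum call in the training loop looks cryptic: 'ij, ij -> i' multiplies the two matrices element by element and sums across each row, which gives one dot product per example in the batch. A small check on made-up numbers:

```python
import torch

# a batch of 2 user embeddings and 2 item embeddings, each of size 2.
user_embedding = torch.tensor([[1.0, 2.0], [0.5, -1.0]])
item_embedding = torch.tensor([[3.0, 4.0], [2.0, 2.0]])

# batched dot product via einsum, as in the training loop.
preds_einsum = torch.einsum('ij, ij -> i', user_embedding, item_embedding)

# the same thing written as an elementwise multiply followed by a row sum.
preds_manual = (user_embedding * item_embedding).sum(dim=1)

print(preds_einsum)
print(torch.allclose(preds_einsum, preds_manual))
```

The first row gives 1*3 + 2*4 = 11 and the second gives 0.5*2 + (-1)*2 = -1, one prediction per (user, movie) pair.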

Finally, we can plot our training and validation losses versus each iteration.

import matplotlib.pyplot as plt

loss_train_bucket_means = []
for i in range(0, len(loss_train), log_every):
  loss_train_bucket_means.append(np.mean(loss_train[i:i+log_every]))

# x-axis is the iteration number; one point per log_every iterations
plt.plot([i*log_every for i in range(len(loss_train_bucket_means))], loss_train_bucket_means)
plt.plot([i*log_every for i in range(1, len(loss_val))], loss_val[1:])

It will look something like this:


Example train vs val loss plot

Actually Using our Model

Now for the fun part: let’s actually use this trained model to generate recommendations for different types of users we might see (if we were Netflix for example).

Precomputing Movie Embeddings

In order to get recommendations, we will feed a new user feature vector through our model and get a predicted rating for every movie in top_movies. To get a prediction for a movie, we need the item embedding in order to do the dot product with the user embedding. For different users, we don’t need to recompute these embeddings: once we have them for every movie in our catalog, we can re-use them!

We can compute our final embeddings for every movie all at once, then save them to a lookup map, and then easily use them later for any user (no need to ever do a forward pass in the Item Tower).

# for every movie, save all its embeddings
movieId_to_embedding = {}

for movieId in top_movies:
  movieId_to_embedding[movieId] = {}

  item_embedding = ITEM_EMBEDDING_LOOKUP[torch.tensor([item_emb_movieId_to_i[movieId]])]
  movieId_to_embedding[movieId]['MOVIEID_EMBEDDING'] = torch.tanh(item_embedding @ e_W1 + e_b1)

  movieId_to_embedding[movieId]['MOVIE_FEATURE_EMBEDDING'] = torch.tanh(torch.tensor([movieId_to_context[movieId]]) @ i_W1 + i_b1)

  # compute the combined (concat) item/movie embedding
  item_id_emb = movieId_to_embedding[movieId]['MOVIEID_EMBEDDING']
  item_feature_emb = movieId_to_embedding[movieId]['MOVIE_FEATURE_EMBEDDING']
  movieId_to_embedding[movieId]['MOVIE_EMBEDDING_COMBINED'] = torch.cat((item_feature_emb.view(item_feature_emb.size(0), -1),
                                       item_id_emb.view(item_id_emb.size(0), -1)), dim=1)
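
The loop above handles one movie at a time. As a rough sketch of the "all at once" idea, the same embeddings can be computed for the whole catalog with a couple of matrix multiplies. Everything below uses randomly initialized stand-ins with made-up sizes; `item_contexts` is a hypothetical matrix holding one movie feature row per movie, stacked in the same order as the rows of ITEM_EMBEDDING_LOOKUP.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes and random stand-ins for the trained tensors.
num_movies, id_emb_size, context_size, hidden = 5, 8, 12, 4
ITEM_EMBEDDING_LOOKUP = torch.randn(num_movies, id_emb_size)
e_W1, e_b1 = torch.randn(id_emb_size, hidden), torch.randn(hidden)
i_W1, i_b1 = torch.randn(context_size, hidden), torch.randn(hidden)
item_contexts = torch.randn(num_movies, context_size)  # assumed: one feature row per movie

# No gradients are needed once training is done.
with torch.no_grad():
  id_emb = torch.tanh(ITEM_EMBEDDING_LOOKUP @ e_W1 + e_b1)   # (num_movies, hidden)
  feature_emb = torch.tanh(item_contexts @ i_W1 + i_b1)      # (num_movies, hidden)
  # Same concat order as the loop: feature embedding first, then movieId embedding.
  combined = torch.cat((feature_emb, id_emb), dim=1)         # (num_movies, 2 * hidden)

# Sanity check: row 2 of the batched result matches the one-movie-at-a-time computation.
row = 2
single = torch.cat((torch.tanh(item_contexts[row:row+1] @ i_W1 + i_b1),
                    torch.tanh(ITEM_EMBEDDING_LOOKUP[row:row+1] @ e_W1 + e_b1)), dim=1)
print(torch.allclose(combined[row:row+1], single, atol=1e-5))
```

With a matrix like `combined` in hand, a lookup map per movieId is still handy for debugging, but scoring can work directly on the stacked rows.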

Finding Most Similar Movies

Since we now have a vector representation of every movie, we can easily find each movie’s most similar movies. This can be useful by itself and companies like Amazon use similar embeddings to power things like ‘Similar to What you just Bought’.

# for every movie, and for every embedding type, find the Euclidean distance to all other embeddings (smaller distance = more similar)
# NOTE: can be slow
movieId_to_emb_type_to_similarities = {}

for movieId in top_movies:
  movieId_to_emb_type_to_similarities[movieId] = {}

  for emb_type in movieId_to_embedding[movieId].keys():
    emb_to_target_to_dist = {}
    for target_id in top_movies:
      src = movieId_to_embedding[movieId][emb_type].view(-1)
      target = movieId_to_embedding[target_id][emb_type].view(-1)

      distance = torch.sqrt(torch.sum(torch.pow(torch.subtract(src, target), 2), dim=0))
      emb_to_target_to_dist[target_id] = distance.item()
    movieId_to_emb_type_to_similarities[movieId][emb_type] = sorted(emb_to_target_to_dist.items(), key=lambda item: item[1])
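
The nested loops above are easy to read but slow for a large catalog. One possible speed-up, sketched here under the assumption that the embeddings of a given type have been stacked into a single matrix (`combined` below is a random stand-in with made-up sizes), is to compute all pairwise Euclidean distances at once with `torch.cdist` and pick the nearest neighbors with `torch.topk`.

```python
import torch

torch.manual_seed(0)

# Random stand-in for a stacked embedding matrix: one row per movie.
num_movies, emb_size, k = 6, 8, 3
combined = torch.randn(num_movies, emb_size)

# All pairwise Euclidean distances in one call: entry [a, b] is the distance from movie a to movie b.
dists = torch.cdist(combined, combined, p=2)

# Every movie is (numerically close to) distance 0 from itself, so mask the diagonal out.
dists.fill_diagonal_(float('inf'))

# The k smallest distances per row are the k most similar movies, nearest first.
nearest_dist, nearest_idx = torch.topk(dists, k=k, dim=1, largest=False)

print(nearest_idx[0].tolist())  # row indices of the movies closest to movie 0
```

The row indices would still need to be mapped back to movieIds and titles, in whatever order the rows were stacked.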

Most Similar to: Lord of the Rings: The Return of the King, The (2003)

Lord of the Rings: The Fellowship of the Ring, The (2001)
Lord of the Rings: The Two Towers, The (2002)
Hobbit: An Unexpected Journey, The (2012)
Gladiator (2000)
Dune (2021)

Most Similar to: Star Wars: Episode IV - A New Hope (1977)

Star Wars: Episode V - The Empire Strikes Back (1980)
Star Wars: Episode VI - Return of the Jedi (1983)
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Indiana Jones and the Last Crusade (1989)
Ghostbusters (a.k.a. Ghost Busters) (1984) 

Most Similar to: Toy Story (1995)

Toy Story 2 (1999)
Toy Story 3 (2010)
Monsters, Inc. (2001)
Inside Out (2015)
Bug's Life, A (1998)

Most Similar to: Saving Private Ryan (1998)

Braveheart (1995)
Black Hawk Down (2001)
Last of the Mohicans, The (1992)
Untouchables, The (1987)
Dirty Dozen, The (1967)

Most Similar to: Kill Bill: Vol. 1 (2003)

Ronin (1998)
French Connection, The (1971)
Run Lola Run (Lola rennt) (1998)
Sin City (2005)
Fistful of Dollars, A (Per un pugno di dollari) (1964)

Most Similar to: American Pie (1999)

American Pie 2 (2001)
Liar Liar (1997)
Wedding Singer, The (1998)
Meet the Parents (2000)
Wedding Crashers (2005)

Most Similar to: Princess Mononoke (Mononoke-hime) (1997)

Spirited Away (Sen to Chihiro no kamikakushi) (2001)
Howl's Moving Castle (Hauru no ugoku shiro) (2004)
Spider-Man: Into the Spider-Verse (2018)
Akira (1988)
Ghost in the Shell (Kôkaku kidôtai) (1995)

Inference: Getting Recommendations

Now for the best part: let’s actually get some recommendations for different types of movie lovers!

To get recommendations, we want to build the user’s feature vector based on the genres they like/dislike and the movies they liked/disliked. Then, we pass the user’s context through the user weight matrix u_W1 and that gives us the final user embedding. We then just compute the dot product with the combined item embedding of all movies and we will get a predicted rating for every movie!

So, the general flow is like this:

Construct the user feature vector
Pass it through the User Tower to get user embedding
Iterate over all movies, and compute the dot product of the user embedding with each movie’s combined embedding
Sort by predicted score and return the top movies.

It will look something like this:

user_context = get_user_context() # placeholder for now
X_inference = torch.tensor([user_context])
user_embedding_inference = torch.tanh(X_inference @ u_W1 + u_b1)

movieId_to_pred_score = {}
for movieId in top_movies:
  # we already have the combined item embedding for every movie to make inference easier.
  item_embedding_combined_inference = movieId_to_embedding[movieId]['MOVIE_EMBEDDING_COMBINED']
  movieId_to_pred_score[movieId] = torch.einsum('ij, ij -> i', user_embedding_inference, item_embedding_combined_inference).item()

Let’s build some synthetic users and we will use their user contexts to generate new recommendations for them.

user_type_to_favorite_genres = {
    'Fantasy Lover': ['Fantasy'],
    'Children\'s Movie Lover': ['Children'],
    'Horror Lover': ['Horror'],
    'Sci-Fi Lover': ['Sci-Fi'],
    'Comedy Lover': ['Comedy'],
    'Romance Lover': ['Romance'],
    'War Movie Lover': ['War']
}

user_type_to_worst_genres = {
    'Fantasy Lover': ['Horror', 'Children'],
    'Children\'s Movie Lover': ['Horror', 'Romance', 'Drama'],
    'Horror Lover': ['Children'],
    'Sci-Fi Lover': ['Romance', 'Children'],
    'Comedy Lover': ['Children'],
    'Romance Lover': ['Children', 'Horror'],
    'War Movie Lover': ['Children']
}

user_type_to_favorite_movies = {
    'Fantasy Lover': [
        'Lord of the Rings: The Fellowship of the Ring, The (2001)',
        'Gladiator (2000)',
        '300 (2007)',
        'Braveheart (1995)'
        ],
    'Children\'s Movie Lover': [
        'Toy Story 2 (1999)',
        'Finding Nemo (2003)',
        'Monsters, Inc. (2001)'
        ],
    'Horror Lover': [
        'Blair Witch Project, The (1999)',
        'Silence of the Lambs, The (1991)',
        'Sixth Sense, The (1999)'
        ],
    'Sci-Fi Lover': [
        'Star Wars: Episode V - The Empire Strikes Back (1980)',
        'Matrix, The (1999)',
        'Terminator, The (1984)'
        ],
    'Comedy Lover': [
        'American Pie (1999)',
        'Dumb & Dumber (Dumb and Dumber) (1994)',
        'Austin Powers: The Spy Who Shagged Me (1999)',
        'Big Lebowski, The (1998)'
      ],
    'Romance Lover': [
        'Shakespeare in Love (1998)',
        'There\'s Something About Mary (1998)',
        'Sense and Sensibility (1995)'
    ],
    'War Movie Lover': [
        'Saving Private Ryan (1998)',
        'Apocalypse Now (1979)',
        'Full Metal Jacket (1987)'
    ]
}

user_to_inference_context = {}

for user_type in user_type_to_favorite_genres.keys():
  inference_user_context = [0.0] * user_context_size

  # set genres the user likes.
  for genre in user_type_to_favorite_genres[user_type]:
    inference_user_context[user_context_genre_to_i[genre]] = float(2.0)

  # set genres that the user dislikes
  for genre in user_type_to_worst_genres[user_type]:
    inference_user_context[user_context_genre_to_i[genre]] = float(-2.0)

  # set the user's favorite movies.
  for title in user_type_to_favorite_movies[user_type]:
    movieId = title_to_movieId[title]
    inference_user_context[user_context_movieId_to_i[movieId]] = float(2.0)

  user_to_inference_context[user_type] = inference_user_context

Get their top recommendations:

user_to_top_recs = {}

for user_type in user_to_inference_context.keys():

  X_inference = torch.tensor([user_to_inference_context[user_type]])
  user_embedding_inference = torch.tanh(X_inference @ u_W1 + u_b1)

  movieId_to_pred_score = {}
  for movieId in top_movies:
    # we already have the combined item embedding for every movie to make inference easier.
    item_embedding_combined_inference = movieId_to_embedding[movieId]['MOVIE_EMBEDDING_COMBINED']
    movieId_to_pred_score[movieId] = torch.einsum('ij, ij -> i', user_embedding_inference, item_embedding_combined_inference).item()

  top_recs = []
  for movieId, pred_score in list(sorted(movieId_to_pred_score.items(), key=lambda item: item[1], reverse=True)):
    if len(top_recs) >= 10: break
    if movieId_to_title[movieId] not in user_type_to_favorite_movies[user_type]:
      top_recs.append(movieId)
  user_to_top_recs[user_type] = top_recs
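
Scoring one movie per loop iteration works fine for a small catalog. As an alternative sketch, if the combined item embeddings are stacked into one matrix, all predicted scores for a user come out of a single matrix multiply. The tensors, the `movie_ids` list, and the `already_liked` set below are made-up stand-ins, not values from the notebook.

```python
import torch

torch.manual_seed(0)

num_movies, emb_size, k = 20, 8, 10
item_matrix = torch.randn(num_movies, emb_size)       # stand-in: stacked combined item embeddings
user_embedding_inference = torch.randn(1, emb_size)   # stand-in: output of the User Tower
movie_ids = list(range(100, 100 + num_movies))        # hypothetical movieIds, same order as the rows
already_liked = {103, 111}                            # hypothetical favorites to exclude

# One dot product per movie, all at once: (num_movies, emb) @ (emb, 1) -> (num_movies,)
scores = (item_matrix @ user_embedding_inference.T).squeeze(1)

# Push the user's known favorites to the bottom so they are never recommended back.
liked_rows = [i for i, movieId in enumerate(movie_ids) if movieId in already_liked]
scores[liked_rows] = float('-inf')

# Highest predicted scores first.
top_scores, top_rows = torch.topk(scores, k=k)
top_recs = [movie_ids[i] for i in top_rows.tolist()]
print(top_recs)
```

This gives the same ranking as the loop, just without the Python-level iteration over every movie.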

Example Recommendations

Horror Lover

Hello, Horror Lover
Because you like: [Horror]
And hate: [Children]

And enjoyed these movies:
Blair Witch Project, The (1999)
Silence of the Lambs, The (1991)
Sixth Sense, The (1999)

You should watch:
Alien (1979)
Videodrome (1983)
Thing, The (1982)
Aliens (1986)
Psycho (1960)
Evil Dead, The (1981)
Shining, The (1980)
Night of the Living Dead (1968)
Invasion of the Body Snatchers (1956)
Get Out (2017)

Children’s Movie Lover

Hello, Children's Movie Lover
Because you like: [Children]
And hate: [Horror,Romance,Drama]

And enjoyed these movies:
Toy Story 2 (1999)
Finding Nemo (2003)
Monsters, Inc. (2001)

You should watch:
Zootopia (2016)
Kung Fu Panda 2 (2011)
Incredibles, The (2004)
Madagascar: Escape 2 Africa (2008)
Kung Fu Panda (2008)
Bolt (2008)
The Lego Movie (2014)
Megamind (2010)
Rango (2011)
Goonies, The (1985)

Sci-Fi Lover

Hello, Sci-Fi Lover
Because you like: [Sci-Fi]
And hate: [Romance,Children]

And enjoyed these movies:
Star Wars: Episode V - The Empire Strikes Back (1980)
Matrix, The (1999)
Terminator, The (1984)

You should watch:
Spider-Man: Into the Spider-Verse (2018)
Blade Runner (1982)
Aliens (1986)
Star Wars: Episode IV - A New Hope (1977)
Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) (1984)
Akira (1988)
Alien (1979)
Thing, The (1982)
Inception (2010)
Cowboy Bebop: The Movie (Cowboy Bebop: Tengoku no Tobira) (2001)

Comedy Lover

Hello, Comedy Lover
Because you like: [Comedy]
And hate: [Children]

And enjoyed these movies:
American Pie (1999)
Dumb & Dumber (Dumb and Dumber) (1994)
Austin Powers: The Spy Who Shagged Me (1999)
Big Lebowski, The (1998)

You should watch:
Sting, The (1973)
Thin Man, The (1934)
Kung Fu Hustle (Gong fu) (2004)
Some Like It Hot (1959)
Snatch (2000)
Midnight Run (1988)
What We Do in the Shadows (2014)
Office Space (1999)
21 Jump Street (2012)
Legend of Drunken Master, The (Jui kuen II) (1994)

Fantasy Lover

Hello, Fantasy Lover
Because you like: [Fantasy]
And hate: [Horror,Children]

And enjoyed these movies:
Lord of the Rings: The Fellowship of the Ring, The (2001)
Gladiator (2000)
300 (2007)

You should watch:
Princess Bride, The (1987)
Lord of the Rings: The Return of the King, The (2003)
Lord of the Rings: The Two Towers, The (2002)
Spirited Away (Sen to Chihiro no kamikakushi) (2001)
Monty Python and the Holy Grail (1975)
Yojimbo (1961)
Wings of Desire (Himmel über Berlin, Der) (1987)
Seven Samurai (Shichinin no samurai) (1954)
Star Wars: Episode IV - A New Hope (1977)
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)

Romance Lover

Hello, Romance Lover
Because you like: [Romance]
And hate: [Children,Horror]

And enjoyed these movies:
Shakespeare in Love (1998)
There's Something About Mary (1998)
Sense and Sensibility (1995)

You should watch:
Life Is Beautiful (La Vita è bella) (1997)
Casablanca (1942)
Roman Holiday (1953)
Shawshank Redemption, The (1994)
Singin' in the Rain (1952)
Rebecca (1940)
Good Will Hunting (1997)
Forrest Gump (1994)
Pride & Prejudice (2005)
Modern Times (1936)

War Movie Lover

Hello, War Movie Lover
Because you like: [War]
And hate: [Children]

And enjoyed these movies:
Saving Private Ryan (1998)
Apocalypse Now (1979)
Full Metal Jacket (1987)

You should watch:
Schindler's List (1993)
Shawshank Redemption, The (1994)
Boot, Das (Boat, The) (1981)
Godfather, The (1972)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
Grave of the Fireflies (Hotaru no haka) (1988)
Great Dictator, The (1940)
Ran (1985)
Pulp Fiction (1994)
Lawrence of Arabia (1962)

Anti-Recommendations

What are the WORST movies for certain types of users? Do we get their least favorite genres?

To get anti-recommendations, we just print the bottom 10 (lowest predicted score) recommendations.
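
As a small sketch of that idea: sort the predicted scores in ascending order instead of descending and take the first few. The scores below are invented for illustration; in the notebook they would come from movieId_to_pred_score for a given user.

```python
# Hypothetical predicted scores keyed by movieId (made-up numbers).
movieId_to_pred_score = {1: 4.2, 2: 1.3, 3: 3.8, 4: 0.7, 5: 2.9, 6: 1.1}
num_anti_recs = 3

# Ascending sort puts the lowest predicted ratings first.
anti_recs = sorted(movieId_to_pred_score.items(), key=lambda item: item[1])[:num_anti_recs]
anti_rec_ids = [movieId for movieId, _ in anti_recs]

print(anti_rec_ids)  # [4, 6, 2]
```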

Children’s Movie Lover - Anti Recs

Hilariously, among the worst movies for someone who likes Children’s Movies, hates Horror and Romance, are Twilight and Nightmare on Elm Street. Those are definitely the worst possible movies for this user.

Hello, Children's Movie Lover
Because you like: [Children]
And hate: [Horror,Romance,Drama]

And enjoyed these movies:
Toy Story 2 (1999)
Finding Nemo (2003)
Monsters, Inc. (2001)

You should NOT watch:
Twilight Saga: New Moon, The (2009)
Legends of the Fall (1994)
Twilight (2008)
Boxing Helena (1993)
Lost Highway (1997)
Wolf (1994)
Bodyguard, The (1992)
Amityville Horror, The (1979)
Wes Craven's New Nightmare (Nightmare on Elm Street Part 7: Freddy's Finale, A) (1994)
Mulholland Drive (2001)

Horror Movie Lover - Worst Recs

For someone who loves Horror and hates Children’s movies, Home Alone 3, Karate Kid, Free Willy, and Happy Feet are delightfully bad recommendations.

Hello, Horror Lover
Because you like: [Horror]
And hate: [Children]

And enjoyed these movies:
Blair Witch Project, The (1999)
Silence of the Lambs, The (1991)
Sixth Sense, The (1999)

You should NOT watch:
Inspector Gadget (1999)
Next Karate Kid, The (1994)
Home Alone 3 (1997)
Free Willy 2: The Adventure Home (1995)
Pocahontas (1995)
Super Mario Bros. (1993)
Happy Feet (2006)
Free Willy (1993)
Karate Kid, Part III, The (1989)
Honey, I Blew Up the Kid (1992)

Possible Improvements

Below are some possible ways to improve this current movie recommendation system:

Add better movie features
- Use the movie’s year from the title as a feature (some people might like movies from certain eras)
- Use the movie’s title as a bag of words feature (helps model find similar movies based on title)
- Find and use the movie’s director/publisher (helps the model find similar movies based on who made them)
- Find and use the movie’s actors (some people have favorite actors)
- Leverage MovieLens’ ‘movie tags’ (requires a lot of preprocessing but could be useful)
Add better user features
- Find and use the user’s demographic features
- Leverage MovieLens’ user occupation feature (only available for some datasets)
- Use the user’s favorite ‘decade’ as a kind of genre feature (would work well if we added movie’s decade-year)
- Use the user’s favorite directors
- Use the user’s favorite actors
Generate true training examples using the ratings timestamps
- Instead of randomly picking some movies to predict rating for, we can always use the last movies watched by the user as the labels and their earliest movie watches as their watch history
Add an attention/context component to the model [ADVANCED]
- Transformers are all the rage, and because users have an order in how they watched movies, we could use them.
Add movie reviews from IMDB or Rotten Tomatoes as text or sentiment features [ADVANCED]
- Reviews contain rich information about movies and also sentiment about if they are good or bad (in aggregate)
Build a Deeper model
- This model has only one non-linear layer per input. We could go much deeper, even after combining both user and item embeddings.
Improve the model
- Add BatchNorm, Dropout
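
To make the first idea in the list concrete, here is a rough sketch of pulling the release year (and decade) out of a MovieLens-style title, which ends with the year in parentheses. The helper names `extract_year` and `decade_of` are made up for this example, and the last title is a fake one included only to show the missing-year case.

```python
import re

# Matches a 4-digit year in parentheses at the very end of the title (trailing spaces allowed).
YEAR_PATTERN = re.compile(r'\((\d{4})\)\s*$')

def extract_year(title):
  """Return the release year as an int, or None if the title has no trailing (YYYY)."""
  match = YEAR_PATTERN.search(title)
  return int(match.group(1)) if match else None

def decade_of(year):
  """Bucket a year into its decade, e.g. 1995 -> 1990."""
  return None if year is None else (year // 10) * 10

titles = [
  'Toy Story (1995)',
  'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
  'Ghostbusters (a.k.a. Ghost Busters) (1984) ',
  'A Title Without A Year',  # hypothetical title, not from the dataset
]

years = [extract_year(title) for title in titles]
decades = [decade_of(year) for year in years]
print(years)    # [1995, 1981, 1984, None]
print(decades)  # [1990, 1980, 1980, None]
```

The decade could then be one-hot encoded into the movie feature vector, and a matching "favorite decade" slot added to the user context.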

Appendix

Visualizing Movies in 2D

plt.figure(figsize=(15,15))
for movieId in top_movies[0:25]:
  i = item_emb_movieId_to_i[movieId]
  plt.scatter(ITEM_EMBEDDING_LOOKUP[i,0].data, ITEM_EMBEDDING_LOOKUP[i,1].data, s=200)
  plt.text(ITEM_EMBEDDING_LOOKUP[i,0].item(), ITEM_EMBEDDING_LOOKUP[i,1].item(), movieId_to_title[movieId][0:20], ha="center", va="center", color='black')
plt.grid('minor')

Example Training Runs

Below we will train the model on different datasets and with different model parameters.

Dataset	Model Size	Observation
Small	Small	Overfitting
Small	Medium	Extreme Overfitting
Medium	Small	Loss is not great. Could keep learning but probably not worth it
Medium	Medium	Training loss getting better but slight overfitting
Medium	Large	same as above
Large	Small	No overfitting, but not learning well
Large	Medium	No overfitting, but hitting a wall
Large	Large	Looking much better. Loss much lower than previous runs

Applying Recommendations to Other Domains

NOTE: The user/item features below might be subpar for some domains. These ideas are how I would approach recommending items in each domain as a start. If you want more advanced descriptions, companies usually post technical blogs about how they recommend content.

Domain	User Features	Item Features
Online Shopping	Purchases Returns (negative signal) Reviews/Ratings Favorite Categories Country/City/State Yearly Purchase Count Yearly Purchase USD	Brand Title Price Category Reviews Number of Returns
Books	Liked/Disliked Books Liked/Disliked Genres Liked/Disliked Authors	Title Genres Book text (bag of words) Published Year
Music Streaming	Listened to Songs (with counts) Listened to Artists (with counts) Favorite Genres Time of Day of Listens Day of week of listens Country/State/City Language	Title Genre Artist Listens Release Date ADVANCED: Embedding of Audio File
Social Media	Liked Posts Posts with > X seconds hover Following Account Ids Followers Account Ids Country/State/City Language Comments	Account Id Bag of Words of Text/Caption Views per hour Likes Comments Comments Text ADVANCED Embedding of Image
Hotel Bookings	Past Bookings: Location Past Bookings: Hotel Favorite Hotel Brands Favorite Amenities Location Booking Count Viewed/Clicked Locations Viewed/Clicked Hotels Country Language Bookings in past year USD Spend in past year	Location Brand Price Stars Reviews Amenities
Ads	Ad Purchase History Ad Click History Ad Spend USD last year Favorite Brands Country/City/State Age Gender Dependent on site serving the ads: Interests Page Clicks Searches	Brand Product Categories Product Price(s) Clicks Click thru Rate Purchases Impressions Long Views

35 Books that Changed My Life

2020-09-04T14:05:14+00:00

Reading is one of the most important parts of my life, and has been since I was a little kid. In this post, I want to list out all the books that I feel have changed my life for the better. I won’t summarize or say what the book is about, but only briefly explain how each book changed me.

This is also not a list of my favorite books (although many here I would count as such). Instead, these are the books that I felt changed me the most after finishing them.

You can find all the books I have read here.

Personal Development

How to Win Friends and Influence People

For teaching me how to change from being one of the most antisocial kids in middle and high school to being called a very sociable extrovert.
Siddhartha

For teaching me that extremes are never the answer in the search for bettering yourself, and it’s better to just live simply.
Talent Is Overrated

For teaching me that it’s not just how many hours you put into something, but that you need deliberate practice.
Grit: The Power of Passion and Perseverance

For teaching me that hard work beats talent that doesn’t work.
The Curmudgeon’s Guide to Getting Ahead

For teaching me no-nonsense, non-PC practical advice for life.
Springboard to Success

For teaching me how to organize and reflect about what’s important to me.
Models: Attract Women through Honesty

For teaching me that people (not just romantic interests) are attracted to people who are honest with who they are, are open with what they want, and cut the bullshit.
The One Thing

For teaching me to not multitask with my goals and self-improvement, and instead hyper-focus on a small number of goals at a time.

Non-Fiction

The Rise of Theodore Roosevelt

For teaching me what it means to be a ‘jack of all trades’ kind of person and how much it is possible to accomplish in a short lifetime.
Yes Man

For teaching me to say ‘Yes’ to most things. I actually tried this for a year and it changed my life drastically.
Man’s Search for Meaning

For teaching me what is important in life when you have nothing to live for.
Steve Jobs

For teaching me that world builders in their specialized fields can be assholes in their personal lives, and to keep them separate when judging someone in either.
Thinking Fast and Slow

For teaching me to be aware that our brains are not rational and we let emotions make a lot of ‘gut’ decisions (for better and worse).
One Up On Wall Street

For teaching me to try and get opinions on things, not just from the experts in that field, but people affected by it, because the experts are often trapped in a bubble.
The Selfish Gene

For teaching me to not take life so seriously because at the end of the day, we are all just vessels being used by our genes for survival.
What It Means to Be a Libertarian

For providing me with a baseline value system on being a lowercase ‘l’ libertarian.
Hackers

For teaching me about the mindset of the geniuses that created the computer and internet revolution that we all enjoy today.
Economics in One Lesson

For teaching me the basics of economics (caveat: libertarian spin).
Nothing to Envy

For teaching me that no matter how bad things get here, life is still ‘great’ compared to most other places.
Blood, Sweat and Pixels

For teaching me how brutal the game design industry is and persuading me to switch from Game Design to Computer Science.
The Glass Castle

For teaching me how terrible parents can be and to make sure I never end up like that.
Cracking the Coding Interview

For teaching me that getting into top tech companies is doable with hard work and hundreds of hours of practice.
There are No Children Here

For teaching me to have more compassion for people dealt the poverty hand at birth and to accept that no matter how much help they might get, they often can never escape.
Bad Blood

For teaching me how many bullshitters there are in Silicon Valley, and how they operate.
Why Information Grows

For teaching me how to see literally everything in a completely new way (as ways of organizing information).
The Prize

For teaching me how to see the world through the lens of a single hyper valuable resource (in this case: oil).

Fiction

Flowers for Algernon

For teaching me that it’s ok to become stressed and lost when you undergo a lot of change very quickly.
The Fountainhead

For teaching me to find meaning in work, the value of creation, and to not live/create for what society expects of you, but for what you find valuable and worthwhile.
East of Eden

For teaching me that some people are just plain bad, and that’s ok, but more importantly to look for the good in people instead of always searching for the bad.
Angle of Repose

For teaching me to not hold grudges and forgive people.
Shogun

For teaching me that there exist cultures that are so different from what I’m used to, but I can still respect them even if I don’t understand them. Also made me fall in love with Japan.
The Subtle Knife

For teaching me to confront my fears about death.
A Farewell to Arms

For teaching me how painful love can be.
The Lord of the Rings 1-3

For teaching me what a perfect journey and reading experience is.
The Count of Monte Cristo

For teaching me that revenge is often deserved and easy to cheer for.

Implementing Algorand Agreement

2019-01-04T14:05:14+00:00

By: Eric deRegt and Nick Greenquist

Introduction

Bitcoin suffers from several technical problems. It is wasteful with significant energy usage and high fees. There is a high concentration of power as a few mining pools control the flow of money. The miners have low margins and are in known locations, which makes them susceptible to corruption. The ledger has ambiguity because of the possibility of forks as demonstrated by the emergence of Bitcoin Cash. There are real issues with scalability. It is unclear how the system will scale if millions of users are added to the system. Finally, there is a long latency because you have to wait a number of blocks before you can feel confident that your transaction is permanent.

In Algorand ¹, there is a single blockchain. There are no forks or proof of work. This is achieved by the Algorand Agreement protocol, which guarantees agreement and consistency. There are several advantages to Algorand’s approach. The set of possible commands is smaller than Bitcoin’s, which speeds up computation. There is true decentralization as no set of users has exogenous power. Payments are final because the probability of a fork is 10^-18. Scalability is bounded by the network latency. Finally, security is achieved against adversaries under extreme conditions.

Algorand selects random users known to the entire world. These users are in charge of proposing the next block and propagating the transactions to a small, random committee. The committee members are selected through a lottery where each member votes for himself. This makes selection both fast and random.

GitHub Link

The code for Algorand can be found here: Algorand

Algorand Overview

Algorand ¹ is a cryptocurrency designed to scale to millions of users and confirm transactions in under a minute. Algorand seeks to overcome several challenges: preventing Sybil attacks, scaling to millions of users, and being resilient to denial-of-service attacks and other actions from an adversary that can coordinate with Byzantine nodes. Sybil attacks are combated by giving each user a weight based on their monetary stake in the protocol. If more than ⅔ of the stake is controlled by honest users, Algorand will reach consensus while avoiding forks and double-spending. The protocol achieves scalability through committee selection: rather than every node participating in consensus, a small number of nodes are selected at random. To avoid targeted attacks on committee members, they are selected privately through sortition, and committee membership changes between rounds and steps.

Each user in the system has a public and private key. A transaction is a payment from one user to another and involves a user signing the message with its private key. As in Bitcoin, these transactions are broadcast to peers through a gossip protocol. Each user picks a small random set of peers to propagate messages to. Upon receiving a message, a peer will check that the provided signature is valid before gossiping the message to other users. These transactions are put into a log, which forms the basis for new blocks. Each user takes these transactions and prepares a block in case they are chosen for block proposal.

Algorand operates in asynchronous rounds, and each round produces a new block which is appended to the blockchain. In each round, users are selected to propose blocks and to serve on a committee that reaches consensus. These selections are made using an algorithm called cryptographic sortition. Sortition chooses a random set of users based on their weighted stake in the system. Verifiable random functions (VRF) are used to achieve this randomness. VRF takes a public seed and the role (proposer, committee member, etc.) for the sortition and returns a hash and proof. Selected users propagate the return values to peers through the gossip protocol. Other users can use public keys to verify that the hash and proof correspond to a given user. An initial seed is provided to all users and a new one is calculated for subsequent rounds by proposers during the agreement protocol. Since users are selected according to their stakes, a user with a high stake may be chosen multiple times in a given round or step.
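
To give a feel for stake-weighted selection, here is a heavily simplified toy sketch. It is not the sortition algorithm from the paper and not the code in our repository: a plain SHA-256 hash stands in for the VRF (so there is no private key, no proof, and anyone could compute anyone else's result), and each unit of stake gets its own independent draw. The function names, user names, and stake values are all made up for illustration.

```python
import hashlib

def hash_fraction(*parts):
  """Map the inputs to a pseudo-random fraction between 0 and 1 using SHA-256."""
  digest = hashlib.sha256('|'.join(str(p) for p in parts).encode()).digest()
  return int.from_bytes(digest, 'big') / 2**256

def toy_sortition(seed, role, user_id, stake, total_stake, expected_size):
  """Return how many times user_id is selected for role (0 means not selected).

  Each unit of stake is drawn independently with probability expected_size / total_stake,
  so users holding more stake are selected more often, possibly more than once.
  """
  threshold = expected_size / total_stake
  return sum(
    1 for unit in range(stake)
    if hash_fraction(seed, role, user_id, unit) < threshold
  )

# Hypothetical users and stakes.
stakes = {'alice': 50, 'bob': 30, 'carol': 15, 'dave': 5}
total = sum(stakes.values())

selected = {
  user: toy_sortition('seed-0', 'committee', user, stake, total, expected_size=10)
  for user, stake in stakes.items()
}
print(selected)
```

Because the draw depends only on the seed, role, and user, every node that repeats the computation gets the same answer, which is the property the real VRF proof provides in a verifiable and private way.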

Sortition is set above a certain threshold to guarantee with high probability that there will be greater than 1 proposer in each round. Since a block can be very large, users gossip two types of messages - one with just their priority and proof and one with the block. Users will select the highest priority proposer as their leader for the current round.

Algorand originally reached consensus using an algorithm called BA*. In the first phase of BA*, users reduce block agreement to two options. In the second phase, users agree on a proposed block or agree on an empty block. Both phases have several steps, where committee members vote for a value and the other users count those votes. BA* completes the first phase in two steps. The second phase is completed in two steps if the proposer is honest and 11 steps with a malicious proposer. It is unclear if BA* was successfully implemented by the authors, but after some time another paper was released ², which introduced a new consensus algorithm called Algorand Agreement. We are not sure what other changes were made to the protocol in the time since the release of the original Algorand paper, but we chose to use Algorand Agreement in our implementation.

Algorand Agreement Overview

Algorand Agreement ² is a Byzantine agreement protocol that uses leader election and can operate in a partitioned environment. The protocol uses a hash function and digital signature (SIG), which returns the user id, message, and signed message. The signatures are unique for each public and private key pair.

There are a few assumptions for Algorand Agreement. Adversaries can coordinate optimally, but they cannot break the hash function or forge signatures. The set of all players is N, the cardinality of N is n = 3t + 1, and the number of malicious players is t. All players have access to a public random string R, which has been selected randomly and independently of the players’ public keys.

Agreement Protocol

Two communication settings are considered in the paper. In the first, nodes communicate over a synchronous network. Honest users send messages that are received by all other honest users within a given step. All messages seen by a user i before the start of a step are seen by all honest users at the end of the step because i will propagate all messages she has seen. In the second setting, nodes communicate through a propagation network. Nodes have timers with the same speed. The network can be arbitrarily partitioned and the adversary has full control during this time period. Messages are received by honest users within a known time whenever the network is not partitioned. We focused on the second communication setting for our implementation.

There are three types of messages that are sent in the protocol: next votes, soft votes, and cert votes. Additionally, users will send their credential (SIG(R, p)). If multiple nodes propose a block for a given round, a leader is chosen by iterating over SIG(R, p) and choosing the hash with the smallest value among valid participants.
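
As a minimal sketch of that leader rule: hash every verified proposer's credential and take the smallest hash. The byte strings below are placeholders standing in for real SIG(R, p) values, and `credential_hash` and `pick_leader` are names invented for this example rather than functions from our repository.

```python
import hashlib

def credential_hash(credential):
  """Hex SHA-256 of a credential; equal-length hex strings compare like the numbers they encode."""
  return hashlib.sha256(credential).hexdigest()

def pick_leader(credentials):
  """credentials: dict of user_id -> credential bytes, for verified proposers only.

  Returns the user_id whose credential hash is smallest, or None if nobody proposed.
  """
  if not credentials:
    return None
  return min(credentials, key=lambda user_id: credential_hash(credentials[user_id]))

# Placeholder credentials (a real one would be a signature over R and p).
credentials = {
  'node-1': b'sig-from-node-1',
  'node-2': b'sig-from-node-2',
  'node-3': b'sig-from-node-3',
}
leader = pick_leader(credentials)
print(leader)
```

Since every honest node sees the same set of valid credentials once the network is not partitioned, they all arrive at the same leader without any extra messages.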

There are five steps in each period p that run sequentially, described as follows.

Communication Setting 2: Steps

Step 1 is Value Proposal, which starts at time 0. Committee members propagate their block value and credential if its the first period or if they have received 2t + 1 next-votes for null in period p - 1. If they have received 2t + 1 next-votes for a value that is not null in p - 1, they propagate that value along with their credential for period p.

Step 2 is called the Filtering Step and takes place at time 2ƛ. In this step, a user i identifies his leader from all nodes that have propagated values and that are verified for that round. If the user has received 2t + 1 next-votes for null in p - 1, he will soft-vote his leader’s proposed value. If he has received 2t + 1 next-votes in p - 1 for a value that is not null, then he will soft-vote that value.

Step 3 is the Certifying Step and runs for clock times 2ƛ-4ƛ. If a user sees 2t + 1 soft-votes for a non-null value v, that user cert-votes v.

Step 4 is the First Finishing Step at time 4ƛ. If a user has certified a value v for period p, she next-votes v. If she has seen 2t + 1 next-votes for null in period p - 1, she next-votes null. Otherwise, she next-votes her starting value.
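The decision in Step 4 is a simple three-way branch. A minimal Python sketch, with `None` standing in for the null value and hypothetical argument names, might look like this:

```python
def first_finishing_vote(certified_value, saw_null_quorum_prev_period, starting_value):
    """Return the value to next-vote in Step 4 (None stands in for null).

    certified_value: value this user cert-voted in the current period, or None.
    saw_null_quorum_prev_period: True if 2t + 1 next-votes for null were
        seen in period p - 1.
    starting_value: the user's starting value for this period.
    """
    if certified_value is not None:
        return certified_value
    if saw_null_quorum_prev_period:
        return None
    return starting_value
```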

Step 5 is the Second Finishing Step, which a user enters after 4ƛ and stays in until she can finish the period. If she sees 2t + 1 soft-votes for a non-null value, she will next-vote that value. If she sees 2t + 1 next-votes for null in p - 1 and has not certified in p, then she next-votes null.

These periods continue until the Halting Condition is reached. The Halting Condition is checked any time a cert-vote is received or cast. If a user sees 2t + 1 cert-votes for a value v, they append that value to their blockchain and move to the next round. These cert-votes can be from any period as nodes cannot ever change what value they will cert-vote once casting this type of vote.
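As a rough illustration of the Halting Condition, the following Python sketch counts distinct cert-voters per value and reports a value once 2t + 1 of them agree. The class name `CertVoteTally` is hypothetical, and signature checking is assumed to happen before `add` is called.

```python
from collections import defaultdict


class CertVoteTally:
    """Tracks cert-votes per value, across periods, for a single round."""

    def __init__(self, t):
        self.threshold = 2 * t + 1
        self.voters = defaultdict(set)  # value -> ids of users who cert-voted it

    def add(self, voter, value):
        """Record a cert-vote; return the value if the halting condition holds."""
        self.voters[value].add(voter)  # a set ignores repeated votes from one user
        if len(self.voters[value]) >= self.threshold:
            return value
        return None
```

The tally is keyed by value only, not by period, which mirrors the observation that cert-votes from any period count toward halting.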

Implementation

Our goal for the project was to create a working implementation of Algorand based on algorithms described in the Algorand¹ and Agreement² papers. We used the details from the Algorand¹ paper to construct our overall structure and our algorithms for sortition, gossiping, and block proposal. We used Algorand Agreement² for consensus and used Communication Setting 2 as described above.

Assumptions

We made a number of assumptions. Honest nodes are required to have at least 2t + 1 stake in the system. Nodes cannot lie about their userId or spoof messages. Nodes cannot change the result of sortition or verifySort. Timers are not synched, but they move at the same speed. Our sortition always selects two users, instead of implementing a probabilistic function with a targeted number of selected users. All users use the same sha256 hash function when using blocks and signatures. Every RPC message makes it to honest users. There are no retries for votes or propagate block messages.

Architecture

We organized our project in a similar manner to Lab 2: Raft. Each server runs identical code found in serve.go. Each node also initiates itself using the code in main.go. This code is responsible for gathering all the peers in the system, generating the node’s genesis block, connecting each server to their respective bcStore, and then starting the server.

bcStore stores the blockchain and serves as the connection between the server, client, and the blockchain itself.

In order to simulate a ‘user’ adding transactions to the blockchain, we separated the ‘add transaction’ request functionality into client code. The client has a one-to-one relationship with a single server. The client can send transaction requests to a server through its port, and can also get the current blockchain back with another request.

In addition to the main client, server, and bcStore relationship, there are a few other files that help keep the logic organized. All of the code for handling the 5 steps and the halting condition is found in agreement.go. All helper functions such as preparing block objects, hashing values, signing messages, creating SIG structs, generating the committee from sortition, verifying sortition, selecting the minimum leader hash, and initializing initial stake, are all stored in utils.go.

Details

All nodes who join the network create a bcStore which keeps a channel of commands they receive from clients and a blockchain data structure composed of blocks. Nodes connect to other known peers. The Algorand server runs on a goroutine and listens for several types of messages that can trigger different parts of the protocol and sends messages to other nodes.

Nodes keep track of several pieces of internal state. Each node keeps track of its private and public keys, what round, period, and step they are in, and their temporary and proposed blocks. Additionally, there is state for the periods in the agreement protocol. PeriodState includes values that have been proposed and who proposed them and all of the next-votes, cert-votes, and soft-votes used in agreement. We keep one state object for the current period and one for the previous period so that we can monitor the number of votes and appropriately terminate the agreement protocol when it is safe to add a block to our blockchain.

There are four RPC calls in our implementation - AppendTransaction, ProposeBlock, Vote, and RequestBlockChain. In order to pass messages between nodes using the four RPCs, we needed to create channels for both the RPC arguments and responses.

AppendTransactions are handled easily, where nodes simply return true to the broadcaster. Nodes also immediately respond true to the client after receiving a request to add a transaction. We left the client responsible to retry transactions if they later see that their requests did not make it into the updated blockchain. To enable this, clients can send a GET command to their server to receive the server’s current blockchain.

For the ProposeBlock RPC, we also set up a channel to listen for any ProposeBlock messages. Whenever a message is received, the receiver first checks if the sender is approved for this round by calling verifySort. If the sender is approved, the receiver adds the proposed value and block to their internal state that maps proposer credentials to their proposed value. Nodes always return true as long as they receive this message.

Handling incoming Vote messages requires a bit more logic. On top of extracting the correct type, nodes have to update their PeriodState with vote counts for specific values after checking that a sender has not already voted for that period.

RequestBlockChain requests are handled by simply passing the entire BlockChain into the response channel. The requester will then verify the returned blockchains in the response channel listener in order to decide if they need to replace their blockchain and update their current round and state.

In between rounds, each user keeps track of a TemporaryBlock, which they will use if selected for the committee. When a node receives a transaction from a client, it propagates that transaction to all other nodes through AppendTransaction. The nodes that receive these messages will add these transactions to their respective TemporaryBlocks. When we enter a new round, a proposer will compare the last block in their blockchain with their proposed block and remove any duplicate transactions before proposing a new block.

We created two timers to deal with separate problems: a RoundTimer and an AgreementTimer. The RoundTimer signals when users should enter into the first step of the agreement protocol for the next round. At this point, users will check if they have been selected by sortition to propose a block, and if so, they generate their credentials and broadcast a ProposeBlock message. ProposeBlock takes a block, credential, value, round, and peer as input and is sent to all peers. The value in this message is simply the hash of the block’s object byte memory.

The AgreementTimer signals when users should proceed between steps. It will first go off at the beginning of step 2. At each step we call a step function, which takes in the current PeriodState, last PeriodState, and required number of votes. The return value is either a value to propagate as Vote message, or a null value which is not acted on. Vote values are propagated using the Vote RPC. Vote takes the type of vote, round, and period as inputs and returns a boolean for success. Our Vote channel updates our vote state for the current and last period, keeping track of who has voted for what at each round and period. In the cert-vote branch, we also check for the halting condition. If we reach the halting condition, agreement has been reached and the block is appended to the node’s blockchain and the state is updated for the next round.

The final thing we considered was how to catch users up to new blocks. When a user receives a block proposal or vote, they can check if the sender is in a later round. If they are, the node may be in a state that is not up to date. This could occur if a node joins the network after it initially starts or if it fails and comes back online. In this case, the node will call the RequestBlockChain RPC to start the process of catching up to the current block so that they can rejoin the consensus process. Upon receiving a RequestBlockChainChan message, a node will respond by sending the current version of its blockchain. The node that is behind checks for the longest blockchain they have seen and verifies that all blocks are valid by using verifySort on each block’s stamped credential.

Tradeoffs

We made a number of design decisions in our implementation that were required due to nonexistent or vague details in the papers. For the design decisions that were not described in great detail in the papers, we tried our best to look at other blockchains and come to a reasonable decision. A couple of examples of this were introducing the round timer to kickstart agreement and adding the process for nodes to catch up to the current round if they see they are behind.

We also had to simplify a couple parts of the system to narrow the scope of our implementation. The first area we simplified was the use of VRFs. Instead of using VRF for sortition, we used the round as a public seed and used a shuffle function we wrote to select users. Each user selects the same random index from a sorted list of userId’s using the same random seed to ensure equal committees across all nodes. We also didn’t use VRF for verifySort. Instead we used the same committee selection algorithm to verify the users had been selected. We assumed that stake does not change, but that it is randomly set at the beginning using peerId as seed. Two users are always selected by sortition for the block proposal stage. Finally, in our proposal we planned on implementing smart contracts on top of Algorand. We didn’t have enough time to implement smart contracts.
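The simplified, VRF-free sortition described above can be sketched in a few lines of Python. This is an illustration under assumptions, not our Go code: `select_committee` and `verify_sort` are hypothetical names, stake weighting is ignored, and Python's `random` module stands in for the shuffle (it is not cryptographically secure).

```python
import random


def select_committee(user_ids, round_number, size=2):
    """Pick `size` proposers deterministically from the round number."""
    rng = random.Random(round_number)  # the round acts as the public seed
    ordered = sorted(user_ids)         # same starting order on every node
    rng.shuffle(ordered)
    return ordered[:size]


def verify_sort(user_id, user_ids, round_number, size=2):
    """Re-run the selection and check that `user_id` was chosen."""
    return user_id in select_committee(user_ids, round_number, size)
```

Since every node sorts the same ids and seeds the generator with the same round, all nodes compute identical committees without exchanging any messages.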

Challenges

There were several challenges we faced. Most were the result of the paper leaving out many implementation details.

Our biggest challenge was dealing with new users. We define new users as nodes that decided to join the network late or nodes that had recovered after failing. We decided that a new node should request the blockchain if they find out they are behind on the current round. This is easy to check as we send the round with each ProposeBlock and Vote message. We implemented a correct feature where users request, collect, and verify blockchains if they discover they are behind the current round. However, the real difficulty was getting the user to collect all ProposeBlock and Vote messages they missed out on, and also catch up to the correct period and step as the rest of the honest nodes.

Another challenge was in trying to implement VRF. It quickly became clear that it was not feasible to implement this function on our own. There were a number of cryptographic primitives that we were not well versed in and we didn’t feel we had enough time to implement a robust version of this algorithm. We looked at using the LibSodium C library used by the Algorand team, but ran out of time.

The original paper used a different agreement algorithm called BA*. We had spent some time understanding and working through implementation ideas for this protocol before learning that the Algorand team had released a new Algorand Agreement protocol. We decided that it made sense to use the newer protocol. However, it took us a while to understand the new algorithm and how to connect it to the main paper.

Testing

We tested our implementation in several ways. Initially, we started off by just trying to get a blockchain that sent transactions to peers, gathered blocks, and added blocks to the blockchain without any consensus mechanism. After this, we implemented Algorand Agreement and tested our implementation on four nodes manually using a client package similar to the one from Lab 2. We then ported the launch tool from Lab 2 and used this to test the system on a greater number of nodes with more transactions and allowing for node failures.

Correctness was tested using multiple nodes running at the same time. We sent transactions on different threads to multiple nodes that were running the Algorand server. We checked that the blockchain contained the same blocks after several rounds of the protocol. Blocks were analyzed to make sure hashes and transactions matched across nodes. Additionally, we tested with Byzantine behaviour by making certain peers always behave in a certain way. For example, with 4 nodes, we set peer0 to always propose their own block, only vote for their own values, and ignore all incoming messages from other nodes. The protocol was able to continue safely with the remaining 3 nodes despite peer0 trying to cause mayhem.

Liveness was another goal we tested for. We found that as long as 2t + 1 nodes are up at all times, Agreement rounds terminate eventually.

Performance was found to be very fast. Although we did not have the chance to test our implementation on hundreds of thousands of nodes like the authors of the Algorand paper did, we found that Agreement consistently completed very fast even when testing on Kubernetes using the launch tool with a few dozen nodes.

Conclusion

Algorand promises an extremely fast consensus protocol that would allow for a massively scalable and partition resilient cryptocurrency. Through implementing Algorand Agreement², we were pleasantly surprised to find that the basic protocol is correct in a stable and small network of peers. However, the Algorand papers are severely lacking in numerous implementation details that are needed for even a minimum viable product with Algorand. The authors of the Algorand papers promise to make the Algorand code open source. However, only the VRF function has been released as of today. We hope the authors release more of the code in the future and uncover the missing pieces needed to make Algorand practical and useful in real use cases.

References

Yossi Gilad, Rotem Hemo, Silvio Micali, Georgios Vlachos, Nickolai Zeldovich. Algorand: Scaling Byzantine Agreements for Cryptocurrencies, https://people.csail.mit.edu/nickolai/papers/gilad-algorand-eprint.pdf
Jing Chen, Sergey Gorbunov, Silvio Micali, Georgios Vlachos. Algorand Agreement: Super Fast and Partition Resilient Byzantine Agreement, https://eprint.iacr.org/2018/377.pdf

GPU Accelerated Matrix Factorization for Recommender Systems

2019-01-02T14:05:14+00:00

By: Nick Greenquist and Doruk Kilitcioglu

Introduction

Matrix Factorization (MF) is a popular algorithm used to power many recommender systems. Efficient and scalable MF algorithms are essential in order to train on the massive datasets that large scale recommender systems utilize. Graphics Processing Unit (GPU) technology has become very popular in recent years and has become widely used in machine learning. The massive parallelism GPUs offer creates an opportunity to develop an accelerated MF algorithm. This blog post presents cu2rec, a matrix factorization algorithm written in CUDA. cu2rec implements a parallel version of Stochastic Gradient Descent (SGD) to solve large scale MF problems. cu2rec utilizes multiple advanced techniques to harness better performance from a GPU. These include aggressive use of constant memory for hyperparameters and registers for heavily reused values, a sparse matrix data structure, a reduction sum total loss kernel, parallel lock-free updating of feature weights with minimized global memory writes, and fairness across weight updates using user index striding. With a single NVIDIA GPU, cu2rec can be 10 times faster than state of the art sequential algorithms while reaching similar error metrics.

The code and the instructions on how to run it can be found on the GitHub page.

How It Works

In this section of the post, we will explain what GPUs are, how they work, and how they can be used to accelerate a powerful algorithm commonly used for recommender systems: matrix factorization.

How do GPUs Work

Today, GPUs are being used more and more to parallelize algorithms that need to run on big data. GPUs have much more compute power than CPUs and work well performing the same instruction on multiple pieces of data (SIMD)¹. GPUs originally were used for heavy graphical work but have become a staple of large machine learning models². Because of the explosion of GPU usage across much of machine learning, we wanted to see if they could be useful in the sub-field of recommender systems.

GPUs are most useful when the task is computationally intensive, there are many independent computations, and the computations are similar. The architectural paradigm that most benefits from this is Single Instruction Multiple Data (SIMD), where a single instruction is executed for multiple data points.

GPUs implement this SIMD architecture, but they do not devote the whole GPU to that instruction. Rather, only 32 execution units (out of 5000 in recent GPUs) have to execute the same instruction. The best practice is to structure your code to be as much SIMD as possible.

CUDA

CUDA is the parallel computing platform and programming model built by NVIDIA as a general purpose way of interacting with NVIDIA GPUs. It can be used with both C and C++, and gives the user API calls for important GPU operations such as copying data to and from GPU memory and launching GPU functions called kernels.

Each kernel defines a function that is called per execution unit, called a thread. These threads are bundled into groups of 32, called warps. Each thread in a kernel executes the same function, but only the threads within a warp need to be simultaneously executing the same instruction. Furthermore, not all threads need to run concurrently, but they all need to finish before a kernel is done.

For more information about CUDA, you can check Kirk et al¹.

Matrix Factorization for Recommender Systems

Matrix Factorization decomposes a rating matrix $R$ of shape $m\times n$ into two feature matrices, $P$ and $Q$. $P$ ($m\times f$) and $Q$ ($n\times f$), where $f$ is the number of factors, are the learned feature matrices. The goal of MF is to train the best $P$ and $Q$ matrices such that $R\approx PQ^T$, as shown in the figure below. In state of the art models, it also involves learning $U$ ($m\times 1$ matrix of user biases) and $I$ ($n\times 1$ matrix of item biases), with an additional global bias $\mu$ which is set to the mean of all ratings.

MF is incredibly popular and has been used for serving recommendations by companies such as Amazon and Netflix³. In the era of big data, the datasets that MF is being asked to tackle are getting orders of magnitude larger than even the Netflix Dataset⁴. In order to keep up, MF algorithms need to be fast, scalable, and able to handle massive amounts of data.

For more information, please look at these blog posts for a technical explanation of Matrix Factorization and an application of Matrix Factorization for book recommendation.

SGD for Matrix Factorization

For the task of rating prediction, we use the biased SVD popularized by Koren et al.⁵. For each user $u$ and item $i$ pair, the estimated rating $\hat r_{ui}$ and the error for that rating $e_{ui}$ are defined by:

\[\begin{eqnarray} \hat{r}_{ui} &=& \mu + b_u + b_i + q^T_ip_u\\ e_{ui} &=& r_{ui} - \hat{r}_{ui} \end{eqnarray}\]

where $b_u$ is the user bias (which is a single float), $b_i$ is the item bias (also a single float), $\mu$ is the global mean, and $p_u$ and $q_i$ are the user and item vectors in $P$ and $Q$ respectively.
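As a small worked example of the two equations above, here is a plain Python sketch of the biased prediction and its error. The function names are ours for illustration, and the vectors are ordinary lists of floats.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Estimated rating: global mean + user bias + item bias + dot(q_i, p_u)."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def rating_error(r_ui, mu, b_u, b_i, p_u, q_i):
    """Error e_ui: true rating minus estimated rating."""
    return r_ui - predict(mu, b_u, b_i, p_u, q_i)
```

For instance, with $\mu = 3.5$, $b_u = 0.25$, $b_i = -0.5$, $p_u = (1, 2)$ and $q_i = (0.5, 0.25)$, the dot product is $1.0$, so the prediction is $4.25$; a true rating of $5$ gives an error of $0.75$.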

Our total loss function is the sum of squared errors over all ratings, plus the regularization terms for the 4 matrices, which are the squared Frobenius norms of the matrices:

\[\begin{equation} L = \sum_{(u,i)}{(r_{ui} - \hat{r}_{ui})^2} + \lambda_p {\lVert}P{\rVert}_F^2 + \lambda_q {\lVert}Q{\rVert}_F^2 + \lambda_{u} {\lVert}U{\rVert}_F^2 + \lambda_{i} {\lVert}I{\rVert}_F^2 \end{equation}\]

For SGD, we use the loss per item:

\[\begin{equation} L_{ui} = e_{ui}^2 + \lambda_p{\lVert}p_u{\rVert}_2^2 + \lambda_q{\lVert}q_i{\rVert}_2^2 + \lambda_ub_u^2 + \lambda_ib_i^2 \end{equation}\]

As with regular gradient descent, the update equation of any parameter $x$ with learning rate $\eta$ is defined by

\[\begin{equation} x \leftarrow x - \eta\frac{\partial L}{\partial x} \end{equation}\]

In our case, the parameters are each and every element of the $P$, $Q$, $U$, and $I$ matrices. Below are the respective partial derivatives (with the constant factor of 2 absorbed into the learning rate) that are used with the update equation above:

\[\begin{eqnarray} \frac{\partial L_{ui}}{\partial p_u} &=& -(e_{ui}q_i - \lambda_pp_u)\\ \frac{\partial L_{ui}}{\partial q_i} &=& -(e_{ui}p_u - \lambda_qq_i)\\ \frac{\partial L_{ui}}{\partial b_u} &=& -(e_{ui} - \lambda_ub_u)\\ \frac{\partial L_{ui}}{\partial b_i} &=& -(e_{ui} - \lambda_ib_i) \end{eqnarray}\]
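Plugging these derivatives into the update equation gives one SGD step for a single rating. The sketch below is a sequential, plain Python illustration (not the CUDA kernel); `sgd_step` and its argument names are hypothetical.

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, eta, lam_p, lam_q, lam_u, lam_i):
    """Apply x <- x - eta * dL/dx to b_u, b_i, p_u and q_i for one rating."""
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    e = r_ui - pred
    # Both vector updates read the OLD p_u and q_i, so build new lists.
    new_p = [p + eta * (e * q - lam_p * p) for p, q in zip(p_u, q_i)]
    new_q = [q + eta * (e * p - lam_q * q) for p, q in zip(p_u, q_i)]
    new_b_u = b_u + eta * (e - lam_u * b_u)
    new_b_i = b_i + eta * (e - lam_i * b_i)
    return new_b_u, new_b_i, new_p, new_q
```

With a small learning rate, repeating this step on the same rating moves the prediction toward the true rating while the regularization terms keep the weights from growing without bound.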

Parallel SGD Challenges

Parallel SGD is a hotly debated topic, but it is definitely being used all around the globe. Tan et al.⁶ claim that each iteration of SGD is inherently sequential, whereas Recht et al.⁷ and Yun et al.⁸ argue that in the context of sparse matrix factorization in recommender systems, multiple SGD updates can be carried out simultaneously in parallel.

In our implementation, we utilized SGD similar to the ideas outlined in Recht et al.⁷, but modified it to be used with GPUs. Because the matrices we are dealing with are very sparse, we were able to parallelize SGD updates across multiple users and items. We chose to parallelize over users, whose number in reasonably sized datasets such as ML-20M is much higher than the number of CUDA cores in modern GPUs. As a result, we can utilize our GPUs by choosing a random item per user to update on.

We faced a few issues with this approach specific to GPUs, which warranted us to create our own flavor of the parallel SGD algorithm (see algorithm 1). The first issue we had was with multiple updates to the same item. We were running atomic operations for the updates to the item matrix, which meant that our code was slow, and worse, there were too many updates on the most popular items. This led to an overly drastic change for the item vectors, making it very hard to balance.

That is when we decided to get rid of the atomicity and have a race condition on the SGD updates, making it so that only one update per item would stick, and the previous update attempts would be void. We called this version of the algorithm Early Bird Loses the Gradient.

After analyzing our kernel, we noticed that the users with higher user ids had lower error than users with lower user ids. This was the result of how blocks are scheduled in CUDA: blocks with lower block ids (which have the lower user ids) get scheduled earlier than higher block ids. As a result, a lot of the item updates that were done by earlier users were being overwritten, and the algorithm was overfitting to the later users.

To combat this issue, we introduced user id striding with each iteration. Each thread handles user $(tid + stride \times iters) \bmod N$, where $tid$ is the thread id, $iters$ is the current iteration count, and $N$ is the total number of users. This results in fair item updates with respect to the users.
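To see why striding is fair, consider this small Python sketch of the thread-to-user mapping. It assumes the modulo is applied to the whole sum so the result is always a valid user index; `user_for_thread` is a hypothetical name.

```python
def user_for_thread(tid, iteration, stride, num_users):
    """User index handled by thread `tid` at a given iteration."""
    return (tid + stride * iteration) % num_users
```

Within one iteration the mapping is just a rotation, so every user is still handled by exactly one thread. Across iterations, the user handled by the earliest-scheduled thread keeps changing, which spreads the "overwritten update" penalty over all users.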

Another improvement came from not allowing multiple updates to the same item, because that results in more global memory accesses. For that, we added a binary array of all items, and whenever an item is updated, we set its updated value to true. Note that this is done with regular checks, not any costly $atomicCAS$, because it is not a problem if there is a race condition in a warp and it gets overwritten a couple of times. It still is empirically faster, and we call this version of the algorithm Early Bird Gets the Gradient (EBGTG).

Also worth mentioning is that we don’t need to compute the total loss every iteration. It is an expensive operation, and we only do it every couple of iterations to modify the learning rate if the SGD has stopped learning. Each thread in the SGD kernel just calculates the error for the selected user-item pair, resulting in a heavy speedup compared to calculating the error on the whole training set.

Implementation: cu2rec

In this section of the post, we will explain how we built cu2rec. We explain how to load in ratings data to train on, the important code that performs the matrix factorization and optimization, and also some advanced programming techniques we used to make cu2rec fast and accurate.

Data Preparation

Sparse Matrix Representation

For even decently large datasets, the full ratings matrix $R$ is too big to fit in memory. This is doubly true when we are dealing with GPU memory, as we cannot physically add more memory to a GPU.

In order to represent the matrix, we use the Compressed Sparse Row (CSR) format. CSR matrices are defined by three arrays: $indptr$, $indices$, and $data$. The item indices for user $u$ are stored in $indices[indptr[u]:indptr[u+1]]$ and their corresponding ratings are stored in $data[indptr[u]:indptr[u+1]]$. This allows for efficient indexing into the matrix given a user, which plays into how we construct our kernel.

When representing users with no ratings, we repeat the same value in the $indptr$ array, and handle that as a special case in our kernel.
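Here is a minimal Python sketch of building and indexing a CSR matrix from rating tuples. It assumes 0-based user and item ids and tuples already sorted by user; `build_csr` and `user_ratings` are illustrative names rather than cu2rec functions.

```python
def build_csr(ratings, num_users):
    """Build (indptr, indices, data) from (user, item, rating) tuples sorted by user."""
    indptr = [0] * (num_users + 1)
    indices, data = [], []
    for user, item, rating in ratings:
        indptr[user + 1] += 1          # count ratings per user
        indices.append(item)
        data.append(rating)
    for u in range(num_users):
        indptr[u + 1] += indptr[u]     # prefix sum turns counts into offsets
    return indptr, indices, data


def user_ratings(indptr, indices, data, u):
    """Return the (item, rating) pairs of user u."""
    start, end = indptr[u], indptr[u + 1]
    return list(zip(indices[start:end], data[start:end]))
```

A user with no ratings ends up with `indptr[u] == indptr[u + 1]`, so the slice is empty, which is the repeated-value case mentioned above.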

Input Data

Our program accepts CSV files that are formatted as $user\_id,item\_id,ratings$ tuples, wherein both $user\_id$ and $item\_id$ are sequential numerical ids starting from 1, and are sorted based on $user\_id$. For convenience, we provide a script that converts non-sequential ids to sequential ids, and a script that sorts the tuples based on $user\_id$.

The program also needs two different files, one as a training set with which the MF model is trained, and a test set with which the model is evaluated. We use a completely random split across all tuples for generating these files, and split the data into 80% training and 20% test set. We also provide a script to split the data.

Overview

The cu2rec code is organized into 3 main parts: reading in data into sparse matrix form and moving it to the GPU, using SGD and Loss kernels to train the MF model, and moving data back to the host in order to write the trained components to files.

When a user runs cu2rec, they need to supply the train and test CSV files. cu2rec starts by reading both files into a vector of Rating structs that live in the host memory. It then converts each vector of Ratings into a CSR representation. Next, it moves these matrices into the GPU’s memory. Once the memory allocation is complete, cu2rec then loads in the hyperparameters from a config file into constant memory.

Next, cu2rec calls a $train$ function that is responsible for using the sparse rating matrices to train a MF model. $train$ starts by initializing all of the components randomly using a normal distribution with mean of 0 and standard deviation of 1. For the $P$ and $Q$ matrices, these values are normalized by the number of factors the model will use. Next, all of these components are moved to the GPU’s global memory using standard CUDA API functions. Each is wrapped in a helper function to check for CUDA errors. In addition to moving needed memory to the GPU, cu2rec also initializes a CUDA Random object and other needed variables to handle an adaptive learning rate. Finally, the iterative training of the model can begin.

The main training loop in cu2rec does $totalIterations$ loops over two main steps: update components using SGD and compute total losses. cu2rec parallelizes over the number of users in the matrix for both steps. At the beginning of each iteration, the SGD Kernel is called (algorithm 1). The code for the SGD Kernel is below.

1) SGD Kernel

In order for SGD to compute updates to the feature matrices, we needed to create a function that kernels can use to compute a predicted rating using the current components. The code for the Prediction Kernel is below.

2) Prediction Kernel

After running SGD for one item for every user, the algorithm checks if it is time to test the updated model on the test ratings (algorithm 3). This is only done periodically, as computing the total losses is expensive because it needs to compute the error on all ratings, not just one rating per user such as in SGD. To do this, cu2rec first uses the Loss Kernel to compute the loss on every rating.

3) Loss Kernel

Next, cu2rec uses the Total Loss Kernel to reduce all the individual losses so RMSE and MAE can be computed.
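To illustrate the idea of a reduction sum, here is a sequential Python sketch of a pairwise (tree) reduction followed by the RMSE and MAE computation. It only mimics the access pattern a GPU reduction typically uses and is an assumption about the general technique, not a transcription of the Total Loss Kernel.

```python
import math


def tree_reduce_sum(values):
    """Sum values by repeatedly folding the upper half onto the lower half."""
    buf = list(values)
    n = len(buf)
    if n == 0:
        return 0.0
    while n > 1:
        half = (n + 1) // 2
        # On a GPU, each of these additions would be done by a separate thread.
        for i in range(n - half):
            buf[i] += buf[i + half]
        n = half
    return buf[0]


def rmse_mae(errors):
    """Compute RMSE and MAE from per-rating errors (must be non-empty)."""
    n = len(errors)
    rmse = math.sqrt(tree_reduce_sum(e * e for e in errors) / n)
    mae = tree_reduce_sum(abs(e) for e in errors) / n
    return rmse, mae
```

Each pass halves the number of active elements, so summing $n$ losses takes about $\log_2 n$ parallel steps instead of $n$ sequential additions.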

4) Total Loss Kernel

After computing the new error metrics, cu2rec checks to see if the learning rate should be lowered by checking a patience counter. Finally, cu2rec swaps the pointers of the updated $Q$ and $itemBias$ components because we want to use the new values for the next round of updating. We do not need to keep copies of $P$ and $userBias$ matrices as the updates are simply done on the original matrices. We only keep a target version of $Q$ and $itemBias$ as updates will have race conditions between threads. More is discussed about this in the next section, Early Bird Gets the Gradient.
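A patience-based learning rate schedule can be sketched as follows. The class name, the default `patience` of 2 and the `decay` factor of 0.5 are placeholder assumptions for illustration; they are not cu2rec's actual settings.

```python
class PatienceSchedule:
    """Lower the learning rate when the loss stops improving."""

    def __init__(self, learning_rate, patience=2, decay=0.5):
        self.learning_rate = learning_rate
        self.patience = patience      # non-improving checks tolerated
        self.decay = decay            # multiplicative decay (hypothetical value)
        self.best_loss = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        """Record a new total loss and return the learning rate to use next."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.learning_rate *= self.decay
                self.bad_checks = 0
        return self.learning_rate
```

Because the total loss is only computed every few iterations, `step` would be called at that same cadence rather than on every SGD pass.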

Once $totalIterations$ of the training loop are complete, it is time to save the trained model and free all necessary memory. First, all the trained components are copied back to the host variables. Next, the CUDA variables are freed along with all host variables that are not part of the trained model. Outside of the $train$ method, the main code is responsible for writing to file all necessary components of the model that can be used to serve recommendations. The necessary components to write to file are: $P$, $Q$, $userBias$, $itemBias$, and $globalBias$. Once all have been written to a file, cu2rec’s final steps are to free those variables’ memory on the host and terminate the program.

Early Bird Gets the Gradient

Due to the sequential nature of SGD, we had to come up with a technique to implement SGD with multiple threads potentially trying to update the features of the same item. In SGD, a single error is computed for a rating and both the user corresponding to that rating and the item have their feature matrices and bias weights updated. Because cu2rec parallelizes over users and each user picks a random item they have rated, the same item (especially popular items) is highly likely to be picked by multiple threads. The chance of this becomes near certain with non-trivial numbers of users and is guaranteed if you have more users than items due to the pigeonhole principle.

In order to handle race conditions on the same memory, we created the Early Bird Gets the Gradient technique. In the SGD Kernel (algorithm 1), each thread picks a random item from the user’s rated items. Then, it computes the predicted rating using the current feature matrices and takes the difference with the true rating to get the error. Next, it uses this error to update all of the features. Updating the $P$ matrix and $userBias$ values will never have write conflicts since each thread is responsible for a single user. However, updates to values in the $Q$ and $itemBias$ matrices require special care.

In EBGTG, the first thread to pick an item wins the race to write to that item. To implement this, we created a new boolean array, $itemIsUpdated$, that stores a true or false value indicating whether an item’s features have already been updated. When a thread selects a random item, it checks the value in the array and sets a local $earlyBird$ variable to $true$ if it ‘won the race.’ It then sets the value for this item to $true$ so other threads will set their $earlyBird$ variables to $false$. Only early-bird threads then spend time doing the $f$ global memory writes to $Q$ and the one global memory write to $itemBias[y]$.
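The flag logic can be simulated sequentially in Python. This sketch processes threads one at a time, so it cannot reproduce the benign races that happen on a real GPU; `early_bird_pass` is a hypothetical name and `picks[u]` is the item chosen by the thread handling user `u`.

```python
def early_bird_pass(picks, num_items):
    """Return, per thread, whether it is the early bird for its chosen item."""
    item_is_updated = [False] * num_items
    early_bird = []
    for item in picks:
        won = not item_is_updated[item]   # plain read, no atomic compare-and-swap
        item_is_updated[item] = True      # later threads will see the flag set
        early_bird.append(won)
    return early_bird
```

For example, with picks `[2, 0, 2, 1, 0]` the first, second and fourth threads win their items, while the third and fifth skip the writes to the item's features and bias.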

However, it should be noted that due to warp divergence, even if a single thread in a warp is selected as the early bird, all threads in the warp are blocked until the early bird thread is finished updating $QTrgt$. Warp divergence is a dangerous occurrence in GPUs when branch conditions are introduced into kernels. In a GPU, threads are bundled into a group of threads, usually 32, that is called a warp. Every thread in a warp executes each instruction of a kernel in lockstep. When a conditional is reached, the GPU has no choice but to block all threads that fail that conditional while all the threads that pass it perform all the instructions in that block of code. Then, when the conditional ends, all the threads begin again performing the remaining instructions in the kernel. Therefore, with EBGTG, if a single thread in a warp is the ‘early bird’, all other threads will be blocked until that single thread does the slow global memory writes.

We were worried at first that warp divergence would prevent any speedup from EBGTG versus Early Bird Loses the Gradient, but in empirical testing, we saw a consistent 12-15% speedup for the entire program.

Advanced Techniques

To optimize run-time and ensure best in class results on test sets, cu2rec utilizes multiple advanced techniques. The use of constant memory, registers, and an optimized reduction sum kernel use the GPU’s architecture to greater advantage. Our parallel lock-free updates and user index striding ensure we achieve the best results possible against test set ratings.

Constant Memory

All of our kernels rely on many of the same static variables, such as hyper parameters and dimensions. These values include total iterations, number of factors, $\eta$, $\lambda_p$, $\lambda_q$, $\lambda_u$, and $\lambda_i$.

The constant memory is useful because from the kernel’s point of view, it never changes, and can therefore be aggressively cached. Because all of the values we store in constant memory never change during the execution of cu2rec, all of these values get cached in the beginning of execution and remain in the cache for the runtime of the program.

Aggressive Register Use

Registers are the fastest memory available on the GPU and accesses to this memory can be over 100x faster than global memory¹. As such, one of our goals was to aggressively use registers to store any values that are used more than once in a kernel. We will break down which values we decided to store in registers.

In the SGD Kernel (algorithm 1), the following variables are created to store values in registers: $x$ (the user id), $low$ (the first rated item id index pointer), $high$ (the last rated item id index pointer), $yi$ (random item index pointer), $y$ (the item id), $ub$ (user $x$’s bias), $ib$ (item $y$’s bias), $error$ (the error of the model on the rating of user x on item y), $earlyBird$ (if the user is first to update item’s features), $pOld$ (old value for feature matrix $P$ at row $x$ and column $i$), and $qOld$ (old value for feature matrix $Q$ at row $y$ and column $i$).

In the Loss Kernel (algorithm 3), the following variables are created to store values in registers: $x$ (the user id), $ub$ (user $x$’s bias), and $itemId$ (the item index for every item $x$ has rated).

We experimented with having less register use, thinking it might allow for more blocks to be scheduled at the same time, but that resulted in more global memory accesses (and more pressure on the caches), and was overall empirically slower.

Reduction Sum

The total loss kernel, which is used to calculate the global loss after a set number of updates, uses a fixed number of threads $t_g$ per grid regardless of the number of ratings. This makes it easy to scale to very high number of ratings ($N$). It uses the reduction sum technique, wherein each thread at each step calculates a partial sum of its previous partial sum and the next thread’s previous partial sum. This is done for $log(N)$ steps.

Each thread initially calculates the sum of $N/t_g$ elements sequentially, where each element is $t_g$ apart from the previous one, allowing coalesced memory accesses. These sums are written into the shared memory.

After this step, each block (size $t_b$) reduces its own sums into the sum at $tid=0$. This is done for $log(t_b)$ steps: in the first step, the first $t_b/2$ threads each add together the partial sums at $tid$ and $tid + t_b/2$, then the first $t_b/4$ threads do the same with an offset of $t_b/4$, and so on. This approach coalesces the memory accesses and minimizes branch divergence. We also use loop unrolling to reduce the number of if statements, and we template the kernel on the block size so that the unnecessary unrolled checks are removed at compile time.

At the end, each thread with $tid=0$ writes its own sum to the global memory, so we get a partial sum per block. These can be fed into another reduce sum kernel, but we found doing the final addition on the host was faster.
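The three stages described above (strided per-thread sums, per-block tree reduction, final add on the host) can be sketched sequentially in Python. This is an illustration of the access pattern only, not the CUDA kernel. The function names and the tiny grid size are made up for the example, and the block size is assumed to be a power of two.

```python
def block_reduce(shared):
    """Tree-reduce one block's partial sums in place; the result lands at index 0."""
    stride = len(shared) // 2           # assumes a power-of-two block size
    while stride > 0:
        for tid in range(stride):       # on the GPU these additions run in parallel
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]

def total_loss(errors, threads_per_block=4, n_blocks=2):
    """Sum `errors` using a fixed number of simulated threads."""
    t_g = threads_per_block * n_blocks  # fixed grid size, independent of len(errors)
    # Stage 1: thread g sums elements g, g + t_g, g + 2 * t_g, ...
    partial = [sum(errors[g::t_g]) for g in range(t_g)]
    # Stage 2: each block reduces its own slice of the partial sums.
    block_sums = [
        block_reduce(partial[b * threads_per_block:(b + 1) * threads_per_block])
        for b in range(n_blocks)
    ]
    # Stage 3: the per-block sums are added together on the host.
    return sum(block_sums)

if __name__ == "__main__":
    print(total_loss(list(range(100))))
```

With a block size of 4, `block_reduce` takes two halving steps, which matches the logarithmic step count of the in-block reduction.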

Parallel Lock-Free Updates

As discussed in the section explaining Early Bird Gets the Gradient, cu2rec does not lock any memory address writes while updating feature matrix weights. Earlier versions of cu2rec utilized atomic operations to update the $Q$ and $itemBias$ matrices. However, after discovering that optimal results could be achieved by simply letting a single thread ‘win’ an update, we decided to remove any atomic operations from any kernel.

User Index Striding

EBGTG favors whichever thread selects an item first and sets the $itemIsUpdated$ value to $true$. Therefore, we must ensure that every thread (and therefore every user) gets a fair shot to be the ‘early bird.’ At first, with naive EBGTG, we were seeing 2-3% worse results on test data than other implementations. This gap in results increased for larger datasets, including almost 8% worse results on the Netflix Dataset when compared to best in class results (Xie et al.⁹). We learned that EBGTG was scheduling lower index users first since they would always be in the first few blocks every iteration. While there is no guarantee which blocks begin running first, there are limited SMs on a GPU and therefore some blocks are queued while waiting to run. We found that later users would consistently be assigned to higher block indexes, and thus run last.

In order to combat this, we decided to offset which user each thread uses to perform SGD. In the training loop, we add an offset variable every iteration and pass that to the SGD kernel (algorithm 1). This effect can be seen in the first line where $x$ is computed. What this offset does is ensure that over time, every thread will be responsible for different sections of the user matrix.

By implementing this striding, we immediately began to see equal results in test set error metrics compared to other well known implementations.
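Here is a small Python sketch of why the offset helps. The exact index formula in the kernel is not shown in this post, so `(tid + offset) % n_users` is an assumption on my part, as are the function names and the idea of a fixed-size "first wave" of threads standing in for the blocks that get scheduled first.

```python
def user_for_thread(tid, offset, n_users):
    # Assumed mapping: shift the thread id by the per-iteration offset and wrap.
    return (tid + offset) % n_users

def first_wave_counts(n_users, wave_size, iterations, stride):
    """Count how often each user lands in the first scheduled wave of threads."""
    counts = [0] * n_users
    offset = 0
    for _ in range(iterations):
        for tid in range(wave_size):    # only the earliest threads are counted
            counts[user_for_thread(tid, offset, n_users)] += 1
        offset = (offset + stride) % n_users
    return counts

if __name__ == "__main__":
    print("no offset:  ", first_wave_counts(8, 2, 4, stride=0))
    print("with offset:", first_wave_counts(8, 2, 4, stride=2))
```

Without the offset the same two users are always first in line. With it, every user gets a turn in the first wave over the four iterations.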

Results

We benchmarked our GPU code using an NVIDIA V100 GPU, and the CPU code with an Intel i7-6650U CPU.

Speedup

RMSE

GPU vs GPU

We tested our code both with a V100 on NYU’s Prince server, and a TITAN Z in NYU’s cuda2 server. We varied problem sizes, the memory bandwidth requirements, and the number of iterations to get a healthy comparison. This is visualized in the following figure:

The V100 has 5,120 CUDA cores vs 5,760 of TITAN Z, 900GB/s memory bandwidth vs 672GB/s of TITAN Z, and higher clock speed.

In terms of the cost in performance, going from the ML-100k to ML-20M dataset increases the total problem size, as there are more users, items, and ratings to train for. Holding all other things equal, we would expect a scalable code to have higher speedup with larger problem size, and our speedup satisfies this.

In terms of bandwidth requirement, increasing the number of factors from 50 to 300 has a direct effect on it, because the SGD kernel needs to retrieve more data from the global memory. Holding all other things equal, we would expect a scalable code to have higher speedup with increased memory bandwidth, and our speedup satisfies this.

Conclusion

GPU technology opens the door to massive acceleration for a wide variety of problems. Machine Learning has recently seen a massive boost in effectiveness from both powerful GPUs and the availability of big data to train complex models on. Recommender Systems are an interesting subset of machine learning as they benefit greatly from larger and larger datasets that allow models to uncover complex latent relationships between users and items. SGD is one of the most popular algorithms to optimize recommender system MF models. However, SGD provides a challenging problem for GPU implementations due to its inherent sequential definition.

This blog post introduced cu2rec, a novel parallel implementation of SGD used to solve recommender system matrix factorization. Through the use of a parallel lock free SGD kernel and a variety of advanced CUDA programming techniques, cu2rec achieves a 10x speedup over one of the fastest sequential recommender system MF libraries while matching best in class error metrics. In addition to outperforming sequential implementations of SGD MF, cu2rec has also been shown to scale with better GPU hardware.

References

D. B. Kirk and W.-M. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2016.
A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, “Deep learning with COTS HPC systems,” in International Conference on Machine Learning, 2013, pp. 1337–1345.
A. Bari, M. Chaouchi, and T. Jung, Predictive Analytics for Dummies. John Wiley & Sons, 2016.
Y. Koren, “The BellKor solution to the Netflix Grand Prize,” Netflix prize documentation, vol. 81, pp. 1–10, 2009.
Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.
W. Tan, L. Cao, and L. L. Fong, “Faster and cheaper: Parallelizing large-scale matrix factorization on GPUs,” CoRR, vol. abs/1603.03820, 2016. [Online]. Available: http://arxiv.org/abs/1603.03820
B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.
H. Yun, H. Yu, C. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon, “NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion,” CoRR, vol. abs/1312.0193, 2013. [Online]. Available: http://arxiv.org/abs/1312.0193
X. Xie, W. Tan, L. L. Fong, and Y. Liang, “CuMF_SGD: Fast and scalable matrix factorization,” CoRR, vol. abs/1610.05838, 2016. [Online]. Available: http://arxiv.org/abs/1610.05838

Concurrency With Go

2018-09-10T14:05:14+00:00

How to Write Concurrent Function Calls in Go

Go is an amazing language. I’d know, I just started using it this weekend. What are some of the cool things you can do with it? Well, how about writing concurrent function calls trivially! Let’s take a look.

Simple Sequential Function Calls

This code below is pretty simple. It just calls a function 20 times and prints i, from 0 -> 19

package main

import (
	"fmt"
)

func print(i int) {
	fmt.Println(i)
}

func main() {
	for i := 0; i < 20; i++ {
		print(i)
	}
}

Output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Simple Concurrent Function Calls

This code below attempts to fire off 20 concurrent function calls. We use the go keyword before the function call to tell Go to run that call in its own goroutine (a lightweight thread managed by the Go runtime). Let’s see what happens if we run this code.

package main

import (
	"fmt"
)

func print(i int) {
	fmt.Println(i)
}

func main() {
	for i := 0; i < 20; i++ {
		go print(i)
	}
}

Output:

That’s right: nothing was printed! Why did this happen? Well, the for loop fires off 20 concurrent routines. Those routines are off running on their own. main() then continues running. What is after the for loop? Well, nothing. So main() terminates. Those routines that split off never had a chance to even print anything! So how can we fix this? Let’s take a look at one approach using channels.

Concurrent Function Calls with Channel

We can use a channel in Go, which for our purposes works much like a counting semaphore. A buffered channel is a queue that can hold n ‘things’. For this case, we just want to send an int to signal that a routine has finished. The type doesn’t matter for what we need.

package main

import (
	"fmt"
)

func print(i int, c chan int) {
	fmt.Println(i)

	// this signals to the channel this routine is done
	c <- 1
}

func main() {
	numCalls := 20
	c := make(chan int, numCalls)

	for i := 0; i < numCalls; i++ {
		go print(i, c)
	}

	// this loop won't terminate until 20 ints have been received from the channel
	for i := 0; i < numCalls; i++ {
		<-c
	}
}

Output:

2 0 1 10 5 6 7 8 9 14 11 12 13 3 16 15 19 18 17 4

Now this looks concurrent!!

So, what happened can be boiled down to this. We set up a channel of size 20. We loop 20 times and start 20 goroutines. Next, main() enters a for loop. Each iteration tries to receive something from the channel. That for loop won’t finish until 20 things have come out. If the buffer is empty, Go just waits around until something is added. What is adding things into the buffer? You guessed it! Our routines! So, each time a routine is done, it adds something to the channel, letting main() inch closer to termination.

Concurrent Function Calls with WaitGroup

Here is another, more ‘Production Code Approved’ way of waiting for all goroutines to finish running. It involves the use of a WaitGroup.

package main

import (
	"fmt"
	"sync"
)

func print(i int, wg *sync.WaitGroup) {
	// defer means run this line right before exiting the function
	// wg.Done() signals to WaitGroup that this routine is done
	defer wg.Done()
	
	fmt.Println(i)
}

func main() {
	wg := &sync.WaitGroup{}

	for i := 0; i < 20; i++ {
		wg.Add(1)
		go print(i, wg)
	}

	wg.Wait()
}

Output:

0 19 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Building Books2Rec - A Book Recommender System

2018-06-08T12:05:14+00:00

Books2Rec is a book recommender system that started as a project for the Big Data Science class at NYU. Using your Goodreads profile, Books2Rec uses Machine Learning methods to provide you with highly personalized book recommendations. Don’t have a Goodreads profile? We’ve got you covered - just search for your favorite book.

Check it out here!

Introduction
How it Works
Project Structure
Creators
Acknowledgements
References

Introduction

Recommender systems are at the forefront of the ways in which content-serving websites like Facebook, Amazon, Spotify, etc. interact with their users. It is said that 35% of Amazon.com’s revenue is generated by its recommendation engine^[1]. Given this climate, it is paramount that websites aim to serve the best personalized content possible.

As a trio of book lovers, we looked at Goodreads, the world’s largest site for readers and book recommendations. It is owned by Amazon, which itself has a stellar recommendation engine. However, we found that their recommendations leave a lot to be desired.

Here is an example of Goodreads recommending a book about the difficult trek to the Western frontier of the US based on my high rating of the sequel to Charlie and the Chocolate Factory. I think we can do better. Example of an unrelated recommendation by Goodreads.

Below, we use a hybrid recommender system (ratings and item features) to provide recommendations for Goodreads users. Example of our recommendations based on our hybrid model.

We also provide more ‘traditional’ recommendations that only use the book’s features. Example of our recommendations based on pure book metadata features. Notice how it picks up on all the other books from the author despite author not being a feature we included in our model.

How it Works

We use a hybrid recommender system to power our recommendations. Hybrid systems are the combination of two other types of recommender systems: content-based filtering and collaborative filtering. Content-based filtering is a method of recommending items by the similarity of the said items. That is, if I like the first book of the Lord of the Rings, and if the second book is similar to the first, it can recommend me the second book. Collaborative filtering is a method by which user ratings are used in order to determine user or item similarities. If there is a high correlation of users rating the first Lord of the Rings book and the second Lord of the Rings book, then they are deemed to be similar.

Our hybrid system uses both of these approaches. Our item similarities are a combination of user ratings and features derived from books themselves.

Powering our recommendations is the Netflix-prize winner SVD algorithm^[2]. It is, without doubt, one of the most monumental algorithms in the history of recommender systems. Over time, we are aiming to improve our recommendations using the latest trends in recommender systems.

SVD for Ratings Matrix

What makes the SVD algorithm made famous during the Netflix challenge different from standard SVD is that it does NOT assume missing values are 0^[3]. Standard SVD is a perfect reconstruction of a matrix but has one flaw for our purposes: if a user has not rated a book (which is going to be the case for most books), then SVD would model them as having given a 0 rating to every book they have not rated.

In order to use SVD for rating predictions, you have to update the values in the matrix to negate this effect. You can use Gradient Descent on the error function of predicted ratings to accomplish this. Once you run Gradient Descent enough times, every value in the decomposed matrix begins to better reflect the correct values for predicting missing ratings, and not for reconstructing the matrix.
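As a rough illustration of that idea (not the code we used, which relied on the Surprise library), here is a tiny pure-Python matrix factorization trained with stochastic gradient descent. The ratings, hyperparameters, and function names are made up for the example, and for brevity the model only has a global mean plus latent factors (no user or item biases). The key point is that the loop only ever visits ratings that exist, so a missing rating is never treated as a 0.

```python
import math
import random

def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus the dot product of the factor vectors."""
    return mu + sum(p * q for p, q in zip(P[u], Q[i]))

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.02,
             epochs=300, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for u, i, r in ratings:          # only observed ratings are visited
            err = r - predict(mu, P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    """Root mean squared error of the model on the given rating triples."""
    squared = [(r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical data: 4 users, 3 books, 8 of the 12 possible ratings observed.
RATINGS = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 1), (3, 0, 5), (3, 2, 2)]

if __name__ == "__main__":
    mu, P, Q = train_mf(RATINGS, n_users=4, n_items=3)
    print("training RMSE:", rmse(RATINGS, mu, P, Q))
    print("predicted rating for user 1 on unrated book 1:", predict(mu, P, Q, 1, 1))
```

Once trained, the same `predict` function fills in the four cells that had no rating.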

Evaluation Metrics

As with all Machine Learning based projects, you want to make sure what you have used is ‘better’ than other popular methods. As stated before, we used RMSE to evaluate the performance of our trained Latent Factor (SVD) model. Below are the RMSE for several algorithms we calculated while building this project.

There are two widely used metrics in recommender systems that we also use. The Mean Absolute Error, otherwise known as MAE, is the average absolute difference between a predicted rating and the actual rating. Its close cousin, Root Mean Squared Error (otherwise known as RMSE), is still an average distance, but the difference between the predicted rating and the actual rating is squared before averaging (and the square root of that average is taken at the end), meaning that it is much more costly to miss something by a large margin than to miss something by a small margin.
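A quick way to see the difference between the two metrics is to compute both on toy data. The ratings below are invented for illustration: the first model is off by one star on every book, the second is perfect on three books and off by four stars on one.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error: average size of the miss."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error: squares each miss before averaging."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Off by one star everywhere.
print(mae([4, 4, 4, 4], [3, 5, 3, 5]), rmse([4, 4, 4, 4], [3, 5, 3, 5]))  # 1.0 1.0
# Perfect three times, off by four stars once.
print(mae([4, 4, 4, 5], [4, 4, 4, 1]), rmse([4, 4, 4, 5], [4, 4, 4, 1]))  # 1.0 2.0
```

Both models have the same MAE, but the one big miss doubles the RMSE.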

Approach	Params	Data	RMSE
User k-NN	k=80	Goodreads (subset 20%)	0.864
User k-NN	NA	Full (Goodreads + Amazon)	0.8875662310051954
Item k-NN	NA	Full (Goodreads + Amazon)	0.8876182681047732
SVD	factors=300, epochs=100	Full (Goodreads + Amazon)	0.842684489142339
SVD	factors=10, epochs=50	Full (Goodreads + Amazon)	0.844118472532902
SVD	factors=1000, epochs=20	Full (Goodreads + Amazon)	0.8627727919676756
Autoencoder	X-300-200-300-X	Full (Goodreads + Amazon)	0.893

Note: Not all results from HPC grid search are shown here, only the top model from each batch (small params, large params, medium params).
Note: The Autoencoder (inspired by this paper^[4]) results are highly experimental and need further hyperparameter optimization.

Our final model uses the SVD with 300 factors trained for 100 epochs. Overall, the lower factor models consistently performed better than the very high factor models. However, this middle ground (300 factors, 100 epochs) was the absolute best result from our grid search. We also subjectively liked the recommendations it gave for test users more than those of the very small factor model. This is because with only 10 factors, the model is very generalized. While this might provide small error for rating predictions, the recommendations it gave seemed to make no sense.

Why Hybrid?

Why would we not just use one, hyper-optimized Latent Factor (SVD) Model instead of combining it with a Content Based model?

The answer is simply that a pure SVD model can lead to very nonsensical, ‘black box’ recommendations that can turn away users. A trained SVD model is simply trying to assign factor strengths in a matrix for each item in order to minimize some cost function. This cost function is simply trying to minimize the error of predicting hidden ratings in a test set. What this leads to is a very optimized model that, when finally used to make recommendations for new users, can spit out very subjectively strange recommendations.

For example, say there is some book A that, after being run through a trained SVD model, is most similar in terms of ratings to a book B. The issue is that book B can be completely unrelated to A by ‘traditional’ standards (what the book is about, the genre, etc). What this can lead to is a book like Lord of the Rings Return of the King ending up being most ‘similar’ to a completely unrelated book like Sisterhood of Traveling Pants (yes this happened). This is because it could just be the case that these two books happen to always be rated similarly by users and thus, the SVD model learns to always recommend these books together because it will minimize its error function. However, if you ask most fantasy readers, they would probably prefer to be recommended more fantasy books (but not just all other books by Tolkien).

What this leads to is trying to find a balance between exploration (using SVD to recommend books that are similar only in how they are rated by tens of thousands of users) and understandable recommendations (using Content features to recommend other fantasy books if the user has enjoyed the Lord of the Rings books). To solve this issue, we combine the trained SVD matrix with the feature matrix. By doing this, when we map a user to this matrix, the user is mapped to all the hidden concept spaces SVD has learned. All the books that the model returns are then weighted by how similar they are to the features of the books that the user has highly rated. By doing this, you will get recommendations that are not purely within the same genre that you enjoy, but also not completely oblivious to the types of books you like.
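To give a feel for how blending the two signals changes the ranking, here is a toy Python sketch. It is one possible way to combine them (concatenating weighted, normalized vectors and using cosine similarity), not necessarily what our pipeline does, and the factor values and the `content_weight` parameter are invented for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length (all-zero vectors are left as they are)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def hybrid_vector(rating_factors, content_factors, content_weight=0.5):
    """Concatenate the two views of a book, weighted against each other."""
    r = [x * (1.0 - content_weight) for x in normalize(rating_factors)]
    c = [x * content_weight for x in normalize(content_factors)]
    return r + c

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def most_similar(title, vectors):
    """Other titles sorted from most to least similar to `title`."""
    target = vectors[title]
    scored = [(cosine(target, v), t) for t, v in vectors.items() if t != title]
    return [t for _, t in sorted(scored, reverse=True)]

# Hypothetical (rating factors, content factors) for three books.
BOOKS = {
    "Return of the King": ([0.9, 0.1], [1.0, 0.0]),
    "The Two Towers": ([0.7, 0.5], [1.0, 0.0]),
    "Sisterhood of Traveling Pants": ([0.9, 0.1], [0.0, 1.0]),
}

def ranking(content_weight):
    vectors = {t: hybrid_vector(r, c, content_weight) for t, (r, c) in BOOKS.items()}
    return most_similar("Return of the King", vectors)

if __name__ == "__main__":
    print("ratings only:", ranking(0.0))
    print("hybrid:      ", ranking(0.5))
```

With ratings alone, the unrelated book that happens to be rated the same way comes out on top. Once content features get half the weight, the other fantasy book wins.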

Project Structure

Data Sources

Goodbooks 10k

6 million ratings from Goodreads here: goodbooks-10k repository. Along with ratings, this data also includes great book metadata that was used for the Content Based Model.

Amazon Ratings

The Amazon ratings were kindly provided by Jure Leskovec and Julian McAuley^[5]^[6]. We used the subset of the book ratings that matched the Goodbooks 10k dataset.

Data Preprocessing

Data preprocessing is one of the most significant parts (if not the most significant part) of any Data Science project. The most difficult part of our data preprocessing was joining the Goodreads data and the Amazon ratings together. The Amazon ratings were keyed by an Amazon Standard Identification Number (ASIN), but not an ISBN. We mapped the ASIN to book titles, the Goodreads book ids to book titles, and did a hash-join on the two title sets to join both sets of ratings together.
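The title join can be sketched in a few lines of Python. This is a simplified illustration rather than our actual preprocessing code: the identifiers are placeholders, and the lowercase and whitespace normalization is an assumption about what a reasonable title match looks like.

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())

def hash_join_on_title(asin_to_title, goodreads_id_to_title):
    """Map each ASIN to the Goodreads book id that shares its title."""
    # Build phase: hash table from normalized title to Goodreads book id.
    # If two Goodreads books share a title, the last one seen wins.
    title_to_book_id = {
        normalize_title(title): book_id
        for book_id, title in goodreads_id_to_title.items()
    }
    # Probe phase: look up every Amazon title in the hash table.
    asin_to_book_id = {}
    for asin, title in asin_to_title.items():
        book_id = title_to_book_id.get(normalize_title(title))
        if book_id is not None:
            asin_to_book_id[asin] = book_id
    return asin_to_book_id

if __name__ == "__main__":
    amazon = {"ASIN-A": "The  Hobbit", "ASIN-B": "Some Unmatched Book"}  # placeholder ids
    goodreads = {7: "the hobbit", 8: "Dune"}                             # placeholder ids
    print(hash_join_on_title(amazon, goodreads))
```

Amazon ratings whose ASIN has no match are simply dropped, which lines up with only using the subset of Amazon ratings that matched the Goodbooks 10k books.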

In order to see the difference between the rating distributions of the two datasets, we used visualizations. The visualizations were generated using the R programming language.

The next step was generating the book features, which was done by constructing tf-idf vectors of the book descriptions, tags, and shelves. There were also a lot of missing images in the Goodreads dataset, which decreased the quality of our web app by a lot, and so these images were re-obtained from Goodreads.
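For readers who have not seen tf-idf before, here is a bare-bones Python version run on made-up book tags. It uses the plain textbook weighting (term frequency times log inverse document frequency). Library implementations usually add smoothing and normalization on top, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (documents must be non-empty)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

if __name__ == "__main__":
    tags = ["epic fantasy quest", "fantasy romance", "fantasy space opera"]
    for doc, vec in zip(tags, tfidf_vectors(tags)):
        print(doc, "->", vec)
```

A tag that appears on every book (here "fantasy") gets a weight of zero, while tags that are specific to one book get the highest weights.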

After these steps, the data was clean enough to be served on the web server and converted into a numerical format that could be consumed by Machine Learning algorithms.

RapidMiner

RapidMiner is a Data Science platform that allows for rapid prototyping of Machine Learning algorithms. We used RapidMiner to get a ‘feel’ for our data. It was great for quickly applying models and seeing their results, but it proved inflexible, and it could not handle more than 12000 users before hitting a memory error or an array overflow. With it, we were able to achieve an RMSE of 0.864 and an MAE of 0.685.

Surprise

Surprise is a Python library designed to generate recommendations and evaluate recommenders. It provides a nice API and a nice pipeline for recommender systems, but we found that it was not as malleable as we wanted it to be. It proved to be quite difficult getting different sorts of recommenders to work nicely with its pipeline, but standard algorithms like SVD were a breeze.

We used the Surprise library in order to do matrix factorization on the user-item matrix. The SVD algorithm of Surprise uses Gradient Descent to optimize the RMSE, which is one of our end goals. This differs from the regular SVD, where the regular one tries to minimize the matrix reconstruction error. The crucial difference is that Surprise does not assume that unrated items are rated as 0.

There are multiple hyperparameters one can use for training the SVD model. We used Grid Search on the hyperparameter space in order to find the best hyperparameters, with the help of NYU High Performance Computing.

Recommendation Pipeline

In order to have better control over the recommendations, we built our own recommendation pipeline. This pipeline takes as input the preprocessed ratings and book features, uses SVD to learn the item-concept matrix for both ratings and book features, combines the two results, calculates book similarities, and produces recommendations. For testing, this pipeline also includes k-fold cross validation and the calculation of error metrics.

Web App

Our web application is powered by Flask, the easy to use Python web framework. As mentioned above, our website is live for you to test your recommendations with.

Tools Used

Surprise: See Surprise
RapidMiner: See RapidMiner
RStudio: We used RStudio for Data Understanding visualizations
Jupyter Notebook: For testing all aspects of the Project Lifecycle. Code was moved to a general Util API folder once deemed useful
Python: The language of choice for the project and the web app
Pandas: Used to store books with all their metadata and also to store the user-item ratings
Hadoop (on HPC Dumbo): Used to get baseline metrics for collaborative filtering. Precomputation of the item-item similarity matrix using the large item-feature matrix on Spark. This is used as input to the content-based recommendation model in Mahout.
HPC (NYU Prince): There are multiple hyperparameters one can use for training the SVD model. We used Grid Search on the hyperparameter space in order to find the best hyperparameters, with the help of NYU High Performance Computing. The code for that can be found in the HPC folder.
Scikit-learn: We used Scikit-learn to run our vanilla SVD on item features.
Scipy: Used for efficiently storing sparse matrices (ratings matrices are extremely sparse)
Tensorflow: We used Tensorflow to test our Autoencoder, which was used to generate representations of items similar to how SVD on the item features work. Unfortunately, there are a lot of different hyperparameters to optimize with Deep Neural Networks, and we made better use of our time by focusing on the web app than the Autoencoder.
Flask: See Web App
Digital Ocean: Our web application is hosted on a DO server. We selected 1 GB of memory to keep the deployment lightweight

Creators

Acknowledgements

Dr. Anasse Bari - Project Advisor

References

http://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
The BellKor Solution to the Netflix Grand Prize
Generalized Hebbian Algorithm for Incremental Latent Semantic Analysis
Hybrid Recommender System based on Autoencoders
R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016
J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

The Effective Engineer - TLDR

2018-05-16T14:05:14+00:00

A TLDR of The Effective Engineer by Edmond Lau

Before starting my internship this summer at Bluecore, I was given this book to read by the team. Instead of just reading through it and probably forgetting most of it within a month, I wanted to do a type of book report. Writing down the key ideas and condensing the information into my own words helps with retaining information. We learned about this strategy in my overview of Learning How to Learn on how chunking helps make things stick.

Part 1 - Adopt the Right Mindsets

In the first part of the book, Lau focuses on how you, the individual, can increase your value creation.

Focus on High-Leverage Activities

Here Lau debunks the myth of the startup or company where engineers work 70-90 hours a week in order to be successful. Instead of pure hours worked, Lau explains how you should instead focus only on high leverage activities.

Use Leverage as Your Yardstick for Effectiveness

Leverage = Impact Produced / Time Invested

80-20 Rule: 80% of your impact will come from 20% of your time. Find what that 20% is! Hint: It’s not going through your emails or sitting in meetings.

1 hour each day for 20 days is only 1% of an engineer’s yearly hours worked. Yet, 1 hour each day for the first 20 days of an engineer’s job can snowball that learning into massive leverage gains later on.

Three ways to Increase Leverage

Reduce the time it takes to complete an activity
Increase the output of an activity
Shift focus to a higher leverage activity

Here are a few tips to accomplish the above:

Condense meetings into 30 minutes instead of an hour
Prepare an agenda for meetings to streamline them
Replace non-important meetings with emails
Automate parts of your development or testing workflow
Focus on launch critical tasks
Use profiling tools to instantly know where bottlenecks are versus using time finding them

Focus on Leverage Points, not Just Easy Wins

People think leverage can be increased by just focusing on a ton of ‘low-hanging’ fruit. This is wrong, as a bunch of small wins will still be eclipsed by large, critical, high impact tasks.

Time is like money. You want to invest it in what will give the biggest return. You wouldn’t use your days going around collecting quarters off the street. Instead, invest in things with high payouts.

Optimize for Learning

Learning is the single most important investment you can make in yourself (except maybe health and fitness but there are other books on that). Figure out how to optimize your time so you are learning as quickly and early as possible as knowledge is like money: it has compound interest.

A big factor in deciding to work somewhere is whether you think you will be challenged enough to always be learning.

Adopt a Growth Mindset

Be a Yes Man to learning opportunities. I actually did this with my life a few years ago (say yes to everything) and it really does change your life.

Force yourself to believe that you can always learn new things and skills and that belief will actually get you to partake in activities that foster growth.

Invest in Your Rate of Learning

Learning is like money: it is affected by compound interest.

Invest early and invest in high return knowledge (things that will make more future learning easier). For me, this is math as many of the things I want to do require a solid foundation in Mathematics.

Seek Work Conducive to Learning

We spend most of our waking hours at work. You should find ways to learn while at work.

Below are criteria you should be measuring when deciding where to work:

Fast Growth: is the company growing fast or just kinda floating along?
Training: the company should offer extensive training opportunities (not just for new hires!)
Openness: Different teams should not be isolated and closed off
Pace: you should be able to get things done quickly and not get bogged down by miles of red tape
People: never be the smartest person in the room
Autonomy: you should have some freedom and say in what you really want to work on

Dedicate Work Time for Learning

Google’s 20% Time: Engineers get one day a week to work on whatever side project they think will help the company in the long run.

Ten suggestions on how to learn on the job:

Study others’ code
Write more code
Go through internal technical documentation
Master your most used programming language
Ask for code reviews from the harshest critics
Enroll in classes (even in nearby University)
Participate in design meetings
Work on diverse projects (Interleaving from Learning How to Learn anyone??)
Make sure your team has at least one engineer more senior than you so you can learn
Jump into new code bases without fear

Always be Learning

You should also be learning outside of work. Even learning in other disciplines can trickle over to better work performance.

Here are ten ways to learn:

Learn new programming languages and frameworks
Invest in high demand skills (future proofing)
Read books (check out Books2Rec for recommendations!)
Join discussion groups or book clubs
Attend meetups, talks, and conferences or even just watch them online
Network and maintain relationships
Follow bloggers that share great knowledge
Write to teach. Trying to teach others is by far the best way to learn material yourself. This is my inspiration for this blog!
Work on side projects
Pursue what you love. Don’t waste time on TV, web surfing, or social media

Prioritize Regularly

Success of your company or yourself requires prioritizing the things that actually matter. Prioritizing is a skill like all others. You can get better at figuring out what is important and also estimating how long things will take.

Track To-Dos

I use Google Calendar and Google Tasks to track everything I have to and want to do every day and week. There are thousands of apps that do this, or you can even just keep a journal.

Learning How to Learn also showed us the importance of writing down To-Dos. Our working memory is a very limited cache so writing things down helps us clear out our cache for more in the moment things and doesn’t let us forget future things to do.

You should also learn to assign priority and estimated time to your to-do items. Don’t just brain dump them. Tomorrow you will have to use time to sort through it all!

Focus on what Directly Produces Value

There are an infinite number of things you can do at this moment. However, you should do the things that produce the most value.

A good example Lau provided is saving money. Many people say you should skip the Starbucks $4 coffee every morning as this will save X amount of money a year. However, there are other, shorter (not easier) ways that will save MORE money.

Spending an hour or two researching cheaper high expense items like hotel rooms or tickets can save hundreds of dollars.
Optimizing your stock and savings into higher return accounts will generate more money than saving on coffee every morning.
Negotiating a higher salary can net you tens of thousands of dollars a year for a day’s worth of extra effort.

Of course, saving $4 a day on coffee is still helpful but always look for bigger value items first to tackle!

Focus on the Important and Non-Urgent

This is the famous Quadrant chart.

The lesson is that Important and Non-Urgent is JUST AS HIGH PRIORITY as Urgent and Important.

Protect Your Maker’s Schedule

Engineers need long, uninterrupted blocks of time to get things done. Limit your interruptions and make it known you need blocks of ‘do not disturb’ time. This is why working from home can be very helpful.

Limit Amount of Work in Progress

Our working memory can store 7 +/- 2 chunks. If you have 20 ‘In Progress’ tickets in JIRA, something is wrong. Limit this to 2 or 3.

Fight Procrastination

My Learning How to Learn post has valuable information on this topic, much of which is repeated in this chapter of the book.

One new piece of advice is to form a plan of when you will do things in the future. The simple fact of planning our future blocks of time to work on something will actually make you 2-3x more likely to do it.

Make a Routine

Use the Pomodoro Technique to split up your day into 25 minute productive chunks. Aim to maximize how many Pomodoro blocks you complete each day for high leverage work (25 minutes of answering emails does not count).

Split up your To-Do list into ‘Doing’, ‘Today’, and ‘This Week’ sections and shuffle things from one to the next each day. Only assign as many things as can realistically get done per day.

Part 2 - Execute, Execute, Execute

Part 2 explains how an engineer can actually get stuff done.

Invest in Iteration Speed

Companies are starting to realize that pushing dozens of code changes every day is better than a few large code changes every week or month. Setting up a workflow that allows for this constant iteration is crucial to surviving in today’s fast paced tech scene.

Move Fast to Learn Fast

‘Move Fast and Break Things’ - Facebook mantra. It’s better to move fast and learn from failures quickly than taking it too slow and not taking enough risks. It’s easier to roll back changes and new features than to push dozens of them instantly.

Invest in Time Saving Tools

One way to save time is to create tools that automate or speed up repetitive processes. Even though this slows you down at first, over the long run it will save more time.

Another technique to save time is to prototype in high level languages like Python. Only use low level languages once you have the prototypes done and approved.

Shorten Your Debugging and Validation Loops

Create code that drops you right into the buggy pieces of code so you can work on debugging them rather than wasting time repeatedly getting back to the bugged code. The book highlights this with the example of CSS styling for specific web pages: code up a shortcut to take you right to the page you are styling rather than clicking through from the login page every time.

Master Your Programming Environment

Get proficient with your favorite text editor or IDE
Learn at least one high-level language that can be used to prototype ideas (Python)
Get familiar with shell commands (Unix shell). This is VERY important.
Prefer keyboard shortcuts over the mouse
Automate your manual workflows. If you’ve done something the same way twice, automate it for the third time.
Test ideas in an interpreted language and not in one that requires code compilation (C, C++, Java)
Make running unit tests insanely easy and fast by writing Makefiles.

Don’t Ignore Non-Engineering Bottlenecks

Here are a few bottlenecks that can occur and slow down engineering:

Waiting on other people to get things you need (e.g. Photoshop images from the Design Team). Solve this with constant communication with your PM and updating them with what you need.
Obtaining approval from higher-ups. This is hard to solve and should be avoided in the engineering culture.
Review processes (QA). Do not wait to QA all your work at the last minute. QA should happen in real time alongside development.

Measure What You Want to Improve

You need to convert certain goals into numeric values in order to measure them. It’s hard finding what to measure. For example, Google figured out that clicks were not a good thing to measure, but the time a user spent on a result page before returning to the search results was a great metric.

Use Metrics to Drive Progress

Metrics provide the following pros:

They help you stay focused on the right things
When visualized over time, they help prevent future regressions by pinpointing what changes caused them.
Good metrics drive progress forward at all times (assuming the metrics are increasing).
Metrics let you measure effectiveness of what you are spending time on and allow you to compare the value of doing other things.

Pick the Right Metric

Metrics need to satisfy the following 3 properties:

Metrics need to maximize impact
Metrics need to be actionable. They should provide info that you can make changes on to respond to. They should not be vague.
Metrics need to be responsive. You need to pick a metric that will respond to a change made today so you can measure it.

Instrument Everything to Understand What’s Going On

You should be measuring EVERYTHING in your company. Not doing this will result in scrambling to figure out why some crucial service died. Large companies can afford to build custom in-house software to do this, but there are many third-party monitoring tools that any tech company can purchase.

Internalize Useful Numbers

Know your numbers! This applies to finance as it does to the technology companies. Knowing your numbers is like knowing your health vitals. Here are a few numbers to always know:

Number of registered users and number of active users
Amount and total data capacity
Amount of data read/written every day
Number of servers a single service takes up
Throughput of services or endpoints
Growth rate of traffic
Page load time (average, per browser, etc)
Traffic distribution across different pages or services
Distribution across devices and OS’s

Be Skeptical of Data Integrity

Statistics can lie and liars use statistics. Never blindly trust data. Here are a few ways to increase the trustworthiness of your data:

Log as much data as you can
Build tools that assist data accuracy earlier rather than later
Write integration tests to ensure data quality has not worsened
Examine data soon after it’s collected rather than next week when it’s needed
Use cross validation (compute the same metric using different pieces of the data)
If your gut tells you a number looks off, it usually is. Investigate!
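The cross validation item in the list above (computing the same metric from different pieces of the data) can be sketched in a few lines. This is only a toy illustration: the event log, the users table, the function names, and the 1% tolerance are all made up for the example and do not come from the book.

```python
from collections import Counter

# Hypothetical raw event log: (user_id, event_type) pairs.
events = [
    (1, "signup"), (2, "signup"), (3, "signup"),
    (1, "login"), (2, "login"), (1, "login"),
]

# Hypothetical users table that should agree with the event log.
users_table = [{"id": 1}, {"id": 2}, {"id": 3}]


def signups_from_events(log):
    """Count signups by scanning the raw event log."""
    return Counter(kind for _, kind in log)["signup"]


def signups_from_table(table):
    """Count signups from the independent users table."""
    return len(table)


def metrics_agree(a, b, tolerance=0.01):
    """True when two estimates of one metric differ by at most `tolerance` (relative)."""
    if a == b:
        return True
    return abs(a - b) / max(abs(a), abs(b)) <= tolerance


from_events = signups_from_events(events)
from_table = signups_from_table(users_table)
print(from_events, from_table, metrics_agree(from_events, from_table))
```

If the two counts drift apart by more than the tolerance, that is the signal to go investigate before anyone builds a decision on the number.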

Validate Your Ideas Early and Often

Optimize for feedback earlier. Don’t spend a year working on a product only to have the end users hate it.

Find Low Effort Ways to Validate Your Work

Create a MVP (Minimum Viable Product) and use this to gather feedback and make changes. Spend the first 10% of your time creating it. This will save massive amounts of time compared to changing features of the final product that took 50% of your time to create.

Validate Product Changes with A/B Testing

A/B testing is when you show a percentage of users one version of the product and the other percentage a control version. You only make changes if the changed version responded with better metrics. A/B testing is a critical part of all of the largest tech companies and it is not going anywhere soon. It essentially allows you to use statistics on subjective feedback, something that is traditionally not possible.
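To make "responded with better metrics" a bit more concrete, here is a minimal sketch of one common way to compare two versions: a two-proportion z-test on conversion counts. The function name and the numbers are hypothetical, and this is just one standard textbook test, not the method any particular company uses.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: control (A) vs changed version (B).
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be noise. The sketch assumes both groups have at least one conversion and one non-conversion between them; otherwise the standard error is zero and the division fails.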

Beware the One Person Team

Try to avoid having a single person be responsible for an entire feature or product. However, this is sometimes not possible to avoid. Here are some tips to help this not blow up your company if that person gets run over by a bus:

Be open to feedback
Commit code early and often
Request code reviews from the toughest reviewers
Bounce ideas off diverse range of teammates
Have people review design docs before wasting time coding things up
Structure projects so different teammates have responsibility for different pieces
Get end-user buy in before implementing features

Build Feedback Loops for Your Decisions

Rather than making an important decision and moving on, set up a feedback loop that enables you to collect data and measure how valuable your work has been up to this point.

Improve Your Project Estimation Skills

44% of projects are delivered late, over budget, or without key requirements.
24% never complete at all.
79% is the average amount by which an overrun project exceeds its initial estimate.

Use Accurate Estimations to Drive Project Planning

Decompose the project into granular tasks (not one mega ticket)
Estimate on how long a task will take, not how long you desire it to be done in
Estimates are probability distributions. The actual time to complete will be under or over the mean time you have picked (assuming you perfectly picked the mean to begin with!)
Always have the person responsible for the task actually make the estimate
Beware the mythical man-month. Adding an engineer to a one person team does not double the speed of the project. Adding bodies does not follow a linear curve.
Use historical data to fine tune your estimates for future projects. If you always underestimate, add time to this project’s estimates.
Don’t let features keep growing in time
Allow others to challenge estimates and don’t make them afraid to speak up
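The point that estimates are probability distributions can be illustrated with a tiny simulation. This is a sketch under my own assumptions: each task gets a hypothetical (best, likely, worst) estimate in days, and I model each one with a triangular distribution, which is a convenient choice for illustration and not something the book prescribes.

```python
import random


def simulate_project(tasks, runs=10_000, seed=0):
    """Simulate total project duration from per-task (best, likely, worst) estimates in days."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        # random.triangular takes (low, high, mode).
        total = sum(rng.triangular(best, worst, likely) for best, likely, worst in tasks)
        totals.append(total)
    totals.sort()
    return totals


def percentile(sorted_values, pct):
    """Return the value at the given percentile (0-100) of a sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


# Hypothetical granular tasks: (best case, most likely, worst case) in days.
tasks = [(1, 2, 5), (2, 3, 8), (1, 1, 3)]
totals = simulate_project(tasks)
print("sum of 'likely' estimates:", sum(likely for _, likely, _ in tasks))
print("median:", round(percentile(totals, 50), 1))
print("90th percentile:", round(percentile(totals, 90), 1))
```

When the worst cases sit much further from the likely value than the best cases do, as in these made-up tasks, the simulated median tends to land above the simple sum of the "likely" numbers. That is one way to see why projects so often run over.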

Budget for the Unknown

There will always be things you can’t predict that will eat up at your time. Always budget extra time for these unknowns and budget more ‘unknown’ time for longer projects.

Define Specific Goals and Milestones

As with using metrics to measure the value of a product, you can use goals and milestones to measure the growth and value of a development project. Goals and milestones also help with not feeling burnt out.

Reduce Risk Early

Always tackle the hardest and riskiest features first. You don’t want to complete 9/10 easy features and then realize the 10th is not doable and have to scrap the project.

Beware Project Rewrites

Engineers are always thinking a rewrite will be easy, quick, and much much better than what already exists. 99% of the time they are wrong.

Don’t Sprint in the Middle of a Marathon

Only do a sprint or crunch if the end is actually near. If you sprint in the middle of a project, you are dooming your team and will not be able to finish. Here are a few facts to keep in mind:

Productivity of a single hour decreases as you work more hours
You are more behind schedule than you think you are. This is especially true in the early stages
More hours can lead to burnout
Working more hours can lead to team members resenting each other
Communication will ramp up as deadlines approach and this can lead to bottlenecks and overhead
Sprinting and crunch times increase technical debt

Part 3 - Build Long-Term Value

Companies are trying to maximize profit and revenue in the long run, not just for this year. There are many tips and tricks to maximize value in the long-term.

Build Quality with Pragmatism

Software quality is always a trade off between actually getting things done and having solid, bug free code. Large companies like Google can get away with insanely difficult code standards to achieve while this level of confinement would cripple any start up that needs to get a product up and running.

Establish Sustainable Code Review Process

Code reviews should be present in any tech company. Here are a few reasons why:

Code reviews catch bugs and design flaws early
Knowing you have to be reviewed makes you less likely to commit quick and dirty code
Allows others to see what the company’s code standards are
Distributes the knowledge of the code base to others
Increases long term agility because the code base won’t be riddled with bugs

Manage Complexity Through Abstraction

Eventually, you will want to do things that are so advanced, you can’t expect single engineers to recreate the entire code needed. Just as we don’t all code in Assembly, we want to abstract as many things as possible. A great example is the Google File System and MapReduce (the ideas that turned into Hadoop). This abstraction of running algorithms on distributed data allowed engineers to focus on writing algorithms rather than managing the insanely complex intricacies of distributed data.

Automate Testing

Automating testing allows you to keep developing without the fear that your changes are breaking existing features. Spend time setting up large scale and comprehensive testing scripts that cover most of your code and features.
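As a small illustration of what an automated test looks like, here is a sketch using Python's built-in unittest module. The `apply_discount` function is a hypothetical stand-in for real product code.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test: reduce `price` by `percent` percent."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTests(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(19.99, 0), 19.99)

    def test_invalid_percent_is_rejected(self):
        # A later change that silently accepts bad input would break this test.
        with self.assertRaises(ValueError):
            apply_discount(10.0, 150)
```

Saved as, say, test_discount.py, this can be run with `python -m unittest test_discount`. Once tests like these run on every change, a regression shows up as a failing test instead of a bug report.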

Repay Technical Debt

Eventually technical debt needs to be dealt with. However, you can pick and choose which debts to pay off. LinkedIn, for example, spent months after going public freezing all code changes and only fixing technical debt in the code base.

Minimize Operational Burden

Avoid tech or tools that will create headaches later on. Use stable and trusted software to build your products with. Keep it simple stupid!

Embrace Operational Simplicity

Always keep things simple. Here are a few things that happen if you don’t:

Engineering expertise gets splintered across multiple systems
More complex pieces of the puzzle introduce more points of failure
New engineers will take longer to get up to speed
Complexity takes time away from testing, abstraction, etc

Build Systems to Fail Fast

Failing fast is actually preferable to staying alive after a problem has occurred. This doesn’t seem to make sense, but it lets you solve the real issues as they occur rather than after they have corrupted other parts of the product. Always deal with critical errors as they occur and don’t let them slip by undercover for weeks or months.
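Here is a minimal sketch of what failing fast can look like in code, assuming a hypothetical service that reads its settings from a dictionary. The setting names are made up for the example. The idea is to raise at startup on bad input instead of quietly falling back to a default and misbehaving later.

```python
def load_config(raw):
    """Validate a hypothetical service config and fail immediately on bad input."""
    if "database_url" not in raw:
        # Crash at startup rather than limping along without a database.
        raise KeyError("missing required setting: database_url")
    workers = int(raw.get("workers", 1))  # raises ValueError on junk like "four"
    if workers < 1:
        raise ValueError(f"workers must be >= 1, got {workers}")
    return {"database_url": raw["database_url"], "workers": workers}


config = load_config({"database_url": "db://example", "workers": "4"})
print(config)
```

The alternative, catching the error and substituting a guess, keeps the process alive but hides the real problem until it has spread.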

Relentlessly Automate Mechanical Tasks

Here are a few things that should be automated (versus doing manually):

Validating code or running tests
ETL of data
Detecting error rate spikes
Deploying software to new machines
Restoring database snapshots
Running batch computations
Restarting a web service
Checking code styles (use a linter)
Training machine learning models
Managing user accounts and data
Removing or adding servers to services

Make Batch Process Idempotent

Idempotent: running the same script over and over, no matter how many times, gives back the same results. This is the same idea as non-mutable code: re-running something should not leave your state any different than running it once. Without this property, it is harder to track down errors.
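A minimal sketch of the difference, using a hypothetical daily-totals job and an in-memory dictionary standing in for a database table. The idempotent version overwrites the result for the day, while the counter-example adds to it and double counts when the job is re-run.

```python
def run_daily_totals(orders, totals_store, day):
    """Recompute the total for `day` and overwrite (not add to) the stored value."""
    day_total = sum(amount for order_day, amount in orders if order_day == day)
    totals_store[day] = day_total  # overwrite keyed by day: safe to re-run
    return totals_store


def run_daily_totals_not_idempotent(orders, totals_store, day):
    """Counter-example: adding to the stored value double counts on a re-run."""
    day_total = sum(amount for order_day, amount in orders if order_day == day)
    totals_store[day] = totals_store.get(day, 0) + day_total
    return totals_store


# Hypothetical input: (day, amount) pairs.
orders = [("2018-05-01", 10), ("2018-05-01", 5), ("2018-05-02", 7)]

good = {}
run_daily_totals(orders, good, "2018-05-01")
run_daily_totals(orders, good, "2018-05-01")  # retry after a crash, same result

bad = {}
run_daily_totals_not_idempotent(orders, bad, "2018-05-01")
run_daily_totals_not_idempotent(orders, bad, "2018-05-01")  # retry doubles the total

print(good, bad)
```

With the idempotent version, the safe response to a failed or uncertain run is simply to run it again.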

Be Able to Respond and Recover Quickly

Create a Chaos Monkey: a piece of software that creates havoc in your systems, bringing down services and other pieces of infrastructure. This allows you to know how to support your products even with individual pieces going down. All of the major tech companies now use Chaos Monkeys to test their products. It’s better to fail in a test environment than in the real world.

Invest in Your Team’s Growth

Your individual success comes from how successful your company is and your company’s success stems from how successful all of the individual workers are. This creates a nice Game Theory environment where it’s in everyone’s interest to help each other and the company.

Make Hiring Everyone’s Responsibility

Hiring might be one of the most critical parts of your company’s long term health. Hiring determines who is working on and developing all the things that need to be done. It also determines who your teammates will be, so hiring should be everyone’s priority.

Design a Good Onboarding Process

Onboarding is critical to getting new hires up to speed and productive. Every hour they aren’t onboarded is a loss to the company’s bottom line. Here are 4 goals all onboarding processes should accomplish:

Ramp up new engineers as quickly as possible
Impart the team’s culture and values so everyone is on the same page
Get new engineers to master the fundamentals critical to this company’s products and tools
Get new engineers socially integrated with existing team members

Mentorship is also a widely used and effective onboarding technique.

Sharing ownership of code helps team morale and also prevents projects from being derailed if an employee leaves or is hit by a bus. Here are some ways to increase code ownership:

Avoid one person teams
Do code reviews
Rotate tasks around the team
Keep code quality high. It’s easy to own code if you can know what it does quickly
Do tech talks and show and tells
DOCUMENT CODE!!
Document workflows and non-obvious things to do to get things working
Have mentors

Build Collective Wisdom Through Post-Mortems

It’s easy to pat yourself on the back after a successful launch and to glow in the praise, but you actually learn more from dissecting what went wrong on a failed project. Set aside your ego and talk about it.

Build a Great Engineering Culture

People want to work at places with good culture. Here is what the best of the best are looking for in a company:

High iteration speed companies. Engineers don’t want to wait a month to push one feature
High value for automation
Great abstractions for complicated services
High code quality
Respectful work environment
Shared ownership of code
Automated testing and solid QA
Allow for experimentation time (hackathons or 20% time)
Foster a high learning environment
Tough hiring standards. People will want to work with others they know got through the gauntlet.

Recommender Systems Overview

2018-04-21T14:05:14+00:00

Below is a presentation I did for my Big Data Science class on Recommender Systems


Learning How to Learn

2018-04-10T14:05:14+00:00

An Overview of Coursera’s Course on Learning

What is Learning

Focused versus Diffuse Thinking

Focused thinking is when we actively concentrate on the matter at hand. We are in an attentive state of mind and we often focus on a small set of information at a time. Diffuse thinking is when we let our mind wander freely. It is the type of thinking you engage in when you are ‘daydreaming.’ Both of these types of thinking are important for learning.

One way to ensure you enter diffuse mode from time to time is to take breaks during focused learning. A technique to accomplish this is the Pomodoro Technique. Basically you set a timer (for like 25 minutes) and perform focused learning during that time. After the 25 minutes is up, take a 5 minute break and allow your mind to wander.

One of the biggest traps we can fall into is trying to learn everything in a short amount of time. This is like a weight lifter trying to train his muscles for a competition the day before. It’s not going to happen. Your brain is like your muscles: it needs time to build itself up and make new connections. Approach learning like you do exercising: consistent work everyday is the only way to build a solid foundation.

Procrastination, Memory, and Sleep

Everyone suffers from some level of procrastination. One way to combat this is to use the Pomodoro Technique. Using the timer to ensure you actually have blocks of focused learning can help tremendously in combating procrastination.

To stave off procrastination, you should also minimize distractions. Find a quiet place to study and turn off your phone. Also, keep in mind that procrastination is often caused more by the pain of thinking about the thing you have to complete. Often, the process of getting things done is not as scary. If you can get yourself to just start the process, you’ll push away that pain of thinking about what you still have to do.

Your memory is made up of short term and long term memory. Long term memory is like a warehouse where you can store millions of things. However, getting things back can be hard. Short term memory is like a blackboard that is always slowly fading away. In order to store things in long term memory, you need to practice, practice, and practice what you are trying to learn. Just writing it on the blackboard will not make it ‘stick.’

Sleep is critical to learning. We still don’t know what exactly sleep does for us, but we know exactly what lack of sleep can do. Lack of sleep literally builds up toxins in your body and many of these can damage your brain.

Chunking

Chunking - The Essentials

Chunking is when you break up what you want to learn into smaller pieces. You can think of this as an analogy to a puzzle. The concept you are trying to learn is the complete puzzle. To solve it, you have to learn all the individual pieces and then put them together in the right way. The part of finding the individual pieces is chunking.

As you learn more and more chunks, you can make each chunk bigger and bigger. Chunks can also ‘transfer.’ What this means is that learning chunks in one area can actually help learn other things.

Seeing the Bigger Picture

Often times it is better to understand the big picture before starting to chunk the concept. Have you ever been in a math class where you just start learning equations and random relationships while having no idea what the whole point is? If you have, you’ve been victim to trying to solve a puzzle without even knowing what the complete picture is supposed to be.

The Illusion of Competence is when you trick yourself into thinking you know a subject. Some people will read and reread the material and do simple problems over and over again, thinking they are mastering the subject. Often times, they are kidding themselves. To really know a subject, you can do multiple things. Test yourself with difficult questions. Also, test yourself across multiple sections rather than just sticking to one subject at a time. This is called interleaving and is shown to improve learning.

Einstellung is when you have become stuck in one way of thinking. It’s caused by a neural pattern that is ingrained in your brain. It’s hard to think about something in a different way because of this.

Procrastination and Memory

Procrastination

Remember, learning is like lifting weights: you can’t do it all in one day. Just like building muscle mass, changing the structure of your brain with new knowledge is a slow and difficult journey. Because it takes so long for real learning, you can’t save it all for the last minute. To help with procrastination, keep a journal. Set goals and write down tasks. This can add a sense of reward to knocking things off the lists. Also, do the hardest things first in the day. Whatever makes you feel most uncomfortable thinking about it, do that first.

Memory

Because it takes time and effort to move things into the long term memory warehouse, you should always start learning early. In order to do this, you should learn how to manage your procrastination.

As you get better and better in a subject, your short term memory becomes more efficient. This is because you can store ideas and new things in smaller chunks on the blackboard. You can think of this like compression. You are able to save the same amount of memory in smaller blocks.

One tool to help with memorization is to group things together with meaningful connections and also to use analogy and metaphor. Remember, even the most complex models in math and science are just glorified metaphors of the universe.

Renaissance Learning and Unlocking Your Potential

Summary

Metaphors and analogies aren’t just for art and literature. One of the best things you can do to not only remember, but more easily understand concepts in many different fields, is to create a metaphor or analogy for them. Often, the more visual, the better.

Try to avoid ‘genius envy’. Sure, intelligence is real and some people are smarter than others, but that doesn’t paint the whole picture. You don’t need to be a genius to work hard at mastering how to learn. And the person who puts in the hard work of learning things will always beat the smart person who does not apply themselves to anything.

Many people still suffer from imposter syndrome. This is when you feel inadequate and think everyone around you is so much smarter and you just haven’t been found out yet. The truth is almost everyone feels this way at times. That’s why the term ‘fake it till you make it’ is often good advice because almost no one is 100% confident in themselves.

When preparing for tests, study in groups. Studying with others helps you see the material in a different way, get asked questions you wouldn’t have thought of yourself, and also can help you learn even more by ‘teaching’ the material to others.

Conclusion

If you haven’t already, take Coursera’s Course on Learning. You’d be surprised how many little things people would think obvious or too simplistic can actually make a difference in your learning. Why would you not want to master a skill that will make all future skill learning easier?

Recommender System with Rapid Miner

2018-03-02T23:05:14+00:00

User k-NN Collaborative Filtering for Item Recommendations - A step by step guide in Rapid Miner

As part of a class for NYU, a team of 3 of us are building a recommendation system for books. To quickly prototype a dead simple recommender system, we put together a simple Rapid Miner workflow. You can read more about this here at Doruk Kilitcioglu’s blog. Below is the step by step guide we used to get results from Rapid Miner for item recommendations using user-user collaborative filtering.

Download ratings.csv from http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/
1. NOTE: If you don’t have an educational license with RapidMiner, you can only load in 10k rows. Open and edit the ratings file and trim it down to 10k rows.
2. You can get an education license from the RapidMiner website if you make an account and add an .edu email
Download RapidMiner and install to your machine
Start a New Process and make it Blank
Loading the Data
1. Hit Add Data at the top left under repository
2. Click on My Computer and find ratings.csv from your local machine
3. Hit all the Next buttons and then save the file under data
4. This might take up to a minute
5. From the top left, expand Local Repository, then data, and then drag ratings.csv to the right window
6 million ratings is too much for RapidMiner to process so let’s filter it down
1. Find the Filter Examples operator and drag to the right window
2. Hook up the output of Retrieve ratings to the input of Filter Examples
3. Click on Filter Examples and click on the Add Filters button to the far right
4. Ensure user_id is selected as the left field
5. Make the filter operator (should be = by default) a <
6. Type in anywhere between 500 to 1000
7. Hit OK
Set the role of the columns
1. Add the Set Role operator to the window
2. Click on the box
3. At the far right, from the attribute name drop down, select rating and set the target role to label
1. Click on the Edit List button
  1. Make the left field user_id and at the right field, TYPE in user identification
  2. At the bottom hit Add Entry
  3. Make the new left field book_id and at the right field, TYPE in item identification
2. Hit Apply
Split data into train and test
1. Add the Split Data operator to the right window
2. Hook up output of Set Role to Split Data
3. Click on Split Data and hit the Edit Enumeration button in the top right
4. Add two entries
5. Type in the first one as .8 (this is the train set)
6. Type in the second one as .2 (this is the test set)
7. Hit OK
Add Recommender System algorithm
1. At the very top right, hit Extensions and go to the Marketplace
2. Type Recommender in the search bar
3. Install Recommender Extension and follow the instructions to install
Add User k-NN item recommender system
1. Find the Collaborative Filtering Item Recommendation/ User k-NN operator (will be in Extensions under Recommenders/Item Recommendation)
2. Drag this to the right window
3. Hook up the top output of the Split Data box to the input of the User k-NN box
Apply the model to train and test
1. Add Apply Model (Item Recommendation) operator to the right window
2. Hook up the Mod output of the User k-NN to the input Mod of the Apply Model box
3. Hook up the second par output of the Split Data box to the que input of the Apply Model box
4. Drag the res output of the Apply Model box to the res on the very far right of the window (the final output)
Hit the big blue Run button to view the output!
1. The model will recommend items (books) to users based on the books other users very similar to them have read
Please view the image below if you are stuck
View performance metrics
1. Delete the Apply Model box
2. Add the Performance (Item Recommendation) operator
3. Hook up the Mod output of the User k-NN box to the Mod input of the Performance box
4. Hook up the exa output of the User k-NN box to the tra input of the Performance box
5. Hook up the second par output of the Split Data box to the tes input of the Performance box
6. Hook up the per output of the Performance box to the res at the very far right of the window
Hit the big blue Run button to view the output!
1. The output will be a slew of performance metrics for the Item Recommendations
2. The AUC (Area Under the Curve) can be treated as an accuracy metric
Please view the image below if you are stuck
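For anyone curious about what the User k-NN operator is doing conceptually, here is a minimal pure-Python sketch of user-user collaborative filtering for item recommendation. It is my own toy illustration of the general technique and not the RapidMiner extension's actual implementation: the data, the cosine similarity on binary item sets, and the `k` and `top_n` defaults are all assumptions.

```python
import math

# Hypothetical toy data: user_id -> set of book_ids the user has rated.
ratings = {
    1: {10, 11, 12},
    2: {10, 11, 13},
    3: {11, 12, 14},
    4: {20, 21},
}


def cosine(items_a, items_b):
    """Cosine similarity between two users treated as binary item vectors."""
    if not items_a or not items_b:
        return 0.0
    return len(items_a & items_b) / math.sqrt(len(items_a) * len(items_b))


def recommend(user, data, k=2, top_n=3):
    """Rank unseen items by the summed similarity of the k most similar users."""
    neighbours = sorted(
        ((cosine(data[user], items), other) for other, items in data.items() if other != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbours:
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item in data[other] - data[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken by item id for a stable order.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend(1, ratings))
```

User 1 shares books with users 2 and 3, so the books those two have that user 1 has not seen become the recommendations. User 4 overlaps with nobody, so `recommend(4, ratings)` comes back empty.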

Notes and further exploration:

You can use this set up on any set of ratings as long as the input csv follows the following format (User_id, item_id, rating) and you make sure to set the roles to exactly user identification, item identification and label as explained in the steps above
You can predict the ratings on the test set instead of predicting good recommendations. Swap out the Item Recommendation User k-NN with Rating Prediction User k-NN if you would rather predict the ratings that users have given their books
Play around with Item k-NN or other operators. These operators find items that are most similar to other items in order to make recommendations. What we used above found most similar users to other users in order to recommend items
1. Please read more on recommender systems and techniques to make them. This post is meant to be a step by step guide for RapidMiner and not an explanation on recommender systems
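The rating prediction variant mentioned in the notes above can be sketched in the same spirit. Again, this is a toy illustration under my own assumptions (made-up ratings, cosine similarity over co-rated books, a plain similarity-weighted average) and not how the RapidMiner operator is implemented. Real systems usually also correct for each user's average rating.

```python
import math

# Hypothetical toy data: user_id -> {book_id: rating on a 1-5 scale}.
ratings = {
    1: {10: 5, 11: 3},
    2: {10: 5, 11: 3, 12: 4},
    3: {10: 1, 11: 5, 12: 2},
}


def similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item, data, k=2):
    """Similarity-weighted average of the k nearest neighbours' ratings for `item`."""
    neighbours = sorted(
        (
            (similarity(data[user], other_ratings), other_ratings[item])
            for other, other_ratings in data.items()
            if other != user and item in other_ratings
        ),
        reverse=True,
    )[:k]
    weight = sum(sim for sim, _ in neighbours)
    if weight == 0:
        return None  # no usable neighbours rated this item
    return sum(sim * rating for sim, rating in neighbours) / weight


print(predict_rating(1, 12, ratings))
```

User 1 rates books almost exactly like user 2 and quite differently from user 3, so the predicted rating for book 12 is pulled toward user 2's rating of 4 rather than user 3's rating of 2.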