Deep Music Genre Classification

Classify music by genre using neural networks in PyTorch.

Published

May 14, 2023

In this blog post, I will use PyTorch to train neural networks that classify songs by genre based on their lyrics and on engineered features that describe song attributes.

Data Preparation

To start off, I will read the data and turn it into Dataset objects that PyTorch can access.

# download dataset
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)
engineered_features = [
    'dating', 'violence', 'world/life', 'night/time', 'shake the audience',
    'family/gospel', 'romantic', 'communication', 'obscene', 'music',
    'movement/places', 'light/visual perceptions', 'family/spiritual',
    'like/girls', 'sadness', 'feelings', 'danceability', 'loudness',
    'acousticness', 'instrumentalness', 'valence', 'energy'
]
df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
0 0 mukesh mohabbat bhi jhoothi 1950 pop hold time feel break feel untrue convince spea... 95 0.000598 0.063746 0.000598 ... 0.380299 0.117175 0.357739 0.454119 0.997992 0.901822 0.339448 0.137110 sadness 1.0
1 4 frankie laine i believe 1950 pop believe drop rain fall grow believe darkest ni... 51 0.035537 0.096777 0.443435 ... 0.001284 0.001284 0.331745 0.647540 0.954819 0.000002 0.325021 0.263240 world/life 1.0
2 6 johnnie ray cry 1950 pop sweetheart send letter goodbye secret feel bet... 24 0.002770 0.002770 0.002770 ... 0.002770 0.225422 0.456298 0.585288 0.840361 0.000000 0.351814 0.139112 music 1.0
3 10 pérez prado patricia 1950 pop kiss lips want stroll charm mambo chacha merin... 54 0.048249 0.001548 0.001548 ... 0.225889 0.001548 0.686992 0.744404 0.083935 0.199393 0.775350 0.743736 romantic 1.0
4 12 giorgos papadopoulos apopse eida oneiro 1950 pop till darling till matter know till dream live ... 48 0.001350 0.001350 0.417772 ... 0.068800 0.001350 0.291671 0.646489 0.975904 0.000246 0.597073 0.394375 romantic 1.0

5 rows × 31 columns

I will look at three genres: hip hop, country, and rock. Since the genre labels are strings, I will encode them with integers.

genres = {
    "hip hop": 0, 
    "country": 1, 
    "rock": 2
}

df = df[df["genre"].apply(lambda x: x in genres.keys())].copy() # copy to avoid SettingWithCopyWarning
df["genre"] = df["genre"].apply(genres.get)
df.head()
Unnamed: 0 artist_name track_name release_date genre lyrics len dating violence world/life ... sadness feelings danceability loudness acousticness instrumentalness valence energy topic age
7042 20290 lefty frizzell if you've got the money i've got the time 1950 1 money time honky tonkin time night spot dance ... 59 0.022813 0.001074 0.001074 ... 0.001074 0.001074 0.523448 0.655488 0.833333 0.000095 0.955688 0.392373 night/time 1.0
7043 20292 lefty frizzell i want to be with you always 1950 1 lose blue heart stay go sing song wanna dear n... 24 0.002288 0.002288 0.002288 ... 0.205663 0.091285 0.705405 0.594980 0.777108 0.000229 0.717642 0.226202 music 1.0
7044 20293 lefty frizzell how long will it take (to stop loving you) 1950 1 long count star long climb mar long world stan... 34 0.001595 0.001595 0.119458 ... 0.001595 0.001595 0.780136 0.583109 0.892570 0.000052 0.706307 0.180155 night/time 1.0
7045 20297 lefty frizzell look what thoughts will do 1950 1 think love think love look thoughts today wear... 44 0.001253 0.001253 0.308536 ... 0.001253 0.039916 0.716235 0.609697 0.734939 0.000000 0.703215 0.249226 world/life 1.0
7046 20300 lefty frizzell treasure untold 1950 1 dream eye blue love forever long dear want nea... 36 0.001698 0.001698 0.140714 ... 0.001698 0.001698 0.703238 0.648848 0.685743 0.000000 0.384790 0.219195 romantic 1.0

5 rows × 31 columns

# create Dataset class using features of interest
from torch.utils.data import Dataset, DataLoader
import torch

engineered_feature_indices = [df.columns.get_loc(col_name) for col_name in engineered_features]

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.df = df
    
    def __getitem__(self, index): # get lyrics, engineered features, and genre labels
        lyrics = self.df.iloc[index, 5] # "lyrics" column
        engineered = self.df.iloc[index, engineered_feature_indices].tolist()
        labels = self.df.iloc[index, 4] # integer-encoded "genre" column
        return lyrics, engineered, labels

    def __len__(self):
        return len(self.df)                
# train-test split
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df,shuffle = True, test_size = 0.2)
train_data = TextDataFromDF(df_train)
test_data  = TextDataFromDF(df_test)

Each element of the dataset is a tuple containing the lyrics, the engineered features, and the integer genre label.

train_data[0]
('feet feet fight missiles spear thirtyone seventeen soldier thousand years catholic hindu atheist buddhist baptist know shouldn kill know kill friend fight fight fight fight russians fight japan think fight democracy fight reds say peace decide live see write wall hitler condemn dachau stand give body weapon kill universal soldier blame order come away come brothers',
 [0.0015037595085359,
  0.7092493819987846,
  0.0015037595507576,
  0.0015037594168285,
  0.0015037594030349,
  0.0015037594155051,
  0.0015037594066574,
  0.177928468124219,
  0.001503759427317,
  0.0015037594637539,
  0.0015037594607704,
  0.0015037594349133,
  0.0502747402292334,
  0.0015037594141537,
  0.0399910181175007,
  0.0015037594017228,
  0.4877071374417849,
  0.5141142989000845,
  0.780120261164921,
  0.0,
  0.414674361088211,
  0.208183478803342],
 2)

Text Vectorization

Next, I will vectorize the lyrics text using an approach similar to the one in our text classification lecture notes.

# create tokenizer and vocabulary
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter): 
  '''
  loop through our dataset and tokenize lyrics
  '''
  for lyrics, _, _ in data_iter:
      yield tokenizer(lyrics)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"], min_freq = 50) # only include tokens that appeared at least 50 times
vocab.set_default_index(vocab["<unk>"])
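
The resulting vocabulary keeps only the tokens that cleared the frequency threshold; its size also determines the padding index used later (a quick check):

len(vocab) # number of distinct tokens kept after filtering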

To check that the tokenizer and vocabulary are working correctly (note that rare words, those appearing fewer than 50 times, all map to index 0, the <unk> token):

tokenized = tokenizer(train_data[100][0])
print(tokenized)
print(vocab(tokenized))
['peepbo', 'peachblow', 'pandour', 'pompadour', 'pale', 'leaf', 'pink', 'sweet', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'peepbo', 'peachblow', 'pandour', 'pompadour', 'pale', 'leaf', 'pink', 'sweet', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'peepbo', 'peachblow', 'pandour', 'pompadour', 'pale', 'leaf', 'pink', 'sweet', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'predaria', 'predo', 'pradari', 'peepbo', 'peachblow', 'pandour', 'pompadour', 'pale', 'leaf', 'pink', 'sweet', 'peepbo', 'peachblow', 'pandour', 'pompadour', 'pale', 'leaf', 'pink', 'sweet', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'deer', 'peepbo', 'animal', 'peepbo', 'peepbo', 'peachblow', 'pandour']
[0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 798, 0, 0, 0, 0]

Batch Collation

For the last step of data preparation, I will create DataLoaders that pass batches of data to the training loop.

Before creating the DataLoaders, I will represent the lyrics as integers and pad them so that they all have the same length.

# determine max length of the lyrics, counted in whitespace-separated words
# (this should match the tokenizer's count here, since the lyrics are already stripped of punctuation)
max_len = 0
for data in train_data:
  length = len(data[0].split())
  if length > max_len:
    max_len = length
# represent lyrics with integers and pad them
num_tokens = len(vocab.get_itos())
def text_pipeline(x):
    tokens = vocab(tokenizer(x))
    # start from a vector of length max_len filled with the padding index (num_tokens),
    # then write the token indices into the front
    y = torch.zeros(max_len, dtype=torch.int64) + num_tokens
    y[0:len(tokens)] = torch.tensor(tokens, dtype=torch.int64)
    return y
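
As a quick check, the pipeline maps each lyric to a fixed-length integer tensor (a short sketch; the exact indices depend on the random train/test split):

example = text_pipeline(train_data[0][0])
print(example.shape) # torch.Size([max_len])
print(example[:10])  # the first ten token indices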
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def collate_batch(batch):
    lyrics_list, label_list = [], []
    engineered_tuple = ()
    for (_lyrics, _engineered, _label) in batch:
      # add lyrics to list
      processed_lyrics = text_pipeline(_lyrics)
      lyrics_list.append(processed_lyrics)

      # add engineered features to tuple
      engineered_tuple += (_engineered, )

      # add label to list
      label_list.append(_label)


    lyrics_tensor = torch.stack(lyrics_list)
    engineered_tensor = torch.tensor(engineered_tuple, dtype=torch.float32)
    label_tensor = torch.tensor(label_list, dtype=torch.int64)
    
    return lyrics_tensor.to(device), engineered_tensor.to(device), label_tensor.to(device)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_data, batch_size=16, shuffle=True, collate_fn=collate_batch)

Each batch of data returns three tensors: the lyrics, engineered features, and labels.

next(iter(train_loader))
(tensor([[ 190,    5,  190,  ..., 1372, 1372, 1372],
         [  35,  137,   62,  ..., 1372, 1372, 1372],
         [  84,   27,   27,  ..., 1372, 1372, 1372],
         ...,
         [   0,   27,  257,  ..., 1372, 1372, 1372],
         [  13,    3,    0,  ..., 1372, 1372, 1372],
         [  36,   57, 1041,  ..., 1372, 1372, 1372]], device='cuda:0'),
 tensor([[1.2563e-01, 8.2237e-04, 8.2237e-04, 7.9477e-02, 7.4051e-02, 8.2237e-04,
          1.8140e-02, 1.6188e-01, 8.2237e-04, 8.2237e-04, 8.2237e-04, 8.2237e-04,
          8.2237e-04, 8.2237e-04, 7.4127e-02, 4.5682e-01, 5.1045e-01, 7.6297e-01,
          7.2289e-01, 1.3158e-05, 7.5371e-01, 6.0359e-01],
         [1.3158e-03, 1.3158e-03, 1.2600e-01, 1.3158e-03, 1.3158e-03, 1.3158e-03,
          2.4707e-01, 1.3158e-03, 1.3158e-03, 1.3158e-03, 1.3158e-03, 1.3158e-03,
          1.3158e-03, 1.3158e-03, 6.0587e-01, 1.3158e-03, 3.4907e-01, 7.5292e-01,
          2.6606e-01, 2.6923e-05, 8.4852e-01, 6.2661e-01],
         [5.9137e-04, 5.4297e-01, 5.9137e-04, 5.9137e-04, 1.5789e-01, 4.0467e-02,
          5.9137e-04, 5.9137e-04, 5.9137e-04, 5.9137e-04, 5.9137e-04, 5.9137e-04,
          5.9137e-04, 5.9137e-04, 9.2487e-02, 5.9137e-04, 4.7688e-01, 6.6959e-01,
          1.1144e-02, 6.1741e-05, 3.2296e-01, 8.5185e-01],
         [6.2902e-02, 3.0960e-03, 3.0960e-03, 3.0960e-03, 3.0960e-03, 3.0960e-03,
          3.2338e-01, 3.0960e-03, 3.0960e-03, 3.0960e-03, 3.0960e-03, 8.1015e-02,
          1.3279e-01, 3.0960e-03, 1.8003e-01, 3.0960e-03, 3.2091e-01, 6.1941e-01,
          9.1767e-01, 4.1093e-03, 4.3941e-01, 3.2430e-01],
         [1.8149e-03, 1.9740e-01, 1.8149e-03, 1.8149e-03, 1.8149e-03, 1.8149e-03,
          1.8149e-03, 1.8149e-03, 1.8149e-03, 1.8149e-03, 3.3243e-01, 1.8149e-03,
          3.8736e-02, 1.8149e-03, 4.0422e-01, 1.8149e-03, 4.0214e-01, 4.4807e-01,
          7.0884e-01, 9.7368e-03, 1.7972e-01, 6.6265e-01],
         [5.7208e-04, 5.7208e-04, 5.7208e-04, 5.7208e-04, 5.7208e-04, 1.2899e-02,
          1.3940e-02, 7.4754e-02, 4.6015e-01, 5.7208e-04, 3.5104e-02, 5.5986e-02,
          5.7208e-04, 5.7208e-04, 2.9759e-01, 5.7208e-04, 6.7075e-01, 7.0010e-01,
          7.0481e-02, 0.0000e+00, 8.1451e-01, 7.1971e-01],
         [1.8797e-03, 3.7041e-01, 1.8797e-03, 8.4183e-02, 1.8797e-03, 1.8797e-03,
          3.2794e-02, 1.5335e-01, 1.8797e-03, 1.8797e-03, 1.8797e-03, 1.8233e-01,
          1.8797e-03, 1.8797e-03, 1.8797e-03, 6.3712e-02, 3.2308e-01, 6.6167e-01,
          9.1456e-03, 6.7510e-03, 2.1476e-01, 9.4995e-01],
         [2.5063e-03, 2.5063e-03, 2.5063e-03, 2.5063e-03, 2.5063e-03, 2.5063e-03,
          7.8521e-02, 4.6253e-01, 2.5063e-03, 2.5063e-03, 2.5063e-03, 2.5063e-03,
          6.0003e-02, 2.5063e-03, 3.1478e-01, 2.5063e-03, 6.0576e-01, 5.0617e-01,
          8.9859e-01, 3.4312e-05, 4.8784e-01, 1.3311e-01],
         [8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 8.9206e-04, 8.2017e-02,
          9.4692e-02, 6.1993e-02, 5.8138e-02, 8.9206e-04, 2.1777e-01, 8.9206e-04,
          8.9206e-04, 8.9206e-04, 4.2052e-01, 8.9206e-04, 4.4330e-01, 7.4281e-01,
          7.7711e-05, 1.2348e-01, 6.3108e-01, 7.3573e-01],
         [1.0121e-03, 1.0121e-03, 1.0121e-03, 1.0121e-03, 1.0121e-03, 1.0121e-03,
          1.0121e-03, 1.0121e-03, 4.1138e-01, 1.0121e-03, 3.1253e-01, 1.0121e-03,
          1.0121e-03, 1.0121e-03, 1.5268e-01, 1.0823e-01, 4.3247e-01, 5.3480e-01,
          2.3494e-01, 0.0000e+00, 7.2692e-01, 6.5364e-01],
         [2.3923e-03, 2.3923e-03, 2.3923e-03, 2.3923e-03, 4.7847e-02, 9.7432e-02,
          2.3923e-03, 2.3923e-03, 5.6130e-02, 2.3923e-03, 2.3923e-03, 9.4145e-02,
          2.3923e-03, 2.3923e-03, 5.7453e-01, 2.3923e-03, 6.7183e-01, 7.2410e-01,
          4.9196e-02, 2.0445e-02, 9.7012e-01, 8.3983e-01],
         [1.5949e-03, 1.5949e-03, 1.5949e-03, 1.5949e-03, 1.5949e-03, 1.5949e-03,
          6.0752e-02, 1.5949e-03, 1.5949e-03, 4.3576e-01, 3.8856e-01, 1.5949e-03,
          1.5949e-03, 1.5949e-03, 1.5949e-03, 2.9081e-02, 3.2850e-01, 7.1048e-01,
          1.3253e-01, 8.6943e-03, 4.5486e-01, 7.1971e-01],
         [1.4620e-03, 3.9148e-01, 1.4620e-03, 1.4620e-03, 1.4620e-03, 1.4620e-03,
          1.4620e-03, 1.4620e-03, 1.4620e-03, 1.4620e-03, 1.4620e-03, 1.1630e-01,
          7.2153e-02, 1.4620e-03, 3.9813e-01, 1.4620e-03, 4.9637e-01, 6.0718e-01,
          9.1968e-01, 0.0000e+00, 2.1373e-01, 1.4512e-01],
         [1.5480e-03, 1.5480e-03, 1.5480e-03, 8.1524e-02, 1.5480e-03, 1.5480e-03,
          3.1767e-02, 9.7475e-02, 1.5480e-03, 1.5480e-03, 1.5480e-03, 1.5480e-03,
          1.5480e-03, 1.5480e-03, 7.0748e-01, 1.5480e-03, 5.7327e-01, 5.8839e-01,
          8.8153e-01, 2.1559e-02, 5.8574e-01, 2.8927e-01],
         [1.7544e-03, 1.7544e-03, 3.9870e-01, 2.8864e-01, 1.7544e-03, 1.7544e-03,
          1.7544e-03, 1.7544e-03, 1.7544e-03, 1.7544e-03, 1.3114e-01, 1.5521e-01,
          1.7544e-03, 1.7544e-03, 1.7544e-03, 1.7544e-03, 4.5197e-01, 6.3493e-01,
          8.0321e-01, 1.3462e-04, 5.8883e-01, 3.7335e-01],
         [1.0965e-03, 1.0965e-03, 5.9253e-02, 1.0965e-03, 1.0965e-03, 8.3288e-02,
          1.0965e-03, 1.0965e-03, 1.0965e-03, 1.0965e-03, 1.2537e-01, 1.0965e-03,
          1.0965e-03, 1.1992e-01, 4.6606e-01, 1.3185e-01, 5.4186e-01, 7.6104e-01,
          2.7107e-02, 4.5445e-03, 4.8475e-01, 9.7097e-01]], device='cuda:0'),
 tensor([1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2], device='cuda:0'))
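
A quick shape check confirms the collation (a sketch; with batch size 16, max_len lyric positions, and 22 engineered features):

lyrics, engineered, label = next(iter(train_loader))
print(lyrics.shape)     # torch.Size([16, max_len])
print(engineered.shape) # torch.Size([16, 22])
print(label.shape)      # torch.Size([16])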

Training Models

I will train three neural networks on the data:

1. A model that uses only the lyrics as features.
2. A model that uses only the engineered features.
3. A model that uses both the lyrics and the engineered features.

Before building the models, I will define the training and testing loops.

# training loop
learning_rate = 0.01

def train(dataloader, feature_choice = "l", k_epochs = 1, print_every = 50):
    # select model based on feature choice
    if feature_choice == "l": # lyrics only model
      model = lyrics_model
    elif feature_choice == "e": # engineered features only
      model = engineered_model
    elif feature_choice == "b": # both lyrics and engineered features
      model = both_model

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(k_epochs): 
      running_loss = 0.0
      for idx, (lyrics, engineered, label) in enumerate(dataloader):
          # for calculating accuracy; reset each batch, so the printed
          # accuracy reflects only the most recent batch
          correct, total = 0, 0

          # zero gradients
          optimizer.zero_grad()
          # form prediction on batch, using chosen features and models
          if feature_choice == "l": # lyrics only model
            predicted_label = model(lyrics)
          elif feature_choice == "e": # engineered features only
            predicted_label = model(engineered)
          elif feature_choice == "b": # both lyrics and engineered features
            predicted_label = model(lyrics, engineered)
          
          # evaluate loss on prediction
          loss = loss_fn(predicted_label, label)
          # compute gradient
          loss.backward()
          # take an optimization step
          optimizer.step()

          # update running loss
          running_loss += loss.item()

          # for printing accuracy
          correct += (predicted_label.argmax(1) == label).sum().item()
          total   += label.size(0)

          if idx % print_every == print_every - 1:    
            print(f'[epoch: {epoch + 1}, batches: {idx + 1:5d}], loss: {running_loss / print_every:.3f}, accuracy:{correct/total:.3f}')
            running_loss = 0.0

    print("Finished Training") 
    
    
# testing loop
def test(dataloader, feature_choice = "l"):
    correct, total = 0, 0

    with torch.no_grad():
        for idx, (lyrics, engineered, label) in enumerate(dataloader):
            # form prediction on batch, using chosen features and models
            if feature_choice == "l": # lyrics only model
              predicted_label = lyrics_model(lyrics)
            elif feature_choice == "e": # engineered features only
              predicted_label = engineered_model(engineered)
            elif feature_choice == "b": # both lyrics and engineered features
              predicted_label = both_model(lyrics, engineered)

            correct += (predicted_label.argmax(1) == label).sum().item()
            total   += label.size(0)

    print(f'Test accuracy: {100 * correct // total} %')
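
One caveat: this test loop never switches the networks into evaluation mode, so any dropout layers stay active while accuracy is measured. The usual PyTorch pattern is to toggle modes around evaluation (a sketch, using the lyrics-only model defined below as an example):

lyrics_model.eval()  # disable dropout for deterministic evaluation
test(test_loader, "l")
lyrics_model.train() # re-enable dropout before any further training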

Lyrics Only

I will use a simple model with a word embedding layer to classify songs based only on their lyrics.

from torch import nn

# define the model
class LyricsModel(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim) # +1 row for the padding index
        self.fc = nn.Linear(max_len*embedding_dim, num_class)
        
    def forward(self, x):
        x = self.embedding(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x
# instantiate the model
embedding_dim = 3

lyrics_model = LyricsModel(len(vocab), embedding_dim, max_len, 3).to(device)
# train the model
k_epochs = 10

train(train_loader, "l", k_epochs, 200)
[epoch: 1, batches:   200], loss: 1.192, accuracy:0.500
[epoch: 1, batches:   400], loss: 1.039, accuracy:0.312
[epoch: 2, batches:   200], loss: 0.818, accuracy:0.562
[epoch: 2, batches:   400], loss: 0.797, accuracy:0.688
[epoch: 3, batches:   200], loss: 0.628, accuracy:0.688
[epoch: 3, batches:   400], loss: 0.649, accuracy:0.812
[epoch: 4, batches:   200], loss: 0.509, accuracy:0.500
[epoch: 4, batches:   400], loss: 0.542, accuracy:0.938
[epoch: 5, batches:   200], loss: 0.436, accuracy:0.750
[epoch: 5, batches:   400], loss: 0.497, accuracy:0.812
[epoch: 6, batches:   200], loss: 0.409, accuracy:0.750
[epoch: 6, batches:   400], loss: 0.456, accuracy:0.750
[epoch: 7, batches:   200], loss: 0.365, accuracy:0.875
[epoch: 7, batches:   400], loss: 0.410, accuracy:0.688
[epoch: 8, batches:   200], loss: 0.346, accuracy:0.875
[epoch: 8, batches:   400], loss: 0.383, accuracy:0.812
[epoch: 9, batches:   200], loss: 0.312, accuracy:0.875
[epoch: 9, batches:   400], loss: 0.382, accuracy:0.938
[epoch: 10, batches:   200], loss: 0.326, accuracy:0.875
[epoch: 10, batches:   400], loss: 0.363, accuracy:0.750
Finished Training
# test accuracy
test(test_loader, "l")
Test accuracy: 65 %

The test accuracy is noticeably lower than the training accuracy, which suggests overfitting.
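
Since the test function accepts any DataLoader, the train/test gap can be measured directly (a quick sketch; the numbers will vary with the random split):

test(train_loader, "l") # accuracy on the training data, for comparison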

I will add two dropout layers to the network, and replace the flattening step with mean pooling over tokens, to see if this reduces overfitting.

class LyricsModelDropout(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim) # +1 row for the padding index
        self.dropout1 = nn.Dropout(0.2)
        self.fc = nn.Linear(embedding_dim, num_class) # embedding_dim inputs after mean pooling
        self.dropout2 = nn.Dropout(0.2)
        
    def forward(self, x):
        x = self.embedding(x)
        x = self.dropout1(x)
        x = x.mean(axis = 1) # average over tokens: one embedding_dim vector per song
        x = self.fc(x)
        x = self.dropout2(x)
        return x
lyrics_model = LyricsModelDropout(len(vocab), embedding_dim, max_len, 3).to(device)
# train the model
k_epochs = 10

train(train_loader, "l", k_epochs, 200)
[epoch: 1, batches:   200], loss: 0.953, accuracy:0.562
[epoch: 1, batches:   400], loss: 0.896, accuracy:0.562
[epoch: 2, batches:   200], loss: 0.854, accuracy:0.500
[epoch: 2, batches:   400], loss: 0.825, accuracy:0.625
[epoch: 3, batches:   200], loss: 0.743, accuracy:0.750
[epoch: 3, batches:   400], loss: 0.732, accuracy:0.500
[epoch: 4, batches:   200], loss: 0.691, accuracy:0.625
[epoch: 4, batches:   400], loss: 0.676, accuracy:0.750
[epoch: 5, batches:   200], loss: 0.643, accuracy:0.688
[epoch: 5, batches:   400], loss: 0.645, accuracy:0.688
[epoch: 6, batches:   200], loss: 0.617, accuracy:0.875
[epoch: 6, batches:   400], loss: 0.630, accuracy:0.750
[epoch: 7, batches:   200], loss: 0.608, accuracy:0.812
[epoch: 7, batches:   400], loss: 0.612, accuracy:0.875
[epoch: 8, batches:   200], loss: 0.600, accuracy:0.562
[epoch: 8, batches:   400], loss: 0.598, accuracy:0.750
[epoch: 9, batches:   200], loss: 0.591, accuracy:0.875
[epoch: 9, batches:   400], loss: 0.585, accuracy:0.812
[epoch: 10, batches:   200], loss: 0.574, accuracy:0.750
[epoch: 10, batches:   400], loss: 0.583, accuracy:0.688
Finished Training
# test accuracy
test(test_loader, "l")
Test accuracy: 67 %

The dropout layers did reduce overfitting, and the test accuracy improved by 2 percentage points.

The base rate for our classification across three genres is 33.3%, so an accuracy of 67% suggests that the model is doing about twice as well as random guessing.
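
Note that 33.3% is the right baseline only if the three genres are roughly balanced; this is easy to check on the filtered DataFrame (a quick sketch):

df["genre"].value_counts(normalize = True) # fraction of songs in each genre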

Engineered Only

Next, I will train some models that take in only the engineered features of the songs.

import torch.nn.functional as F

class EngineeredModel(nn.Module):
    
    def __init__(self, num_class):
        super().__init__()
        self.fc1 = nn.Linear(22, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, num_class)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
engineered_model = EngineeredModel(3).to(device)
learning_rate = 0.0001
k_epochs = 10
train(train_loader, "e", k_epochs, 200)
[epoch: 1, batches:   200], loss: 1.096, accuracy:0.250
[epoch: 1, batches:   400], loss: 0.988, accuracy:0.812
[epoch: 2, batches:   200], loss: 0.860, accuracy:0.750
[epoch: 2, batches:   400], loss: 0.802, accuracy:0.750
[epoch: 3, batches:   200], loss: 0.749, accuracy:0.625
[epoch: 3, batches:   400], loss: 0.713, accuracy:0.688
[epoch: 4, batches:   200], loss: 0.711, accuracy:0.562
[epoch: 4, batches:   400], loss: 0.677, accuracy:0.688
[epoch: 5, batches:   200], loss: 0.671, accuracy:0.562
[epoch: 5, batches:   400], loss: 0.653, accuracy:0.750
[epoch: 6, batches:   200], loss: 0.648, accuracy:0.875
[epoch: 6, batches:   400], loss: 0.643, accuracy:0.750
[epoch: 7, batches:   200], loss: 0.623, accuracy:0.875
[epoch: 7, batches:   400], loss: 0.624, accuracy:0.562
[epoch: 8, batches:   200], loss: 0.608, accuracy:0.688
[epoch: 8, batches:   400], loss: 0.614, accuracy:0.875
[epoch: 9, batches:   200], loss: 0.612, accuracy:0.625
[epoch: 9, batches:   400], loss: 0.608, accuracy:0.562
[epoch: 10, batches:   200], loss: 0.603, accuracy:0.875
[epoch: 10, batches:   400], loss: 0.607, accuracy:0.750
Finished Training
test(test_loader, "e")
Test accuracy: 74 %

The accuracy we got using the engineered features is higher than what we got with the lyrics. However, the model seemed to struggle to keep improving its accuracy toward the end of training.

Adding another fully connected layer improved the test accuracy by 2 percentage points:

class EngineeredModelMoreFc(nn.Module):
    
    def __init__(self, num_class):
        super().__init__()
        self.fc1 = nn.Linear(22, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, num_class)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)

        return x
engineered_model = EngineeredModelMoreFc(3).to(device)
train(train_loader, "e", k_epochs, 200)
[epoch: 1, batches:   200], loss: 1.099, accuracy:0.562
[epoch: 1, batches:   400], loss: 0.925, accuracy:0.750
[epoch: 2, batches:   200], loss: 0.792, accuracy:0.688
[epoch: 2, batches:   400], loss: 0.715, accuracy:0.938
[epoch: 3, batches:   200], loss: 0.672, accuracy:0.875
[epoch: 3, batches:   400], loss: 0.661, accuracy:0.938
[epoch: 4, batches:   200], loss: 0.645, accuracy:0.750
[epoch: 4, batches:   400], loss: 0.614, accuracy:0.812
[epoch: 5, batches:   200], loss: 0.611, accuracy:0.750
[epoch: 5, batches:   400], loss: 0.616, accuracy:0.938
[epoch: 6, batches:   200], loss: 0.589, accuracy:0.812
[epoch: 6, batches:   400], loss: 0.599, accuracy:0.750
[epoch: 7, batches:   200], loss: 0.600, accuracy:0.562
[epoch: 7, batches:   400], loss: 0.588, accuracy:0.688
[epoch: 8, batches:   200], loss: 0.587, accuracy:0.688
[epoch: 8, batches:   400], loss: 0.595, accuracy:0.688
[epoch: 9, batches:   200], loss: 0.568, accuracy:0.875
[epoch: 9, batches:   400], loss: 0.589, accuracy:0.750
[epoch: 10, batches:   200], loss: 0.578, accuracy:0.688
[epoch: 10, batches:   400], loss: 0.584, accuracy:0.688
Finished Training
test(test_loader, "e")
Test accuracy: 76 %

Using only one hidden layer, on the other hand, decreased the test accuracy:

# one hidden layer with 32 outputs
class EngineeredModel1Fc(nn.Module):
    
    def __init__(self, num_class):
        super().__init__()
        self.fc1 = nn.Linear(22, 32)
        self.fc2 = nn.Linear(32, num_class)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
engineered_model = EngineeredModel1Fc(3).to(device)
train(train_loader, "e", k_epochs, 200) 
[epoch: 1, batches:   200], loss: 1.119, accuracy:0.562
[epoch: 1, batches:   400], loss: 1.056, accuracy:0.438
[epoch: 2, batches:   200], loss: 0.964, accuracy:0.625
[epoch: 2, batches:   400], loss: 0.923, accuracy:0.500
[epoch: 3, batches:   200], loss: 0.889, accuracy:0.438
[epoch: 3, batches:   400], loss: 0.853, accuracy:0.500
[epoch: 4, batches:   200], loss: 0.837, accuracy:0.625
[epoch: 4, batches:   400], loss: 0.811, accuracy:0.688
[epoch: 5, batches:   200], loss: 0.785, accuracy:0.688
[epoch: 5, batches:   400], loss: 0.785, accuracy:0.875
[epoch: 6, batches:   200], loss: 0.744, accuracy:0.750
[epoch: 6, batches:   400], loss: 0.753, accuracy:0.562
[epoch: 7, batches:   200], loss: 0.715, accuracy:0.438
[epoch: 7, batches:   400], loss: 0.726, accuracy:0.688
[epoch: 8, batches:   200], loss: 0.703, accuracy:0.688
[epoch: 8, batches:   400], loss: 0.704, accuracy:0.375
[epoch: 9, batches:   200], loss: 0.681, accuracy:0.688
[epoch: 9, batches:   400], loss: 0.690, accuracy:0.812
[epoch: 10, batches:   200], loss: 0.671, accuracy:0.812
[epoch: 10, batches:   400], loss: 0.666, accuracy:0.688
Finished Training
test(test_loader, "e")
Test accuracy: 72 %

It looks like accuracy increased with the number of layers, so I will try another model with even more layers.

# five fully connected layers (renamed so it doesn't overwrite the four-layer model above)
class EngineeredModelEvenMoreFc(nn.Module):
    
    def __init__(self, num_class):
        super().__init__()
        self.fc1 = nn.Linear(22, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, num_class)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.fc5(x)

        return x
engineered_model = EngineeredModelEvenMoreFc(3).to(device)
train(train_loader, "e", k_epochs, 200) 
[epoch: 1, batches:   200], loss: 1.137, accuracy:0.250
[epoch: 1, batches:   400], loss: 0.958, accuracy:0.625
[epoch: 2, batches:   200], loss: 0.845, accuracy:0.562
[epoch: 2, batches:   400], loss: 0.769, accuracy:0.750
[epoch: 3, batches:   200], loss: 0.733, accuracy:0.562
[epoch: 3, batches:   400], loss: 0.669, accuracy:0.625
[epoch: 4, batches:   200], loss: 0.643, accuracy:0.938
[epoch: 4, batches:   400], loss: 0.636, accuracy:0.625
[epoch: 5, batches:   200], loss: 0.627, accuracy:0.875
[epoch: 5, batches:   400], loss: 0.626, accuracy:0.812
[epoch: 6, batches:   200], loss: 0.605, accuracy:0.625
[epoch: 6, batches:   400], loss: 0.624, accuracy:0.625
[epoch: 7, batches:   200], loss: 0.613, accuracy:0.812
[epoch: 7, batches:   400], loss: 0.612, accuracy:0.562
[epoch: 8, batches:   200], loss: 0.595, accuracy:0.812
[epoch: 8, batches:   400], loss: 0.594, accuracy:0.688
[epoch: 9, batches:   200], loss: 0.589, accuracy:0.812
[epoch: 9, batches:   400], loss: 0.594, accuracy:0.688
[epoch: 10, batches:   200], loss: 0.579, accuracy:0.688
[epoch: 10, batches:   400], loss: 0.591, accuracy:0.875
Finished Training
test(test_loader, "e")
Test accuracy: 75 %

We didn’t get better accuracy with one more layer; it seems that the benefit of making the model more complex has leveled off.

In conclusion, the best performing model is the one with four fully connected layers, with a test accuracy of 76%.

Lyrics + Engineered Features

Finally, I will try a model that takes in both the lyrics and the engineered features. This model processes the two inputs separately: the lyrics are passed through an embedding layer and a dropout layer, while the engineered features go through three fully connected layers. The two outputs are then concatenated into one tensor and passed through two more fully connected layers, with a dropout layer in between.

class CombinedNet(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.fc1 = nn.Linear(22, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(613, 32) # 613 = max_len*embedding_dim + 16 (here 199*3 + 16)
        self.fc5 = nn.Linear(32, num_class)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x_1, x_2):
        
        # text pipeline
        x_1 = self.embedding(x_1)
        x_1 = self.dropout(x_1)
        x_1 = torch.flatten(x_1, 1)

        # engineered features
        x_2 = F.relu(self.fc1(x_2))
        x_2 = F.relu(self.fc2(x_2))
        x_2 = self.fc3(x_2)

        # concatenate the 2-d lyrics and engineered-feature representations
        x = torch.cat((x_1, x_2), 1)
        # pass x through a couple more fully-connected layers and return output
        x = F.relu(self.fc4(x))
        x = self.dropout(x)
        x = self.fc5(x)
        return x
both_model = CombinedNet(len(vocab), embedding_dim, 3).to(device)
learning_rate = 0.001
k_epochs = 10
train(train_loader, "b", k_epochs, 200)
[epoch: 1, batches:   200], loss: 0.857, accuracy:0.562
[epoch: 1, batches:   400], loss: 0.727, accuracy:0.750
[epoch: 2, batches:   200], loss: 0.702, accuracy:0.750
[epoch: 2, batches:   400], loss: 0.665, accuracy:0.812
[epoch: 3, batches:   200], loss: 0.666, accuracy:0.625
[epoch: 3, batches:   400], loss: 0.652, accuracy:0.812
[epoch: 4, batches:   200], loss: 0.617, accuracy:0.938
[epoch: 4, batches:   400], loss: 0.586, accuracy:0.750
[epoch: 5, batches:   200], loss: 0.562, accuracy:0.812
[epoch: 5, batches:   400], loss: 0.562, accuracy:0.688
[epoch: 6, batches:   200], loss: 0.536, accuracy:0.812
[epoch: 6, batches:   400], loss: 0.548, accuracy:0.625
[epoch: 7, batches:   200], loss: 0.539, accuracy:0.938
[epoch: 7, batches:   400], loss: 0.527, accuracy:0.750
[epoch: 8, batches:   200], loss: 0.521, accuracy:0.688
[epoch: 8, batches:   400], loss: 0.518, accuracy:0.750
[epoch: 9, batches:   200], loss: 0.491, accuracy:0.812
[epoch: 9, batches:   400], loss: 0.506, accuracy:0.812
[epoch: 10, batches:   200], loss: 0.489, accuracy:1.000
[epoch: 10, batches:   400], loss: 0.501, accuracy:0.812
Finished Training
test(test_loader, "b")
Test accuracy: 74 %

The combined model has similar test performance to the model that uses only the engineered features. Although I added some dropout layers, there is still some overfitting, so in the next model I will add more dropout layers.

class CombinedNetMoreDropout(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.fc1 = nn.Linear(22, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(613, 32) # 613 = max_len*embedding_dim + 16 (here 199*3 + 16)
        self.fc5 = nn.Linear(32, num_class)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x_1, x_2):
        
        # text pipeline
        x_1 = self.embedding(x_1)
        x_1 = self.dropout(x_1)
        x_1 = torch.flatten(x_1, 1)

        # engineered features
        x_2 = F.relu(self.fc1(x_2))
        x_2 = self.dropout(x_2)
        x_2 = F.relu(self.fc2(x_2))
        x_2 = self.dropout(x_2)
        x_2 = self.fc3(x_2)

        # concatenate the 2-d lyrics and engineered-feature representations
        x = torch.cat((x_1, x_2), 1)
        # pass x through a couple more fully-connected layers and return output
        x = F.relu(self.fc4(x))
        x = self.dropout(x)
        x = self.fc5(x)
        return x
both_model = CombinedNetMoreDropout(len(vocab), embedding_dim, 3).to(device)
k_epochs = 10
train(train_loader, "b", k_epochs, 200)
[epoch: 1, batches:   200], loss: 0.826, accuracy:0.875
[epoch: 1, batches:   400], loss: 0.701, accuracy:0.750
[epoch: 2, batches:   200], loss: 0.640, accuracy:0.750
[epoch: 2, batches:   400], loss: 0.607, accuracy:0.750
[epoch: 3, batches:   200], loss: 0.568, accuracy:0.875
[epoch: 3, batches:   400], loss: 0.568, accuracy:0.688
[epoch: 4, batches:   200], loss: 0.523, accuracy:0.625
[epoch: 4, batches:   400], loss: 0.530, accuracy:0.812
[epoch: 5, batches:   200], loss: 0.521, accuracy:0.625
[epoch: 5, batches:   400], loss: 0.504, accuracy:0.688
[epoch: 6, batches:   200], loss: 0.487, accuracy:0.750
[epoch: 6, batches:   400], loss: 0.493, accuracy:0.625
[epoch: 7, batches:   200], loss: 0.450, accuracy:0.875
[epoch: 7, batches:   400], loss: 0.475, accuracy:0.812
[epoch: 8, batches:   200], loss: 0.449, accuracy:0.625
[epoch: 8, batches:   400], loss: 0.467, accuracy:0.875
[epoch: 9, batches:   200], loss: 0.435, accuracy:0.750
[epoch: 9, batches:   400], loss: 0.459, accuracy:0.750
[epoch: 10, batches:   200], loss: 0.424, accuracy:0.750
[epoch: 10, batches:   400], loss: 0.431, accuracy:0.812
Finished Training
test(test_loader, "b")
Test accuracy: 74 %

Summary

In summary, the best test accuracies achieved by our three types of models are:

  • Lyrics only: 67%
  • Engineered features only: 76%
  • Lyrics and engineered features: 74%

All of these models perform better than the base rate of 33.3%, but there is still room for improvement. In general, the engineered features are better predictors of genre than the lyrics. Adding the lyrics on top of the engineered features did not boost accuracy compared to using the engineered features alone.

Visualizing Word Embeddings

In this section, I will visualize the word embeddings learned by my lyrics model.

# extract the embedding matrix from the lyrics model
embedding_matrix = lyrics_model.embedding.cpu().weight.data.numpy()
# extract words from vocab
tokens = vocab.get_itos()
tokens.append(" ") # placeholder for the extra padding-index row of the embedding matrix
# represent embedding in two dimensions
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
weights = pca.fit_transform(embedding_matrix)
weights
array([[ 1.6995059 ,  0.86947083],
       [ 1.9406745 ,  2.9961193 ],
       [-0.12173966,  2.8156223 ],
       ...,
       [ 2.7143688 , -0.7970013 ],
       [-0.44954112, -2.4305842 ],
       [ 0.16685873,  0.17623016]], dtype=float32)
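
Since the embeddings are only 3-dimensional to begin with, the 2-d PCA projection should retain most of their structure; the fitted PCA object can confirm this (a quick sketch):

print(pca.explained_variance_ratio_) # fraction of variance captured by each component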
# turn into dataframe
embedding_df = pd.DataFrame({
    "word": tokens, 
    "x0": weights[:,0], 
    "x1": weights[:,1]
})
# visualize
import plotly.express as px 
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import numpy as np

fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 10,
                 hover_name = "word")

fig.show()

The lyrics model only reached a test accuracy of around 67%, which suggests that the word embeddings may not have learned as much as we would like.

However, we can still see a distinction among the three genres:

  • Words at the top corner are typical of hip hop lyrics.
  • Words at the bottom corner, around 0 on the x0 axis, seem more related to rock and carry a negative valence, as typical rock music does (e.g., band, pain, desperate, nightmare).
  • Words at the left corner appear more in country music (e.g., cowboy, road, Texas).

Finally, I will visualize any clusters that the embedding model learned by running a k-means clustering analysis.

from sklearn.cluster import KMeans

# perform k-means clustering on the weights matrix
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(weights) # n_init set explicitly to match the pre-1.4 sklearn default
labels = kmeans.labels_
# scatter plot with cluster information
embedding_df = pd.DataFrame({
    "word": tokens, 
    "x0": weights[:,0], 
    "x1": weights[:,1], 
    "cluster": labels
})

fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "cluster",
                 size = list(np.ones(len(embedding_df))),
                 size_max = 10,
                 hover_name = "word")

fig.show()

Looking at the words in the three clusters, it makes sense that the red cluster includes more “country” lyrics, while the blue cluster is hip hop and the yellow cluster is rock.
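
To see which words landed in each cluster, we can sample from the DataFrame (a sketch; the cluster numbering is arbitrary and will vary between runs):

for c in range(3):
    words = embedding_df[embedding_df["cluster"] == c]["word"]
    print(c, words.sample(10).tolist())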

Conclusion

To conclude, I have experimented with using different features and models to classify music by genre. The best performing model used only engineered features about the audio characteristics of the music and achieved a 76% test accuracy.

Although the models that used lyrics data did not perform better, we saw that they learned meaningful word embeddings that reflect the lyrical styles of the three genres.

For this attempt, I chose three genres that are quite distinct from each other in terms of lyrics and musical style. The models may face more of a challenge on data that include more genres, or genres that differ in subtler ways.