Naive Bayes From Scratch

In an effort to reinforce my understanding of statistical decision making, I wanted to build a naive bayes classifier without using a stats library like Sklearn. In the code that follows, you can see the results of my efforts. I was able to build a classifier that performs as well as sklearn’s GaussianNB classifier.

Importing and Cleaning the Data

I’m using data from the MLB Stats API that I’ve previously collected for a separate, ongoing project. For this example, we’ll use the end speed and spin rate of mlb pitches to classify them as either a fastball or not a fastball. End speed refers to the velocity of the ball as it crosses the plate, and spin rate refers to the revolutions per minute of the ball after it’s released.

If you want to use this data for your own purposes, you can find it on GitHub.

import os
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
sns.set_style('whitegrid')
%config InlineBackend.figure_format = 'retina'

sns.set(color_codes=True)
sns.set(rc={'figure.figsize':(12,10)})
path = os.getcwd().strip('Bayes')+'FetchData/datasets/'

df = pd.read_csv(path+'pitchSpeed_spinRate.csv')
print(df.shape)
df.dropna(inplace=True)

df.head()
(1910986, 3)
endSpeed spinRate pitchType
0 87.7 1926.0 Four-Seam Fastball
1 89.0 1872.0 Four-Seam Fastball
2 83.9 1661.0 Changeup
3 88.3 1934.0 Two-Seam Fastball
4 88.9 1997.0 Four-Seam Fastball

Certain pitches aren’t useful for our purposes.

  • ‘Automatic Balls’ are not actually thrown. They’re used when a pitcher strategically decides to intentionally walk a batter
  • A ‘Pitchout’ is a throw away pitch used when you want to catch a runner stealing
  • An Eephus is a rare and hilarious trick pitch
# drop outlier pitch types
pitchTypes_toDrop = ['Automatic Ball','Pitchout','Eephus']
i = df[df['pitchType'].isin(pitchTypes_toDrop)].index
df.drop(index=i,inplace=True)

df.shape

To make things simpler, we’re only going to classify ‘fastball’ vs ‘not fastball’.

def is_fastball(x):
    if 'Fastball' in x['pitchType']:
        return 'fastball'
    else:
        return 'not fastball'
df['is_fastball'] = df.apply(lambda x: is_fastball(x),axis=1)
df['is_fastball'].value_counts()
not fastball    974122
fastball        873880
Name: is_fastball, dtype: int64

Explore the Data

First, we want to see if our features our normally distributed. That way, we can use the Gaussian Probability Density function to estimate the likelihood of observing a particular value in our feature vectors.

f, axes = plt.subplots(1,2)
sns.distplot(df['endSpeed'],ax=axes[0])
sns.distplot(df['spinRate'],ax=axes[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1a239e8e48>

png

Next, we can use a simple scatter plot to see if our features provide reasonable separation between their respective classes.

fig, ax = plt.subplots()
ax.set(xscale='symlog',yscale='symlog')
sns.scatterplot(x=df['endSpeed'],y=df['spinRate'],hue=df['is_fastball'])
<matplotlib.axes._subplots.AxesSubplot at 0x10c11a4a8>

png

Building our Classifier

bayes_image.png

‘A’ is our class value (fastball vs not fastball) and ‘B’ are our predictive features (end speed and spin rate)

Breaking Down Bayes’ Theorem

We need to calculate the probability for each class, fastball and not fastball, given our input data. Then, we choose the class that has the highest probability. We can ignore the denominator in this case. We only care about which probability is higher, and the denominator would be same value in both equations.

P(A) is our “prior” probability, and it’s very easy to calculate. In our case, it’s the number of fastballs (or non-fastballs) divided by the total pitches in the dataset.

def is_fastball(x):
    if 'Fastball' in x['pitchType']:
        return 'fastball'
    else:
        return 'not fastball'
df['is_fastball'] = df.apply(lambda x: is_fastball(x),axis=1)
df['is_fastball'].value_counts()
not fastball    974122
fastball        873880
Name: is_fastball, dtype: int64
P(B A) is our “liklihood”, and that’s a little more complicated. We need to determine the liklihood of observing oura particular value in our input vectors given a class value. Since our features our roughly normally distributed, we can use the Gaussian Probability Density Function. We only need the mean and standard deviation of a class vector the calculate liklihood.
# Create a function that calculates p(x | y):
def likelihood(x,mean,sigma):
    return np.exp(-(x-mean)**2/(2*sigma**2))*(1/(np.sqrt(2*np.pi)*sigma))

Building our Classifier

This is a factory function that takes in our input vectors (X) and outputs (Y) and calculates the figures we need to classify a pitch. As stated above, we’ll need prior probabilites as well as the mean and the standard deviation for our input feature vectors seperated by class value. The function returns another function. The returning function will only take in the input vectors, and it will return probabilities for each class as a dictionary of NumPy arrays.

def fit(X,Y):
    classes = np.unique(Y)
    num_predictors = X.shape[1]
    
    # params dictionary will store the values we need to predict
    params = {}
    for G in classes:
        class_params = []
        for i in range(num_predictors):
            feature_params = {}
            feature_params['prior'] = len(np.where(Y==G)[0])/len(Y)
            feature_params["mean"] = X[np.where(Y==G)][:,i].mean()
            feature_params["sigma"] = X[np.where(Y==G)][:,i].std()
            
            class_params.append(feature_params)
        params[G]=class_params
    
    # The function that is returned from the factory function retains scope. 
    # So it knows the values calculated above
    def predict(X_test):
        class_probabilites = {}
        for G in classes:
            liklihoods = np.empty(X_test.shape)
            for i in range(liklihoods.shape[1]):
                liklihoods[:,i] = likelihood(X_test[:,i],params[G][i]['mean'],params[G][i]['sigma'])
            out = np.prod(liklihoods,axis=1)
            out*=params[G][0]['prior']
            class_probabilites[G]=out
        return class_probabilites
                
    return predict
y = df['is_fastball'].to_numpy()
x = df[['endSpeed','spinRate']].to_numpy()
nb = fit(x,y)
nb
<function __main__.fit.<locals>.predict(X_test)>
predict = nb(x)
predict
{'fastball': array([2.27745304e-05, 7.18554093e-06, 6.92096463e-07, ...,
        9.13635997e-05, 9.01459308e-05, 1.26118320e-04]),
 'not fastball': array([5.06947230e-06, 2.62086247e-06, 8.21903785e-06, ...,
        1.17705429e-05, 1.24653868e-05, 1.38775678e-05])}
predicted_fastballs = predict['fastball']>predict['not fastball']
predicted_fastballs
array([ True,  True, False, ...,  True,  True,  True])

Evaluating the Results

First, we’ll use sklearn’s classification report to get a high level summary of how our classifier did when it was trained on the whole dataset.

from sklearn.metrics import classification_report
pd.DataFrame(classification_report(y == "fastball",predicted_fastballs,output_dict=True))
False True accuracy macro avg weighted avg
precision 0.921015 0.793438 0.850679 8.572267e-01 8.606868e-01
recall 0.783955 0.925057 0.850679 8.545062e-01 8.506793e-01
f1-score 0.846976 0.854207 0.850679 8.505918e-01 8.503957e-01
support 974122.000000 873880.000000 0.850679 1.848002e+06 1.848002e+06

It looks pretty good! But just to make sure, let’s split the data into train and test sets and compare the results versus sklearn.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
#split the data
X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=.3, random_state=0)

#fit
gnb = GaussianNB()
sk_fit = gnb.fit(X_train,y_train)
#predict
sk_y_pred = sk_fit.predict(X_test)

pd.DataFrame(classification_report(y_test,sk_y_pred,output_dict=True))
fastball not fastball accuracy macro avg weighted avg
precision 0.793633 0.920872 0.850783 0.857252 0.860725
recall 0.924814 0.784416 0.850783 0.854615 0.850783
f1-score 0.854217 0.847184 0.850783 0.850700 0.850508
support 262070.000000 292331.000000 0.850783 554401.000000 554401.000000
#fit
my_fit = fit(X_train,y_train)

#predict
my_y_pred = my_fit(X_test)
my_y_pred = my_y_pred['fastball']>my_y_pred['not fastball']

pd.DataFrame(classification_report(y_test=='fastball',my_y_pred,output_dict=True))
False True accuracy macro avg weighted avg
precision 0.920872 0.793633 0.850783 0.857252 0.860725
recall 0.784416 0.924814 0.850783 0.854615 0.850783
f1-score 0.847184 0.854217 0.850783 0.850700 0.850508
support 292331.000000 262070.000000 0.850783 554401.000000 554401.000000

Conclusion

Our homemade Bayes’ classifier returned the same result as sklearn, so I’m happy with this. It works reasonably well in this situation because our input features are roughly normally distributed, and there’s good separation between class values. Naive Bayes is also good choice in this particular situation because of how many records we have. Naive Bayes doesn’t take very long to produce results because we only need a few data points (mean, standard deviation, and prior probabilities) upon which to form our predictions.