Naive Bayes¶
Naive Bayes is a classic. It can be implemented in a few lines of code with one library or another, or generated by ChatGPT in seconds. But wouldn't it be more interesting to write it from scratch?
Naive Bayes is based on Bayes' theorem, which is about 250 years old. How difficult is it actually to apply? How much time would it take to implement in plain vanilla code? Assuming we have a basic understanding of probability, and know that $P(A)$ is the probability of an event A and $P(A|B)$ is the probability of an event A given that event B happened, it should not be difficult.
Theory¶
Bayes' theorem says that $P(A|B) = P(B|A)*P(A)/P(B)$, which simply means that the probability of an event A given B can be calculated from the probability of B given A.
If we now think in terms of a spam filter, which is a classic application of Naive Bayes, this means that the probability that an email is spam given that a specific word appears in it can be calculated from the probability of that word being used in a spam message. That is very convenient, because the latter can simply be collected from the data, provided you have enough emails correctly marked as spam.
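As a tiny illustration, here is the theorem applied to invented counts (the numbers are made up purely to show the arithmetic, not taken from any real dataset):
p_spam = 40 / 100                # P(S): 40 of 100 emails are spam
p_word_given_spam = 20 / 40      # P("winner" | S): the word appears in 20 spam emails
p_word = (20 + 5) / 100          # P("winner"): it appears in 25 emails overall

# Bayes' theorem: P(S | "winner") = P("winner" | S) * P(S) / P("winner")
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)         # 0.8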
Bayes' theorem is the foundation, but it cannot be used directly. There are a few little tricks we need before we can implement it.
Prediction¶
In order to predict whether something belongs to class S or not (an email is spam or ham), based on events $w$ (words used), we need to check whether:
$P( S | w_{1}, w_{2},...,w_{n}) > P( H | w_{1}, w_{2},...,w_{n})$, where S means spam and H means ham.
From Bayes' theorem this is the same as:
$$P( w_{1}, w_{2},...,w_{n} | S) * P(S) / P (w_{1}, w_{2},...,w_{n}) > P( w_{1}, w_{2},...,w_{n} | H) * P(H) / P (w_{1}, w_{2},...,w_{n})$$
But since the denominator $P(w_{1}, w_{2},...,w_{n})$ is the same on both sides, this reduces to:
$$P( w_{1}, w_{2},...,w_{n} | S) * P(S) > P( w_{1}, w_{2},...,w_{n} | H) * P(H)$$
It would still be cumbersome to calculate, because of the various combinations of $w_{n}$ and the need for $P(S)$ and $P(H)$. So we make it Naive, with two simplifying assumptions.
First we assume that words occurring in a message are independent events. This gives us:
$$P(S) * P(w_{1}|S) * P(w_{2}|S) * ... * P(w_{n}|S) > P(H) * P(w_{1}|H) * P(w_{2}|H) * ... * P(w_{n}|H)$$
and then we assume that spam and ham are equally likely, i.e. $P(S) = P(H)$, so the priors cancel out, which leads to:
$$P(w_{1}|S) * P(w_{2}|S) * ... * P(w_{n}|S) > P(w_{1}|H) * P(w_{2}|H) * ... * P(w_{n}|H)$$
In other words, we only need to know the frequencies of words occurring in spam and ham messages, something we can collect from the data.
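As a rough sketch of that counting step (the message list and the whitespace tokenization below are placeholders for whatever data and tokenizer you actually use):
from collections import Counter

# (message, is_spam) pairs -- placeholder data for illustration
messages = [("win a free prize now", True), ("lunch at noon tomorrow", False)]

spam_counts, ham_counts = Counter(), Counter()
for text, is_spam in messages:
    counts = spam_counts if is_spam else ham_counts
    counts.update(text.lower().split())   # naive whitespace tokenization

# Relative frequency of a word among all spam words, e.g. P("free" | S)
p_free_given_spam = spam_counts["free"] / sum(spam_counts.values())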
Then, to make the calculations stable and easier to code, we apply two tricks. What we need is to avoid zeros and very small values; both are illustrated in the short sketch after these two points.
Zeros: Imagine $P(w_{n} | S) = 0$ for some $w_{n}$; then the whole product would be 0. With a lot of events (words), this would happen often enough to make our classifier useless. To avoid it, we make sure every word gets $+1$ added to its number of occurrences when counting.
Very Small Values: A product of very small values (and $P(w_{n} | S)$ for a single word can be really small) could get too small for our computer to handle. So instead of a product of probabilities we take a sum of logarithms of probabilities. We can do this because $ln(A * B) = ln(A) + ln(B)$, and since the logarithm is monotonic, the comparison is preserved.
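A minimal sketch of both tricks, with invented toy counts and probabilities chosen only to make the problems visible:
import math
from collections import Counter

# Toy counts for illustration: "free" never occurs in ham
spam_counts = Counter({"free": 3, "prize": 2})
ham_counts = Counter({"lunch": 4, "meeting": 1})
vocabulary = set(spam_counts) | set(ham_counts)

# Trick 1: add +1 to every word so no probability is ever exactly zero
smoothed_ham = {word: ham_counts[word] + 1 for word in vocabulary}
p_free_given_ham = smoothed_ham["free"] / sum(smoothed_ham.values())  # small, but > 0

# Trick 2: a product of many tiny probabilities underflows, a sum of logs does not
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -- underflowed to zero
print(sum(math.log(p) for p in probs))      # about -1151.3 -- perfectly representable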
And now we have something that is ready to be implemented easily:
$$ln(P(w_{1}|S)) + ln(P(w_{2}|S)) + ... + ln(P(w_{n}|S)) > ln(P(w_{1}|H)) + ln(P(w_{2}|H)) + ... + ln(P(w_{n}|H))$$
where $P(w_{x}|S)$ is the probability of a given word appearing in a spam message, and $P(w_{x}|H)$ the same for a ham message. Both are just a matter of counting occurrences in the collected data.
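Put together, the whole decision rule fits in a few lines. This is only a sketch, assuming the smoothed word probabilities have already been computed as plain dictionaries; words never seen in training are simply skipped here:
import math

def looks_like_spam(words, p_word_spam, p_word_ham):
    # Compare sums of log-probabilities instead of products of probabilities
    spam_score = sum(math.log(p_word_spam[w]) for w in words if w in p_word_spam)
    ham_score = sum(math.log(p_word_ham[w]) for w in words if w in p_word_ham)
    return spam_score > ham_score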
The Code¶
To keep the focus, we split the code into data handling (test data preparation) and the actual filter.
Test Data Generation¶
The data could be downloaded from the web or exported from some system you are interested in. What is important here is that the format matches our NB filter.
I chose to generate very simple test data. Since it is not the core of the story here, I let AI generate the code of those three functions and put them in a separate class. It makes use of faker, so you would need to install it first.
import random
from datetime import datetime, timedelta

from faker import Faker


class DataGen:
    _fake = Faker()

    def generate_spam_message(self):
        spam_templates = [
            "Congratulations! You've won a ${} gift card. Click here: {}",
            "Earn $$$ working from home. Limited spots available. Apply now: {}",
            "You have a pending transaction of ${}. Claim it now: {}",
            "Your account has been compromised. Verify now: {}",
            "Get the best deals on {}. Don't miss out!"
        ]
        # Randomize placeholders
        template = random.choice(spam_templates)
        return template.format(
            random.randint(10, 1000),  # Random dollar amount
            self._fake.url(),          # Random URL
            self._fake.word()          # Random product/service
        )

    def generate_ham_message(self):
        ham_templates = [
            "Hey {}, are we still on for {}?",
            "Can you send me the files by {}?",
            "Meeting rescheduled to {}. Let me know if it works for you.",
            "Happy birthday, {}! Have a great day!",
            "Thanks for your help with the {} project."
        ]
        # Randomize placeholders
        template = random.choice(ham_templates)
        return template.format(
            self._fake.first_name(),                          # Random name
            self._fake.date_time(None, timedelta(days=14)),   # Random date/time
            self._fake.word()                                 # Random project/task
        )

    def generate_data(self, num_spam=50, num_ham=50):
        data = []
        # Generate spam messages
        for _ in range(num_spam):
            data.append((self.generate_spam_message(), True))
        # Generate ham messages
        for _ in range(num_ham):
            data.append((self.generate_ham_message(), False))
        # Shuffle the data
        random.shuffle(data)
        return data
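A quick way to eyeball the generated data (the exact messages differ on every run, since the templates and the faker values are randomized):
gen = DataGen()
sample = gen.generate_data(num_spam=3, num_ham=3)
for message, is_spam in sample:
    print("SPAM" if is_spam else "HAM ", message)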
Naive Bayes Filter¶
Finally, we come to the filter itself: the implementation of the Naive Bayes filter. Seeking the answer to the original question, how easy would it be to build without an ML library, we implement it in plain vanilla code. It uses the nltk package to make tokenization of a message easier, but the rest is just code, without any fancy ML libraries.
We end up with two methods: one to train the filter and one to check whether a given text is spam.
Training the filter means counting occurrences of each word in spam and ham messages and calculating its probability.
Prediction (is_spam check) compares the probabilities according to the formula above.
import math

import nltk
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize as wt
from nltk.probability import FreqDist as fd

# Download necessary NLTK resources (tokenizer model and stop word list)
nltk.download('punkt')
nltk.download('stopwords')
class NBFilter:
    _spam_freqs = {}
    _ham_freqs = {}
    _swords = set(sw.words('english'))

    def train(self, a_train_data):
        my_all_words = set()
        my_spam_words = {}
        my_ham_words = {}
        for my_tmp_mesg in a_train_data:
            # tokenize, lowercase, drop stop words and non-alphanumeric tokens
            my_wordlist = [_.lower() for _ in wt(my_tmp_mesg[0]) if _.lower() not in self._swords and _.isalnum()]
            my_freq_dist = fd(my_wordlist)
            for my_tmp_freq in my_freq_dist.keys():
                my_all_words.add(my_tmp_freq)
                if my_tmp_mesg[1]:
                    # start the count at 1 -- the "+1" trick from the theory section
                    if my_tmp_freq not in my_spam_words:
                        my_spam_words[my_tmp_freq] = 1
                    my_spam_words[my_tmp_freq] = my_spam_words[my_tmp_freq] + my_freq_dist[my_tmp_freq]
                else:
                    if my_tmp_freq not in my_ham_words:
                        my_ham_words[my_tmp_freq] = 1
                    my_ham_words[my_tmp_freq] = my_ham_words[my_tmp_freq] + my_freq_dist[my_tmp_freq]
        # turn counts into (pseudo-)probabilities by dividing by the vocabulary size
        self._spam_freqs = {k: v / len(my_all_words) for k, v in my_spam_words.items()}
        self._ham_freqs = {k: v / len(my_all_words) for k, v in my_ham_words.items()}

    def is_spam(self, a_body):
        my_tokens = [_.lower() for _ in wt(a_body) if _.lower() not in self._swords and _.isalnum()]
        my_spam_p, my_ham_p = None, None
        # sum log-probabilities instead of multiplying probabilities (the log trick above)
        for my_tmp_word in my_tokens:
            if my_tmp_word in self._spam_freqs:
                if my_spam_p is None:
                    my_spam_p = math.log(self._spam_freqs[my_tmp_word])
                else:
                    my_spam_p = my_spam_p + math.log(self._spam_freqs[my_tmp_word])
            if my_tmp_word in self._ham_freqs:
                if my_ham_p is None:
                    my_ham_p = math.log(self._ham_freqs[my_tmp_word])
                else:
                    my_ham_p = my_ham_p + math.log(self._ham_freqs[my_tmp_word])
        # fall back to a very small value if no known word was found
        if my_spam_p is None:
            my_spam_p = math.log(0.00000001)
        if my_ham_p is None:
            my_ham_p = math.log(0.00000001)
        return my_spam_p > my_ham_p
Putting it all together¶
There is not much to comment here.
What I personally find interesting is that this very naive implementation of Naive Bayes, with minimum effort and maximum understanding, provides surprisingly good accuracy.
from nbfilter import NBFilter
from data_gen import DataGen

# Generating train and test data
my_data_gen = DataGen()
my_train_data = my_data_gen.generate_data()
my_test_data = my_data_gen.generate_data()

# Creating and training a filter
my_filter = NBFilter()
my_filter.train(my_train_data)

# Applying the NB filter message by message and counting correct predictions
my_cnt_ok = 0
for my_tmp_mesg in my_test_data:
    if my_tmp_mesg[1] == my_filter.is_spam(my_tmp_mesg[0]):
        my_cnt_ok += 1

print(f"accuracy: {my_cnt_ok/len(my_test_data)}")