
Real Time Text Classification Using Kafka and Scikit-learn

Vipul Vaibhaw

Artificial Intelligence / Machine Learning

Introduction:

Text classification is one of the essential tasks in supervised machine learning (ML). Assigning categories to text, which can be tweets, Facebook posts, web pages, library books, media articles, etc., has many applications such as spam filtering and sentiment analysis. In this blog, we build a text classification engine to classify topics in an incoming Twitter stream using Apache Kafka and scikit-learn, a Python-based machine learning library.

Let's dive into the details. Here is a diagram to explain the components and data flow visually. The Kafka producer ingests data from Twitter and sends it to the Kafka broker. The Kafka consumer asks the broker for the tweets. We convert the binary tweet stream from Kafka to human-readable strings and perform predictions using the saved models. We train the models on the Twenty Newsgroups dataset, which ships with scikit-learn and is a standard dataset for training classification algorithms.

Training the Classification Algorithm

In this blog we will use a multinomial Naive Bayes classifier, trained on bag-of-words and tf-idf features.

We have used the following libraries/tools:

  • tweepy - a Twitter library for Python
  • Apache Kafka
  • scikit-learn
  • pickle - Python's object serialization library

Let’s first understand the following key concepts:

  • Word to Vector Methodology (Word2Vec)
  • Bag-of-Words
  • tf-idf
  • Multinomial Naive Bayes classifier

Word2Vec methodology

One of the key problems in Natural Language Processing (NLP) is how to efficiently convert words into numeric vectors, which can then be given as input to machine learning models to perform predictions.

Neural networks, like most other machine learning models, are mathematical functions: they need numbers or vectors to churn out an output. (Tree-based methods are the exception; they can work on words directly.)

One family of approaches to this is known as Word2Vec. But first, consider the most trivial solution: the "one-hot" method, which converts each word into a sparse vector with only one element set to 1, the rest being zero.

For example, “the apple a day the good” would have the following representation:

(Figure: one-hot encoding of the example sentence)

Here we have transformed the above sentence into a 6×5 matrix: 6 words, and a vocabulary of size 5 since “the” is repeated. But what are we supposed to do when we have a gigantic dictionary to learn from, say more than 100,000 words? This is where one-hot encoding fails: the vectors become enormous and sparse, and the relationships between words are lost. For example, nothing in the encoding says that “Lanka” should come after “Sri”.
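
As a concrete illustration, here is a minimal sketch of the one-hot construction above; the alphabetical vocabulary ordering is our own assumption:

import numpy as np

sentence = "the apple a day the good".split()
vocab = sorted(set(sentence))            # ['a', 'apple', 'day', 'good', 'the']
index = {word: i for i, word in enumerate(vocab)}

# 6 words x 5 vocabulary entries -> a 6x5 matrix
one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, index[word]] = 1

print(one_hot)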

Here is where Word2Vec comes in. Our goal is to vectorize the words while maintaining the context. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. 
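
For readers who want to try Word2Vec itself, here is a hedged sketch using the gensim library (assuming gensim >= 4.0; gensim is not part of this post's pipeline, and the toy sentences are made up):

from gensim.models import Word2Vec

sentences = [["sri", "lanka", "is", "an", "island"],
             ["an", "apple", "a", "day", "keeps", "the", "doctor", "away"]]

# sg=0 selects the CBOW architecture, sg=1 selects skip-gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skip_gram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["lanka"][:5])  # a dense learned vector, unlike one-hot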

Tf-idf (term frequency–inverse document frequency)

Tf-idf is a statistic that measures how important a word is to a document in a given corpus. Variations of tf-idf are used by search engines, for text summarization, and so on. The intuition: a term's weight grows with its frequency in the document (tf) and shrinks the more documents it appears in across the corpus (idf).
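
A minimal sketch of the standard formulation, tf-idf(t, d) = tf(t, d) × log(N / df(t)); note that scikit-learn's TfidfTransformer uses a smoothed variant of this:

import math

docs = [["the", "apple", "a", "day"],
        ["the", "good", "doctor"],
        ["the", "quick", "fox"]]

def tf_idf_score(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return tf * math.log(len(docs) / df)     # rarer terms get higher weight

print(tf_idf_score("apple", docs[0], docs))  # > 0: "apple" appears in only one document
print(tf_idf_score("the", docs[0], docs))    # 0.0: "the" appears in every document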

Multinomial Naive Bayes classifier

The Naive Bayes classifier comes from a family of probabilistic classifiers based on Bayes' theorem. It is commonly used to classify spam vs. not spam, sports vs. politics, and so on. We are going to use it here to classify the incoming stream of tweets.
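
As a toy illustration of the idea (the texts and labels below are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash prize now", "meeting agenda attached",
         "cheap prize offer now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vect = CountVectorizer()
clf = MultinomialNB().fit(vect.fit_transform(texts), labels)

# Words like "cash", "offer", "now" only appear in spam examples,
# so this should come out as ['spam'].
print(clf.predict(vect.transform(["cash offer now"])))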

Let's see how these pieces fit together.


The data from the "20 newsgroups" dataset is entirely in text format. We cannot feed it directly to any model that does mathematical calculations; we have to extract features from the dataset and convert them to numbers a model can ingest to produce an output.
So, we use bag-of-words counts and tf-idf to extract features from the dataset, then feed them to the multinomial Naive Bayes classifier to get predictions.
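
The whole chain can also be expressed compactly with scikit-learn's Pipeline. This sketch is an illustrative alternative to the step-by-step functions below, not the code from the post's repository, and it uses a reduced category list:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset='train',
                           categories=["sci.med", "rec.motorcycles"])

# counts -> tf-idf -> multinomial Naive Bayes, chained as one estimator
clf = Pipeline([("counts", CountVectorizer()),
                ("tfidf", TfidfTransformer()),
                ("nb", MultinomialNB())]).fit(train.data, train.target)

print(train.target_names[clf.predict(["My bike needs a new exhaust"])[0]])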

1. Train Your Model

We are going to use this dataset. We create a new file and import the needed libraries: sklearn for ML and pickle to save the trained model. Now we define the model.

from __future__ import division, print_function, absolute_import
from sklearn.datasets import fetch_20newsgroups  # built-in dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import pickle
from kafka import KafkaConsumer  # used later, in the prediction step

# Defining the model and training it.
# See http://qwone.com/~jason/20Newsgroups/ for the full category list.
categories = ["talk.politics.misc", "misc.forsale", "rec.motorcycles",
              "comp.sys.mac.hardware", "sci.med", "talk.religion.misc"]

def fetch_train_dataset(categories):
    twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                      shuffle=True, random_state=42)
    return twenty_train

def bag_of_words(categories):
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(fetch_train_dataset(categories).data)
    # Save the fitted vocabulary so the prediction step can rebuild the vectorizer.
    pickle.dump(count_vect.vocabulary_, open("vocab.pickle", 'wb'))
    return X_train_counts

def tf_idf(categories):
    tf_transformer = TfidfTransformer()
    # Return both the fitted transformer and the transformed training matrix.
    return (tf_transformer, tf_transformer.fit_transform(bag_of_words(categories)))

def model(categories):
    clf = MultinomialNB().fit(tf_idf(categories)[1],
                              fetch_train_dataset(categories).target)
    return clf

model = model(categories)
pickle.dump(model, open("model.pickle", 'wb'))
print("Training Finished!")

2. The Kafka Tweet Producer

We have the trained model in place. Now let's get the real-time Twitter stream into Kafka. We define the producer.

# import required libraries
from kafka import SimpleProducer, KafkaClient
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from twitter_config import consumer_key, consumer_secret, access_token, access_token_secret
import json

Now we define the Kafka settings and create the KafkaPusher class. This is necessary because we need to forward the data coming from the tweepy stream to the Kafka producer.

# Kafka settings
topic = b'twitter-stream'

# Setting up the Kafka producer
kafka = KafkaClient('localhost:9092')
producer = SimpleProducer(kafka)

class KafkaPusher(StreamListener):
    def on_data(self, data):
        all_data = json.loads(data)
        tweet = all_data["text"]
        producer.send_messages(topic, tweet.encode('utf-8'))
        return True

    def on_error(self, status):
        print(status)

WORDS_TO_TRACK = ["Politics", "Apple", "Google", "Microsoft",
                  "Bikes", "Harley Davidson", "Medicine"]

if __name__ == '__main__':
    l = KafkaPusher()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    while True:
        try:
            stream.filter(languages=["en"], track=WORDS_TO_TRACK)
        except Exception:
            # Keep the stream alive across transient connection errors.
            pass
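
Note that recent releases of kafka-python have removed SimpleProducer and KafkaClient. If you are on a newer version, a minimal equivalent sketch of the producer call looks like this:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('twitter-stream', 'a tweet'.encode('utf-8'))
producer.flush()  # block until buffered messages are delivered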

Note - You need to start the Kafka server before running this script.

3. Loading your model for predictions

Now we have the trained model from step 1 and a Twitter stream from step 2. Let's use the model to do actual predictions. The first step is to load the model:

# Loading model and vocab
print("Loading pre-trained model")
vocabulary_to_load = pickle.load(open("vocab.pickle", 'rb'))
count_vect = CountVectorizer(vocabulary=vocabulary_to_load)
count_vect._validate_vocabulary()  # mark the loaded vocabulary as fitted
load_model = pickle.load(open("model.pickle", 'rb'))
# Reuse the transformer fitted on the training corpus for its idf weights.
tfidf_transformer = tf_idf(categories)[0]

Then we start the Kafka consumer and begin predictions:

# Predicting the streaming Kafka messages
consumer = KafkaConsumer('twitter-stream', bootstrap_servers=['localhost:9092'])
print("Starting ML predictions.")
for message in consumer:
    tweet = message.value.decode('utf-8')  # Kafka delivers raw bytes
    X_new_counts = count_vect.transform([tweet])
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = load_model.predict(X_new_tfidf)
    print(tweet + " => " + fetch_train_dataset(categories).target_names[predicted[0]])

Following are some of the classifications done by our model:

  • RT @amazingatheist: Making fun of kids who survived a school shooting just days after the event because you disagree with their politics is… => talk.politics.misc
  • sci.med
  • RT @DavidKlion: Apropos of that D'Souza tweet; I think in order to make sense of our politics, you need to understand that there are some t… => talk.politics.misc
  • RT @BeauWillimon: These students have already cemented a place in history with their activism, and they’re just getting started. No one wil… => talk.politics.misc
  • RT @byedavo: Cause we ain’t got no president => talk.politics.misc
  • RT @appleinsider: .@Apple reportedly in talks to buy cobalt, key Li-ion battery ingredient, directly from miners … => comp.sys.mac.hardware

Here is the link to the complete git repository.

Conclusion:

In this blog, we built a data pipeline that uses a Naive Bayes model to classify streaming Twitter data. The same approach can classify other sources of data, such as news articles and blog posts. Do let us know if you have any questions, queries, or additional thoughts in the comments section below.

Happy coding!
