This content originally appeared on DEV Community and was authored by Ziad Alezzi
Before we jump into the data and code, I want to first clear up what even IS “Sentiment Analysis”. If you don't know what it means, it sounds like fancy jargon. If you do know what it means, it can feel useless.
So to clear up:
What is Sentiment Analysis?
It means using a Language Model to analyze text (usually reviews) and deduce whether the message is positive or negative.
Why use Sentiment Analysis?
It allows you to gain insight into public opinion or customer feedback at a large scale. Super useful when you have a product you're trying to sell.
Writing a Sentiment Model using Transformers
Here’s an overview of all the libraries used in this implementation:
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
Before we get to any code, we need to understand the magical data that will allow us to fine-tune a language model to do what we want.
The data consists of Twitter posts discussing different video games. Some of them are happy (they loved the game), some were pissed (they hated the game), and some were neutral.
Here’s an example of two rows of data:
Phew! One person’s such a fan they made a whole wallpaper, while the other’s day is flat out ruined!!
I always LOVE using Excel to visualize my data and play around with it. Some labels in this dataset were “Irrelevant”, so I used Excel's Find feature to find and delete every row with that label. I also used Excel to convert the labels from text to numbers (Negative → -1 | Neutral → 0 | Positive → 1)
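If you'd rather skip Excel, here's a rough pandas equivalent of that cleanup (the column name labels is assumed from the dataset features shown later). One caveat worth flagging: Transformers' Trainer expects class indices in the range 0..num_labels-1, so when training with three labels a 0/1/2 mapping is the safer choice:

import pandas as pd

df = pd.read_csv('data/training.csv').dropna()

# Drop every row labelled "Irrelevant" (the Excel find-and-delete step).
df = df[df['labels'] != 'Irrelevant']

# Convert the text labels to numbers. Trainer expects class indices
# starting at 0, so 0/1/2 is safer than -1/0/1 here.
df['labels'] = df['labels'].map({'Negative': 0, 'Neutral': 1, 'Positive': 2})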
So the data contains 3 columns:
- Game (The name of the game the Twitter post is referring to)
- Text (The content of the Twitter post)
- Labels (The sentiment: Positive, Negative, or Neutral)
Using the pandas library, we can turn a CSV file into a DataFrame.
Code:
df = pd.read_csv('data/training.csv').dropna()
df.describe(include=object)
Output:
| | Game | Text |
|---|---|---|
| count | 61121 | 61121 |
| unique | 32 | 57294 |
| top | TomClancysGhostRecon | |
| freq | 2297 | 139 |
With 60,000+ rows, this dataset is HUGE!! Which is perfect for an NLP task, but not so much for my poor GPU. So in training I instead took a smaller subset of 5,000 shuffled rows.
Next, we make an extra column in the df that will be the actual input to the model.
df['input'] = 'Game: ' + df.Game + '; Text: ' + df.Text + ';'
This allows you to format your input like this:
'Game: Borderlands; Text: I am coming to the borders and I will kill you all,;'
Of course, Transformers takes a Dataset object as input, not a DataFrame.
--> ds = Dataset.from_pandas(df)
Dataset({
features: ['Game', 'labels', 'Text', 'input', '__index_level_0__'],
num_rows: 61121
})
Using a pretrained model
Every deep learning model is just a big fancy math function. And of course, you can't do math on text. (Spoiler: YOU NEED NUMBERS!)
So how do you convert text to numbers?
Using tokenization, of course! (ah.. what's that?)
Tokenization means cutting up text into smaller, more digestible pieces.
This could mean cutting up the text by words (each token is a word), or by sub-words (each token is a word fragment).
After tokenization comes numericalization (yay, more jargon).
Numericalization means turning those tokens into numbers. It does this with a really big dictionary of words, where the number representing each word/token is its index in the dictionary.
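Here's a toy sketch of both steps (hedged and heavily simplified; real tokenizers use sub-word vocabularies with tens of thousands of entries, but the idea is the same):

# Toy example: tokenization cuts text into pieces, numericalization
# looks each piece up in a vocabulary ("the really big dictionary").
vocab = {'i': 0, 'love': 1, 'this': 2, 'game': 3}

def tokenize(text):
    return text.lower().split()        # word-level tokenization

def numericalize(tokens):
    return [vocab[t] for t in tokens]  # index lookup in the dictionary

--> numericalize(tokenize('I love this game'))
[0, 1, 2, 3]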
For this NLP task, we'll use a pretrained language model by Microsoft called DeBERTa.
Here I used the xsmall model, which contains 22 million parameters. I tried using the large (300 million parameters) and base (100+ million parameters) variants, but kept getting CUDA out-of-memory errors.
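(If you want to try the bigger checkpoints anyway, one common workaround, sketched below with assumed numbers, is to shrink the per-device batch and accumulate gradients so the effective batch size stays the same. This uses the same TrainingArguments class we'll meet in the training section.)

# Hypothetical OOM workaround for the larger DeBERTa checkpoints:
# a batch of 4 with 4 accumulation steps acts like a batch of 16,
# but keeps far fewer activations in GPU memory at once.
args = TrainingArguments(
    'outputs',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
)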
tokz = AutoTokenizer.from_pretrained('microsoft/deberta-v3-xsmall')
Here we used AutoTokenizer from the Transformers library, which fetches the tokenizer that matches the pretrained model. Here's an example of what a tokenized sentence looks like:
--> tokz.tokenize(df['input'][1])
['▁Game',
':',
'▁Borderlands',
';',
'▁Text',
':',
'▁I',
'▁am',
'▁coming',
'▁to',
'▁the',
'▁borders',
'▁and',
'▁I',
'▁will',
'▁kill',
'▁you',
'▁all',
',',
';']
Now we write a basic tokenization function and map it to the entire dataset.
def tok_func(x): return tokz(x["input"], truncation=True, padding=True, max_length=512)
tok_ds = ds.map(tok_func, batched=True)
This gives us new columns to work with, the most important being input_ids (the tokenizer also adds token_type_ids and attention_mask).
This will show you how tokenization really works:
--> tok_ds[1]["input"], tok_ds[1]["input_ids"][0:10]
('Game: Borderlands; Text: I am coming to the borders and I will kill you all,;',
 [3179, 294, 72459, 346, 7655, 294, 273, 481, 882])
Observe an input alongside its corresponding IDs, and something becomes clear.
Focus on the first word in the input, “Game”, and the first number in the input_ids, “3179”.
Hmmm, now let's look up the vocabulary entry for the tokenized word “Game”..
--> tokz.vocab[tokz.tokenize("Game")[0]]
3179
I hope you just had your very own Eureka moment.
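The lookup works in reverse too, via the tokenizer's convert_ids_to_tokens helper. Matching the IDs against the outputs above:

--> tokz.convert_ids_to_tokens([3179, 294, 72459])
['▁Game', ':', '▁Borderlands']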
For the last step before actually training the model, we’ll split our data into training and test sets.
--> dds = tok_ds.train_test_split(0.25, seed=42)
DatasetDict({
train: Dataset({
features: ['Game', 'labels', 'Text', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 45840
})
test: Dataset({
features: ['Game', 'labels', 'Text', 'input', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 15281
})
})
Creating the Training Arguments, and the Trainer
Now's where it gets serious, so get ready for a lot of jargon!!
batch_size = 16
epochs = 4
lr = 1e-4
arg = TrainingArguments(
'outputs',
learning_rate=lr,
warmup_ratio=0.1,
lr_scheduler_type='cosine',
fp16=True,
evaluation_strategy='epoch',
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size*2,
num_train_epochs=epochs,
weight_decay=0.01,
report_to='none'
)
model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-xsmall', num_labels=3)
trainer = Trainer(
model,
arg,
train_dataset=dds['train'].shuffle(seed=42).select(range(5000)),
eval_dataset=dds['test'].shuffle(seed=42).select(range(1000)),
tokenizer=tokz
)
trainer.train()
A lot of this is boilerplate code, so to understand what's going on, there are three steps:
- Creating the Training Arguments (the batch size, learning rate, and number of epochs are the only things you really adjust; the rest is mostly boilerplate)
- Creating the Model (using the Transformers library, we load the Microsoft model into a variable)
- Creating the Trainer (simply pass in the Model variable, the Training Arguments, the Datasets, and the Tokenizer we loaded earlier)
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.859168 |
| 2 | 0.894000 | 0.744544 |
| 3 | 0.894000 | 0.714117 |
| 4 | 0.588000 | 0.779513 |
Of course, if I had let it train a bit longer it would likely have performed better, but I'm content with this for experimental purposes.
Our model got an overall accuracy of 74% 😀
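(A side note: the Trainer setup above only logs losses. To have accuracy reported after each epoch, you can pass in a small compute_metrics function; here's a sketch that mirrors the Trainer arguments used earlier:)

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)           # highest-scoring class per row
    return {'accuracy': (preds == labels).mean()}

trainer = Trainer(
    model,
    arg,
    train_dataset=dds['train'].shuffle(seed=42).select(range(5000)),
    eval_dataset=dds['test'].shuffle(seed=42).select(range(1000)),
    tokenizer=tokz,
    compute_metrics=compute_metrics,             # accuracy now appears in eval logs
)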
For fun, I decided to make an Excel sheet showing the Twitter posts alongside my model's prediction and the actual label. Let's have some fun with these!
A lot of the time, my model was spot on
Sometimes my model was actually smarter than the mislabeled data!
I barely understand this Twitter user.. But it's so obvious he's damn pissed!
Of course, my model also made some mistakes (awww, my poor baby's learning)
But c'mon man.. How is THAT positive?!!
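For the curious: a sheet like that can be built straight from trainer.predict. Here's a sketch (the column and file names are my own):

import numpy as np

# Run the trained model over the same held-out subset used for evaluation.
eval_ds = dds['test'].shuffle(seed=42).select(range(1000))
preds = trainer.predict(eval_ds)
pred_labels = np.argmax(preds.predictions, axis=-1)

# Line up each tweet with its true and predicted label, then open the CSV in Excel.
sheet = pd.DataFrame({
    'Text': eval_ds['Text'],
    'Actual': eval_ds['labels'],
    'Predicted': pred_labels,
})
sheet.to_csv('predictions.csv', index=False)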
Conclusion
NLP is a field of Deep Learning with uses in almost every domain! In fact, if you've heard even a little about “ChatGPT”, you'll realize that NLP has already taken over your entire life ;]
The “GPT” in ChatGPT itself stands for “Generative Pretrained Transformer”.
It’s a Pretrained model that Generates text using Transformers (No mom, not the “Bumblebee” transformers)
They all work the same way; the only real difference is the layout of your data (and how many GPUs you have..)
Thank you for reading through my nerdy little program, and as a man from CS50 once said:
This Was NLP