Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments – Training & Testing



This content originally appeared on DEV Community and was authored by Steven Mathew

Now we are going to load the labeled data and split it into training and test sets so that we can check the model's accuracy.

import pandas as pd
df = pd.read_csv('labeled_reddit_comments.csv')

This line reads the previously saved CSV file (labeled_reddit_comments.csv) containing cleaned Reddit comments and their corresponding labels into a Pandas DataFrame (df).

Splitting Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(df['cleaned_comment'], df['label'], test_size=0.2, random_state=42)

Here, we split the data into two parts:
X_train and y_train: These variables contain 80% of the data (df['cleaned_comment'] and df['label']), which will be used for training the model.
X_test and y_test: These variables contain the remaining 20% of the data (test_size=0.2), which will be used to evaluate how well the trained model performs on new, unseen data.
Setting random_state=42 makes the split reproducible, so the same rows land in each set on every run.
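
To make the 80/20 split concrete, here is a minimal sketch on a tiny made-up DataFrame. The toy rows and the name toy_df are placeholders, not the real dataset, and the stratify argument is an assumption that does not appear in the original call; it is included only to show how class balance could be preserved in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for labeled_reddit_comments.csv: 10 rows, 2 balanced classes.
toy_df = pd.DataFrame({
    "cleaned_comment": [f"comment {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy_df["cleaned_comment"],
    toy_df["label"],
    test_size=0.2,            # 20% of the rows go to the test set
    random_state=42,          # same split on every run
    stratify=toy_df["label"], # assumption: not part of the original call
)

print(len(X_train), len(X_test))  # 8 2
print(sorted(y_test.tolist()))    # one example of each class
```

With stratification, each class keeps roughly the same proportion in the training and test sets, which can matter if sarcastic comments are much rarer than non-sarcastic ones.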

Creating a Pipeline with a Random Forest Classifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])

This sets up a pipeline (pipeline) that sequentially applies two steps to the data:
Step 1 ('tfidf', TfidfVectorizer()): Converts the text data into numerical TF-IDF (Term Frequency-Inverse Document Frequency) vectors. The vocabulary is learned from the training text and then reused to transform the test text.
Step 2 ('clf', RandomForestClassifier(random_state=42)): Trains a Random Forest classifier on the TF-IDF vectors. The random_state=42 ensures reproducibility of results.
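
The two steps can be seen in action with a small sketch. The four example sentences, their labels, and the name toy_pipeline are invented for illustration, and n_estimators=10 is chosen only to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical training examples (1 = sarcastic, 0 = not sarcastic is an assumed encoding).
texts = [
    "oh great another monday",
    "i love sunny days",
    "yeah right that will work",
    "the meeting is at noon",
]
labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# fit() runs the vectorizer first, then trains the forest on its output.
toy_pipeline.fit(texts, labels)

# Inspect the intermediate TF-IDF matrix: one row per comment, one column per term.
tfidf_step = toy_pipeline.named_steps["tfidf"]
tfidf_matrix = tfidf_step.transform(texts)
print(tfidf_matrix.shape)

# predict() applies the same fitted vectorizer before classifying.
prediction = toy_pipeline.predict(["oh great another meeting"])
print(prediction)
```

Because both steps live in one object, the raw strings can be passed straight to fit and predict without vectorizing them by hand.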

Defining Hyperparameters for Tuning

param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}

This dictionary (param_grid) specifies different hyperparameter values to explore during the grid search process:
'tfidf__max_features': Limits the number of features generated by TfidfVectorizer.
'clf__n_estimators', 'clf__max_depth', 'clf__min_samples_split', 'clf__min_samples_leaf': Parameters that control the behavior of the Random Forest classifier.
The prefix before the double underscore names the pipeline step ('tfidf' or 'clf') that the parameter belongs to.
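
It helps to know how much work this grid implies before starting the search. The sketch below counts the combinations with scikit-learn's ParameterGrid and multiplies by the 5 folds used in the next step; the variable names n_candidates, cv_folds, and total_fits are placeholders.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "tfidf__max_features": [10000, 20000, None],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 10],
    "clf__min_samples_split": [2, 5],
    "clf__min_samples_leaf": [1, 2],
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates)  # 3 * 2 * 2 * 2 * 2 = 48
print(total_fits)    # 48 candidates * 5 folds = 240 fits
```

Each extra value in any list multiplies the total, so the grid grows quickly as more options are added.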

Performing GridSearchCV for Hyperparameter Tuning

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, error_score='raise')
grid_search.fit(X_train, y_train)

Here, GridSearchCV is used to search for the best combination of hyperparameters (param_grid) for the pipeline (pipeline). It:
Divides the training data into 5 folds (cv=5) for cross-validation.
Uses accuracy (scoring='accuracy') as the metric to evaluate the performance of each combination of hyperparameters.
Prints progress messages (verbose=1) during the search process and raises an exception (error_score='raise') if any fit fails, instead of recording a placeholder score.
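
Once the search has been fitted, its results can be inspected through attributes such as best_params_, best_score_, and cv_results_. The following is a minimal sketch on invented data with a deliberately tiny grid (small_grid) and cv=2 so that it runs in seconds; none of these values come from the original experiment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical comments (1 = sarcastic, 0 = not sarcastic is an assumed encoding).
texts = [
    "oh great another monday", "i love sunny days",
    "yeah right that will work", "the meeting is at noon",
    "wow what a brilliant idea", "dinner was really tasty",
    "sure because that always helps", "the train leaves at nine",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# A much smaller grid than the one in the article, for speed only.
small_grid = {
    "clf__n_estimators": [5, 10],
    "clf__max_depth": [None, 3],
}

search = GridSearchCV(pipeline, small_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)               # winning combination
print(search.best_score_)                # its mean cross-validated accuracy
print(len(search.cv_results_["params"])) # number of candidates tried
```

Note that best_score_ is the mean accuracy across the validation folds, which is a separate number from the accuracy measured later on the held-out test set.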

Evaluating the Best Model

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

After the search finishes, grid_search.best_estimator_ (best_model) holds the pipeline refit on the full training set with the best hyperparameters. The code then evaluates this model on the test data (X_test and y_test) that was set aside earlier.

It:
Predicts labels (y_pred) for the test data.
Calculates and prints the accuracy score (accuracy_score) of the predictions compared to the actual labels (y_test).
Prints a detailed classification report (classification_report) showing precision, recall, F1-score, and support for each class (sarcasm and non-sarcasm).
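
To show what these metrics measure, here is a small sketch with eight made-up labels and predictions. The values in y_true and y_hat are invented, and the class names passed to target_names assume that 1 means sarcastic and 0 means not sarcastic. A confusion matrix is added as an extra view that the original code does not print.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical ground truth and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy is the fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_hat)
print(acc)  # 6 correct out of 8 = 0.75

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm.tolist())  # [[3, 1], [1, 3]]

# Per-class precision, recall, F1-score, and support.
report = classification_report(
    y_true, y_hat, target_names=["not sarcastic", "sarcastic"]
)
print(report)
```

Accuracy alone can hide weak performance on a rare class, so the per-class rows of the report are worth reading alongside it.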

After training and testing, I got an accuracy of 97%.

Testing with sample text
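
One way to try a fitted pipeline on sample text is sketched below. This is not the author's code: the helper classify_comments, the toy training data, and the label encoding (1 = sarcastic) are all assumptions made for illustration. In practice, best_model would take the place of toy_model, and new text would presumably need the same cleaning that was applied to the training comments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def classify_comments(model, comments):
    """Return (comment, predicted_label, probability) for each comment."""
    probabilities = model.predict_proba(comments)
    results = []
    for comment, row in zip(comments, probabilities):
        best = row.argmax()  # index of the most probable class
        results.append((comment, int(model.classes_[best]), float(row[best])))
    return results


# Hypothetical stand-in for the trained best_model.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_model.fit(
    ["oh great another monday", "i love sunny days",
     "yeah right that will work", "the meeting is at noon"],
    [1, 0, 1, 0],
)

samples = ["oh great another meeting", "the train is at noon"]
results = classify_comments(toy_model, samples)
for comment, label, probability in results:
    print(f"{comment!r} -> label {label} (p={probability:.2f})")
```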

Checking the top 5 comments on a Reddit post

GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments

