This content originally appeared on DEV Community and was authored by Steven Mathew
Next, we split the data into training and test sets so that we can check the model's accuracy on comments it has not seen.
import pandas as pd
df = pd.read_csv('labeled_reddit_comments.csv')
This line reads the previously saved CSV file (labeled_reddit_comments.csv) containing cleaned Reddit comments and their corresponding labels into a Pandas DataFrame (df).
Splitting Data into Training and Testing Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_comment'], df['label'], test_size=0.2, random_state=42)
Here, we split the data into two parts:
X_train and y_train: These variables contain 80% of the data (taken from df['cleaned_comment'] and df['label']), which will be used for training the model.
X_test and y_test: These variables contain the remaining 20% of the data, which will be used to evaluate how well the trained model performs on new, unseen data.
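To make the 80/20 proportion concrete, here is a minimal sketch on a made-up ten-row DataFrame. The name toy_df and its contents are assumptions standing in for the real CSV; only the column names and the train_test_split arguments come from the code above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for labeled_reddit_comments.csv: 10 rows, two columns.
toy_df = pd.DataFrame({
    "cleaned_comment": [f"comment {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy_df["cleaned_comment"], toy_df["label"], test_size=0.2, random_state=42
)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(len(X_train), len(X_test))  # 8 2

# Comments and labels are shuffled together, so their indices still line up.
print(X_train.index.equals(y_train.index))  # True
```

Because random_state is fixed, rerunning the split selects the same rows each time.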
Creating a Pipeline with a Random Forest Classifier
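from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=42))
])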
This sets up a pipeline (pipeline) that sequentially applies two steps to the data:
Step 1 ('tfidf', TfidfVectorizer()): Converts the text data (X_train and X_test) into numerical TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
Step 2 ('clf', RandomForestClassifier(random_state=42)): Trains a Random Forest classifier on the TF-IDF vectors. Setting random_state=42 makes the results reproducible.
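The two steps can be watched in action on a handful of made-up comments. This is only an illustrative sketch: the texts and labels below are invented, and treating 1 as "sarcastic" is an assumption, since the post does not say how the labels are encoded.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up comments and labels (assumed: 1 = sarcastic, 0 = not), for illustration only.
texts = [
    "oh great another monday",
    "i love this song",
    "wow what a totally brilliant idea",
    "the weather is nice today",
]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# fit() fits and applies the vectorizer, then trains the forest on the result.
pipeline.fit(texts, labels)

# The fitted vectorizer is reachable by its step name.
tfidf_matrix = pipeline.named_steps["tfidf"].transform(texts)
print(tfidf_matrix.shape)  # (number of comments, vocabulary size)

# predict() reuses the already-fitted vocabulary before calling the classifier.
predictions = pipeline.predict(["what a nice idea"])
print(predictions)
```

Wrapping both steps in one object means raw strings can be passed to fit and predict directly, and the vectorizer is only ever fitted on training text.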
Defining Hyperparameters for Tuning
param_grid = {
    'tfidf__max_features': [10000, 20000, None],
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 10],
    'clf__min_samples_split': [2, 5],
    'clf__min_samples_leaf': [1, 2]
}
This dictionary (param_grid) specifies different hyperparameter values to explore during the grid search process:
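'tfidf__max_features': Limits the number of features (vocabulary terms) generated by TfidfVectorizer; None keeps all of them.
'clf__n_estimators', 'clf__max_depth', 'clf__min_samples_split', 'clf__min_samples_leaf': Parameters that control the behavior of the Random Forest classifier.
Each key is the pipeline step name, a double underscore, and then the parameter name.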
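One way to sanity-check the grid before launching a long search is to confirm that each key is a real pipeline parameter and to count the combinations. The sketch below does this with ParameterGrid; the variable names valid_names and n_combinations are introduced here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "tfidf__max_features": [10000, 20000, None],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 10],
    "clf__min_samples_split": [2, 5],
    "clf__min_samples_leaf": [1, 2],
}

# Every key must be a valid pipeline parameter named <step>__<parameter>.
valid_names = pipeline.get_params().keys()
print(all(name in valid_names for name in param_grid))  # True

# 3 * 2 * 2 * 2 * 2 = 48 combinations; with 5-fold cross-validation that is 240 fits.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations, n_combinations * 5)  # 48 240
```

A misspelled key (for example, a single underscore) would fail the first check here instead of failing partway into the search.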
Performing GridSearchCV for Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, error_score='raise')
grid_search.fit(X_train, y_train)
Here, GridSearchCV is used to search for the best combination of hyperparameters (param_grid) for the pipeline (pipeline). It:
Divides the training data into 5 folds (cv=5) for cross-validation, so each combination is trained on four folds and scored on the fifth, five times over.
Uses accuracy (scoring='accuracy') as the metric to evaluate the performance of each combination of hyperparameters.
Prints progress messages (verbose=1) during the search and raises an exception (error_score='raise') if any fit fails, instead of recording a placeholder score.
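After the search finishes, the winning combination and its cross-validated score can be inspected directly. The sketch below runs a deliberately tiny grid on made-up comments so that it finishes in seconds; the data, the small_grid values, and cv=3 are assumptions chosen for speed, not the settings used in the post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up comments, 6 per class (assumed: 1 = sarcastic, 0 = not).
sarcastic = [
    "oh great another bug", "wow great job breaking it", "oh sure that will work",
    "great another meeting", "wow so helpful thanks", "oh great it crashed again",
]
sincere = [
    "thanks this fixed my problem", "the update works well", "good explanation of the issue",
    "this library is useful", "the docs helped me", "nice clean solution",
]
texts = sarcastic + sincere
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Deliberately tiny grid: 2 * 2 = 4 combinations.
small_grid = {"clf__n_estimators": [10, 20], "clf__max_depth": [None, 5]}

grid_search = GridSearchCV(
    pipeline, small_grid, cv=3, scoring="accuracy", error_score="raise"
)
grid_search.fit(texts, labels)

print(grid_search.best_params_)  # the winning combination
print(grid_search.best_score_)   # its mean cross-validated accuracy
print(len(grid_search.cv_results_["params"]))  # one entry per combination tried
```

Note that best_score_ is measured on the validation folds of the training data, so it is a separate number from the test-set accuracy.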
Evaluating the Best Model
from sklearn.metrics import accuracy_score, classification_report
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
After the grid search, best_model holds the pipeline refit on the full training set with the best hyperparameters found. The code then evaluates this model on the test data (X_test and y_test) that was set aside earlier.
It:
Predicts labels (y_pred) for the test data.
Calculates and prints the accuracy score (accuracy_score) of the predictions compared to the actual labels (y_test).
Prints a detailed classification report (classification_report) showing precision, recall, F1-score, and support for each class (sarcasm and non-sarcasm).
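To show how these metrics relate, here is a small worked sketch on eight hand-written labels. The arrays y_true and y_hat are made up, and mapping 1 to "sarcasm" is an assumption about the label encoding.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hand-written example labels (assumed: 1 = sarcasm, 0 = non-sarcasm).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 1, 0, 0, 0, 0, 1, 1]

# 6 of the 8 predictions match, so accuracy is 0.75.
accuracy = accuracy_score(y_true, y_hat)
print(f"Accuracy: {accuracy:.2f}")

# output_dict=True returns the same numbers as the printed report, as a nested dict.
report = classification_report(
    y_true, y_hat, target_names=["non-sarcasm", "sarcasm"], output_dict=True
)

# Sarcasm: 3 of 4 predicted positives are right (precision),
# and 3 of 4 actual positives are found (recall).
print(report["sarcasm"]["precision"], report["sarcasm"]["recall"])
print(report["sarcasm"]["support"])  # number of true sarcasm examples
```

The per-class rows are worth reading alongside accuracy, since a single accuracy figure can hide weak recall on the less frequent class.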
After training and testing, I got an accuracy of 97%.
Testing with sample text
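A fitted pipeline accepts raw strings, so sample text can be scored in a couple of lines. The sketch below is an assumption about how that might look, not the code used in the post: model is a small stand-in for best_model trained on made-up comments, and score_comments is a hypothetical helper. New text would presumably need the same cleaning that produced cleaned_comment before being passed in.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Stand-in for best_model: a small pipeline fitted on made-up comments
# (assumed: 1 = sarcastic, 0 = not).
train_texts = [
    "oh great another bug", "wow great job breaking it", "oh great it crashed again",
    "thanks this fixed my problem", "the update works well", "nice clean solution",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
]).fit(train_texts, train_labels)


def score_comments(fitted_model, comments):
    """Return (comment, predicted label, probability of that label) per comment."""
    predicted = fitted_model.predict(comments)
    # The highest class probability belongs to the predicted label.
    probabilities = fitted_model.predict_proba(comments).max(axis=1)
    return list(zip(comments, predicted, probabilities))


samples = ["oh great it crashed again", "thanks this fixed my problem"]
results = score_comments(model, samples)
for comment, label, probability in results:
    print(f"{label} ({probability:.2f}): {comment}")
```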
Checking on the top 5 comments on a post on Reddit
GitHub: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments


