This content originally appeared on DEV Community and was authored by Prashant Nigam
Data is the foundation of any successful AI model. In this part, we’ll explore how to create, format, and prepare high-quality training data that will make our email sentiment classifier incredibly accurate.
Why Data Quality Matters More Than Model Size
Here’s a truth that might surprise you: A smaller model trained on high-quality, domain-specific data can outperform a massive general-purpose model on specific tasks.
Think of it this way: would you rather have a Swiss Army knife or a scalpel for surgery? General models are Swiss Army knives – versatile but not optimized. Fine-tuned models are scalpels – precise tools for specific jobs.
Understanding Language Model Training Formats
Language models learn by predicting the next piece of text. For fine-tuning, we need to show them examples of the exact conversations we want them to have.
The Anatomy of a Training Example
Every training example teaches the model a specific pattern. For our email sentiment classifier, each example shows:
- The Question: “What’s the sentiment of this email?”
- The Context: The actual email content
- The Expected Answer: The correct sentiment classification
Here’s what this looks like in practice:
{
"prompt": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment:",
"completion": " positive"
}
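Notice the space before “positive” in the completion. The prompt ends at “Sentiment:” with no trailing space, so the completion supplies the separator, and many tokenizers encode a leading space together with the word that follows it.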
Chat Templates: Teaching Models to Converse
Modern language models use chat templates to understand conversation structure. Think of them as formatting rules that help the model distinguish between:
- User messages (questions/prompts)
- Assistant messages (responses)
- System messages (instructions)
Understanding the SmolLM2 Chat Template
Our base model (SmolLM2-1.7B-Instruct) uses this chat template:
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{assistant_response}
<|im_end|>
The <|im_start|> and <|im_end|> tokens are special markers that help the model understand who’s speaking.
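To make the structure concrete, here is a minimal sketch that renders a list of messages into this layout by hand. The render_chatml helper is hypothetical (it is not part of any library), and it assumes the common ChatML convention in which <|im_end|> directly follows the message content. In practice the authoritative template ships with the model's tokenizer, so treat this only as an illustration of what the markers do.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False) -> str:
    """Render messages in a ChatML-style layout (illustrative helper only)."""
    parts = []
    for message in messages:
        # Each turn: start marker, role, newline, content, end marker, newline
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


example = render_chatml([
    {"role": "user", "content": "Classify the sentiment of this email."},
    {"role": "assistant", "content": "positive"},
])
print(example)
```

Passing add_generation_prompt=True ends the string with an open assistant turn, which is the shape a prompt would take at inference time.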
Why Chat Templates Matter
Without proper formatting, models get confused about who’s saying what. It’s like having a conversation without knowing when each person starts and stops talking. Chat templates provide this crucial structure.
Creating High-Quality Training Data
Let’s build our email sentiment dataset step by step. We’ll create examples that cover the full range of scenarios our model might encounter.
Step 1: Define Our Classification Categories
For email sentiment analysis, we’ll use three clear categories:
- Positive: Grateful, satisfied, complimentary emails
- Negative: Complaints, frustration, dissatisfaction
- Neutral: Informational, requests, general inquiries
Step 2: Create Diverse Email Examples
Here’s our data creation script with detailed examples:
touch data_creation.py
# Create data_creation.py
import json
import random
from typing import List, Dict


def create_training_example(subject: str, email_body: str, sentiment: str) -> Dict[str, str]:
    """Create a properly formatted training example"""
    # Create the prompt in a consistent format
    prompt = f"""Classify the sentiment of this email as positive, negative, or neutral.

Subject: {subject}
Email: {email_body}

Sentiment:"""
    # The completion should start with a space for proper tokenization
    completion = f" {sentiment}"
    return {
        "prompt": prompt,
        "completion": completion
    }


def generate_positive_examples() -> List[Dict[str, str]]:
    """Generate positive sentiment email examples"""
    positive_examples = [
        {
            "subject": "Thank you for excellent service",
            "body": "I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional, and my issue was resolved quickly.",
            "sentiment": "positive"
        },
        {
            "subject": "Great job on the project",
            "body": "The deliverables exceeded our expectations. The attention to detail and quality of work was impressive. Looking forward to future collaborations.",
            "sentiment": "positive"
        },
        {
            "subject": "Wonderful experience",
            "body": "Just wanted to share that our experience with your service has been fantastic. The staff is knowledgeable and always willing to help.",
            "sentiment": "positive"
        },
        {
            "subject": "Love the new features",
            "body": "The latest update is amazing! The new features make everything so much easier. Thank you for listening to user feedback.",
            "sentiment": "positive"
        },
        {
            "subject": "Highly recommend",
            "body": "I've been using your service for months now and I'm consistently impressed. The reliability and quality are top-notch.",
            "sentiment": "positive"
        }
    ]
    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
            for ex in positive_examples]


def generate_negative_examples() -> List[Dict[str, str]]:
    """Generate negative sentiment email examples"""
    negative_examples = [
        {
            "subject": "Disappointed with service",
            "body": "I'm extremely frustrated with the poor quality of support I received. My issue has been ongoing for weeks without resolution.",
            "sentiment": "negative"
        },
        {
            "subject": "System outage - unacceptable",
            "body": "The constant system failures are disrupting our business operations. This is the third outage this month and it's completely unacceptable.",
            "sentiment": "negative"
        },
        {
            "subject": "Billing error needs immediate attention",
            "body": "I've been charged incorrectly for the third time. This is becoming a serious problem and I'm losing confidence in your billing system.",
            "sentiment": "negative"
        },
        {
            "subject": "Very poor customer experience",
            "body": "The representative was unhelpful and seemed disinterested in solving my problem. I've never experienced such poor customer service.",
            "sentiment": "negative"
        },
        {
            "subject": "Product quality issues",
            "body": "The product arrived damaged and doesn't match the description. I'm disappointed and expect a full refund immediately.",
            "sentiment": "negative"
        }
    ]
    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
            for ex in negative_examples]


def generate_neutral_examples() -> List[Dict[str, str]]:
    """Generate neutral sentiment email examples"""
    neutral_examples = [
        {
            "subject": "Account information update",
            "body": "Please update my billing address to the new address I provided. Let me know when this has been completed.",
            "sentiment": "neutral"
        },
        {
            "subject": "Question about pricing",
            "body": "Could you provide information about your enterprise pricing plans? We're evaluating options for our team of 50 users.",
            "sentiment": "neutral"
        },
        {
            "subject": "Meeting reschedule request",
            "body": "I need to reschedule our meeting from Tuesday to Thursday due to a scheduling conflict. Please confirm if this works.",
            "sentiment": "neutral"
        },
        {
            "subject": "Documentation request",
            "body": "Can you send me the technical documentation for the API integration? I need this for our development team.",
            "sentiment": "neutral"
        },
        {
            "subject": "Password reset",
            "body": "I'm unable to access my account and need to reset my password. Please send reset instructions to this email address.",
            "sentiment": "neutral"
        }
    ]
    return [create_training_example(ex["subject"], ex["body"], ex["sentiment"])
            for ex in neutral_examples]


def create_balanced_dataset() -> List[Dict[str, str]]:
    """Create a balanced dataset with equal representation"""
    print("Creating balanced email sentiment dataset...")
    # Generate examples for each category
    positive_examples = generate_positive_examples()
    negative_examples = generate_negative_examples()
    neutral_examples = generate_neutral_examples()
    # Combine all examples
    all_examples = positive_examples + negative_examples + neutral_examples
    # Shuffle to avoid category clustering
    random.shuffle(all_examples)
    print(f"Created {len(all_examples)} training examples:")
    print(f" Positive: {len(positive_examples)}")
    print(f" Negative: {len(negative_examples)}")
    print(f" Neutral: {len(neutral_examples)}")
    return all_examples


def save_training_data(examples: List[Dict[str, str]], filename: str = "training_data.jsonl"):
    """Save training data in JSONL format"""
    with open(filename, 'w') as f:
        for example in examples:
            f.write(json.dumps(example) + '\n')
    print(f"✅ Saved {len(examples)} examples to {filename}")


def preview_examples(examples: List[Dict[str, str]], num_preview: int = 3):
    """Preview some training examples"""
    print(f"\n📋 Preview of {num_preview} training examples:")
    print("=" * 80)
    for i, example in enumerate(examples[:num_preview]):
        print(f"\nExample {i+1}:")
        print(f"Prompt:\n{example['prompt']}")
        print(f"Expected completion: '{example['completion']}'")
        print("-" * 40)


if __name__ == "__main__":
    # Create the dataset
    training_examples = create_balanced_dataset()
    # Preview some examples
    preview_examples(training_examples)
    # Save to file
    save_training_data(training_examples)
    print("\n🎉 Training data creation complete!")
So, let’s examine the data we just created and understand its format.
After running python data_creation.py, you will see this output and a new file:
Terminal Output:
Creating balanced email sentiment dataset…
Created 15 training examples:
Positive: 5
Negative: 5
Neutral: 5
Saved 15 examples to training_data.jsonl
Training data creation complete!
New File Created:
- training_data.jsonl (2-3 KB) – Your training dataset
Understanding JSONL Format
JSONL (JSON Lines) is a widely used format for ML training data. Unlike regular JSON, each line is a separate JSON object:
Regular JSON:
[
{"prompt": "...", "completion": " positive"},
{"prompt": "...", "completion": " negative"}
]
JSONL (what we created):
{"prompt": "...", "completion": " positive"}
{"prompt": "...", "completion": " negative"}
Why JSONL for training?
- Memory efficient: Process one example at a time
- Streamable: Handle huge datasets without loading everything
- Widely supported: Many ML tools, including the MLX fine-tuning tools used later, read it directly
Your training_data.jsonl contains 15 examples (5 positive, 5 negative, 5 neutral) – each line teaching the model how to classify email sentiment. This file is the foundation for everything that follows.
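The one-example-at-a-time property is easy to see in code. The sketch below reads JSONL lazily with a generator and tallies the labels. The iter_jsonl and count_labels helpers are hypothetical names introduced for illustration, and the sample lines stand in for a real file so the snippet runs on its own.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one parsed record per non-empty line, without loading everything."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_labels(lines: Iterable[str]) -> Counter:
    """Count completions, ignoring the leading space used for tokenization."""
    return Counter(record["completion"].strip() for record in iter_jsonl(lines))


# Stand-in for an open file handle; any iterable of lines works
sample_lines = [
    '{"prompt": "...", "completion": " positive"}\n',
    '{"prompt": "...", "completion": " negative"}\n',
    '\n',
    '{"prompt": "...", "completion": " positive"}\n',
]
label_counts = count_labels(sample_lines)
print(label_counts)
```

Because an open file is itself an iterable of lines, the same function can be pointed at the real dataset with something like count_labels(open("training_data.jsonl")), which should report five examples per label for the file created above.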
Converting training data to MLX format
What is MLX?
- MLX format refers to the specific data format expected by MLX (Apple’s machine learning framework for Apple Silicon).
- Apple’s ML framework optimized for M1/M2/M3 chips
- Designed to leverage Apple Silicon’s unified memory architecture
- Efficient for training and running models on Mac hardware
MLX Training Data Format:
- Uses JSONL (JSON Lines) where each line contains a single JSON object
- Each object has a text field with the complete training example
- Format: {"text": "your complete training text here"}
Why the specific format?
MLX’s fine-tuning tools expect this simple structure so they can:
- Stream data efficiently during training
- Tokenize the text field as written (with this plain text format, any chat-template markers you want must already be part of the text)
- Handle tokenization and batching internally
Original Format (JSONL):
{
"prompt": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment:",
"completion": " positive"
}
MLX Format (after conversion):
{
"text": "Classify the sentiment of this email as positive, negative, or neutral.\n\nSubject: Thank you for excellent service\nEmail: I wanted to express my gratitude for the outstanding support I received. The team was helpful and professional.\n\nSentiment: positive"
}
Key Difference:
- Original: Separate prompt and completion fields
- MLX: Single text field combining both (concatenated together)
The conversion essentially does: text = prompt + completion
touch convert_to_mlx.py
# Create convert_to_mlx.py
import json
import os
from pathlib import Path


def convert_to_mlx_format(input_file: str = "training_data.jsonl",
                          output_dir: str = "data/mlx_format"):
    """Convert JSONL training data to MLX format"""
    print(f"Converting {input_file} to MLX format...")
    # Create output directory
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Read training data
    examples = []
    with open(input_file, 'r') as f:
        for line in f:
            if line.strip():
                example = json.loads(line)
                # MLX format combines prompt and completion into a single text field
                text = example['prompt'] + example['completion']
                examples.append({"text": text})
    print(f"✅ Converted {len(examples)} examples")
    # Hold out a small validation set (10% of the data, at least one example)
    val_size = max(1, len(examples) // 10)
    val_examples = examples[:val_size]
    train_examples = examples[val_size:]
    # Save training data (validation examples excluded)
    train_file = os.path.join(output_dir, "train.jsonl")
    with open(train_file, 'w') as f:
        for example in train_examples:
            f.write(json.dumps(example) + '\n')
    # Save validation data
    val_file = os.path.join(output_dir, "valid.jsonl")
    with open(val_file, 'w') as f:
        for example in val_examples:
            f.write(json.dumps(example) + '\n')
    print(f"✅ Created train set: {len(train_examples)} examples in {train_file}")
    print(f"✅ Created validation set: {len(val_examples)} examples in {val_file}")
    return len(train_examples), len(val_examples)


def preview_mlx_format(output_dir: str = "data/mlx_format"):
    """Preview the MLX formatted data"""
    train_file = os.path.join(output_dir, "train.jsonl")
    print("\n📋 Preview of MLX formatted data:")
    print("=" * 80)
    with open(train_file, 'r') as f:
        for i, line in enumerate(f):
            if i >= 2:  # Show first 2 examples
                break
            example = json.loads(line)
            print(f"\nExample {i+1}:")
            print(f"Text: {example['text'][:200]}...")  # Show first 200 chars
            print("-" * 40)


if __name__ == "__main__":
    # Convert the data
    train_count, val_count = convert_to_mlx_format()
    # Preview the results
    preview_mlx_format()
    print(f"\n🎉 MLX format conversion complete!")
    print(f"Ready for training with {train_count} examples")
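The script holds out 10% of the examples for validation (always at least one) and uses the rest for training. With our 15 examples, that works out to 1 validation example and 14 training examples.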
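The split arithmetic can be sketched on its own. The split_examples helper below is a hypothetical variant, not the function from the script: it applies the same "10%, but at least one" rule and additionally shuffles with a fixed seed (the seed value is an arbitrary assumption) so the split is reproducible even if the input file was not shuffled beforehand.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_examples(examples: List[T],
                   val_fraction: float = 0.1,
                   seed: int = 42) -> Tuple[List[T], List[T]]:
    """Return (train, validation) lists with at least one validation example."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle gives a repeatable split
    val_size = max(1, int(len(shuffled) * val_fraction))
    return shuffled[val_size:], shuffled[:val_size]


# 15 placeholder items stand in for the 15 training examples
train, val = split_examples(list(range(15)))
print(len(train), len(val))
```

With a dataset this small, a single validation example gives only a rough signal during training, so the validation loss is best read as a sanity check rather than a precise measurement.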
Run the conversion:
python3 convert_to_mlx.py
After running python3 convert_to_mlx.py, you will see two new files created under data/mlx_format/:
- train.jsonl
- valid.jsonl
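Before moving on, it can be worth checking that the converted files look the way we expect. The sketch below is one possible sanity check, with check_text_records as a hypothetical helper: it verifies that every record has a non-empty text field and that the text ends with one of our three labels. The label check is specific to this dataset and is an assumption of the sketch, not something MLX requires.

```python
import json
from typing import Iterable

LABELS = ("positive", "negative", "neutral")


def check_text_records(lines: Iterable[str]) -> int:
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"line {line_number}: missing or empty 'text' field")
        if not text.rstrip().endswith(LABELS):
            raise ValueError(f"line {line_number}: text does not end with a known label")
        count += 1
    return count


# Stand-in for the lines of data/mlx_format/train.jsonl
good_lines = [
    '{"text": "Subject: Hi\\nEmail: Thanks a lot!\\n\\nSentiment: positive"}\n',
    '{"text": "Subject: Update\\nEmail: Please change my address.\\n\\nSentiment: neutral"}\n',
]
valid_count = check_text_records(good_lines)
print(valid_count)
```

Pointing the function at an open handle for train.jsonl or valid.jsonl runs the same check against the real files.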
Now the data is ready. In the next part, we get to the meat of this series: running the fine-tuning itself.