One Friday, as my roommate and I were getting ready for dinner at a friend’s place, I asked if she knew what multithreading is. We come from different backgrounds and often end up sharing random things about our fields, whether we’re on the way to the grocery store, cooking, or, like now, getting ready to go out. To explain multithreading to her, I came up with a made-up scenario where she and I were two threads given a task to complete. As I was explaining it, I realized something was missing. I then took a moment to refine the made-up scenario, not only to explain the concept better but also to figure out if it was actually helpful or if we’d need something else. This article is inspired by our brief discussion.
Here is the scenario –
She (thread_jo) and I (thread_ss) are in the same room (core), inside an apartment (CPU). Our task (Python function) is to count the frequency of words in a large document. The document is a PDF of Harry Potter and the Philosopher’s Stone. We are each given separate portions (chunks) of the document to process, i.e., to count the frequency of words in our respective parts (local resource). As we count, we must keep merging the results into a shared frequency count dictionary (shared resource), which holds data like: harry → 100,000, meaning the word “harry” appears in the book 100,000 times.
We’d write the code to execute the steps below –
- Read the document sequentially.
- Divide it into two chunks for two threads.
- Define the task/function that would clean data and count the words.
- Define two threads that will execute the function.
- Get the final shared frequency count dictionary.
The Python code would look something like this –
class WordCount:
    def __init__(self):
        # Shared word count dictionary
        self.word_count_dictionary = defaultdict(int)

    # 3. Define the task/function that would clean data and count the words
    def task_clean_and_count(self, chunk, thread_name):
        print(f"{thread_name} starting work")
        for index in range(len(chunk)):
            # Clean and process text
            # Calculate the word count
            # Update the shared word count dictionary
            pass  # (filled in by the complete code later in the article)
        print(f"Thread {thread_name}: update complete")
if __name__ == "__main__":
    # 1. Read the document sequentially, meaning with the main thread, not through multithreading
    doc = fitz.open('harry-potter-and-the-philosophers-stone-by-jk-rowling.pdf')
    num_pages = len(doc)

    # 2. Divide the data into two chunks for two threads
    mid = num_pages // 2
    chunk1 = doc[:mid]
    chunk2 = doc[mid:]

    word_count = WordCount()
    print("Starting Task.")

    # 4. Define two threads that will execute the function
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        # Thread 1
        executor.submit(word_count.task_clean_and_count, chunk1, 'thread_jo')
        # Thread 2
        executor.submit(word_count.task_clean_and_count, chunk2, 'thread_ss')
    print("Task Ended.")

    # 5. Get the final shared frequency count dictionary
    top_words = sorted(word_count.word_count_dictionary.items(), key=lambda i: -i[1])[:10]
    print(f"Top 10 words: {top_words}")
Synchronization
It’s a bit more complicated than it seems, though. We need to make sure we’re not writing to the final frequency count dictionary (shared resource) at the same time, because we might overwrite each other’s updates and end up with incorrect results. To ‘synchronize’ our updates, we place a lock on the shared resource. When we want to update it, we first request the lock; if it’s available, we acquire it, make our update, and then release the lock so the other person can make theirs. This also makes the shared resource thread-safe, meaning it can be used by multiple threads at the same time without causing problems like data corruption.
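To see why the lock matters, here is a minimal toy sketch of my own (not part of our word-count code) where two threads race on a shared counter. Depending on your CPython version, you may need heavier work per iteration to actually observe a lost update, but the read-modify-write hazard is real either way –

import threading

counter = 0

def increment_many(n):
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write, not atomic: updates can interleave and get lost

threads = [threading.Thread(target=increment_many, args=(1_000_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without a lock, this can print less than 2,000,000
print(counter)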
Let’s add a lock to the above code –
class WordCount:
    def __init__(self):
        # Shared word count dictionary
        self.word_count_dictionary = defaultdict(int)
        self._lock = threading.Lock()

    # 3. Define the task/function that would clean data and count the words
    def task_clean_and_count(self, chunk, thread_name):
        print(f"{thread_name} starting work")
        for index in range(len(chunk)):
            time.sleep(0.5)  # delay per page, to observe interleaving (see below)
            # Clean and process text
            # Calculate the word count
            with self._lock:
                # Update the shared word count dictionary
                pass  # (filled in by the complete code later in the article)
        print(f"Thread {thread_name}: update complete")
if __name__ == "__main__":
    # 1. Read the document sequentially, meaning with the main thread, not through multithreading
    doc = fitz.open('harry-potter-and-the-philosophers-stone-by-jk-rowling.pdf')
    num_pages = len(doc)

    # 2. Divide the data into two chunks for two threads
    mid = num_pages // 2
    chunk1 = doc[:mid]
    chunk2 = doc[mid:]

    word_count = WordCount()

    # 4. Define two threads that will execute the function
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        # Thread 1
        executor.submit(word_count.task_clean_and_count, chunk1, 'thread_jo')
        # Thread 2
        executor.submit(word_count.task_clean_and_count, chunk2, 'thread_ss')

    # 5. Get the final shared frequency count dictionary
    top_words = sorted(word_count.word_count_dictionary.items(), key=lambda i: -i[1])[:10]
    print(f"Top 10 words: {top_words}")
It’s important to note that we have no way of predicting which thread will be working at any given moment; from our perspective, it’s random. Technically, thread scheduling is handled by the operating system (OS) and is entirely out of our hands, though we can sometimes influence it a little with thread priorities or synchronization mechanisms.
If I add a delay, like time.sleep(0.5), in the code above, I can observe interleaving behavior. Interleaving means the two threads appear to take turns executing, switching back and forth as they progress. In the output below, which shows just 5 pages per thread, we can see the two threads working alternately. This is just one way multithreaded execution can be nudged to some extent, rather than being left entirely to the operating system.
Starting Task.
thread_jo starting work
Thread thread_jo: Page 0
thread_ss starting work
Thread thread_ss: Page 0
thread_jo acquired lock
thread_jo released lock
Thread thread_jo: Page 1
thread_ss acquired lock
thread_ss released lock
Thread thread_ss: Page 1
thread_ss acquired lock
thread_ss released lock
Thread thread_ss: Page 2
thread_jo acquired lock
thread_jo released lock
Thread thread_jo: Page 2
thread_ss acquired lock
thread_ss released lock
Thread thread_ss: Page 3
thread_jo acquired lock
thread_jo released lock
Thread thread_jo: Page 3
thread_jo acquired lock
thread_jo released lock
Thread thread_jo: Page 4
thread_ss acquired lock
thread_ss released lock
Thread thread_ss: Page 4
thread_ss acquired lock
thread_ss released lock
Thread thread_ss: update complete
thread_jo acquired lock
thread_jo released lock
Thread thread_jo: update complete
Task Ended.
Caveat of Multithreading
So, this seems fine: we can keep making progress on our own, keep updating the shared dictionary, and just vibe with however the OS schedules us. But here’s what I missed explaining: we can’t actually do our work in parallel!
We are basically using a Python function to calculate the counts, meaning we’re in a Python environment, and Python (CPython, the standard interpreter) has something called the Global Interpreter Lock (GIL). The GIL allows only one thread to execute Python code at a time. It’s like having just one working laptop: even though both of us want to work, only one can use it at a time, so we must take turns. Either she can work, or I can, but we can’t both work at the same time, i.e., in parallel.
In fact, when I run the complete code with multithreading versus a single thread (the main thread), I get roughly the same runtime, which in this case was around 0.34 seconds.
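Here is a minimal sketch of my own (a toy benchmark, not the word-count code) that shows the same effect, assuming standard CPython with the GIL. Exact timings will vary by machine, but the threaded version won’t be meaningfully faster for pure-Python CPU work –

import time
from concurrent.futures import ThreadPoolExecutor

def burn_cpu(n):
    # Pure-Python arithmetic: the running thread holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

start = time.time()
burn_cpu(N)
burn_cpu(N)
print(f"Sequential: {time.time() - start:.2f}s")

start = time.time()
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(burn_cpu, N)
    executor.submit(burn_cpu, N)
print(f"Two threads: {time.time() - start:.2f}s (roughly the same)")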
Complete code for multithreading –
# imports
import re
from nltk.corpus import stopwords
import nltk
from collections import defaultdict
import threading
import time
import concurrent.futures
import fitz

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

class WordCount:
    def __init__(self):
        # Shared word count dictionary
        self.word_count_dictionary = defaultdict(int)
        self._lock = threading.Lock()

    # 3. Define the task/function that would clean data and count the words
    def task_clean_and_count(self, chunk, thread_name):
        print(f"{thread_name} starting work")
        for index in range(len(chunk)):
            # print(f"Thread {thread_name}: Page {index}")
            input_text = chunk[index].get_text()
            # time.sleep(0.5)
            # Clean and process text
            cleaned = re.sub(r'[^A-Za-z\s]', '', input_text).lower()
            cleaned = re.sub(r'\s+', ' ', cleaned).strip()
            words = cleaned.split()
            # Update shared dictionary with lock (page by page)
            with self._lock:
                # print(f"{thread_name} acquired lock")
                for word in words:
                    if word not in stop_words:
                        self.word_count_dictionary[word] += 1
                # print(f"{thread_name} released lock")
        print(f"Thread {thread_name}: update complete")
if __name__ == "__main__":
    # 1. Read the document sequentially, meaning with the main thread, not through multithreading
    doc = fitz.open('harry-potter-and-the-philosophers-stone-by-jk-rowling.pdf')
    num_pages = len(doc)

    # 2. Divide the data into two chunks for two threads
    mid = num_pages // 2
    chunk1 = doc[:mid]
    chunk2 = doc[mid:]

    word_count = WordCount()
    print("Starting Task.")
    start_time = time.time()

    # 4. Define two threads that will execute the function
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.submit(word_count.task_clean_and_count, chunk1, 'thread_jo')
        executor.submit(word_count.task_clean_and_count, chunk2, 'thread_ss')
    print("Task Ended.")

    end_time = time.time()
    total_time = end_time - start_time
    print(f"Total time taken: {total_time:.2f} seconds")

    # 5. Get the final shared frequency count dictionary
    top_words = sorted(word_count.word_count_dictionary.items(), key=lambda i: -i[1])[:10]
    print(f"Top 10 words: {top_words}")
Is Multithreading helpful?
She was confused and asked, “What’s even the point of both of us working? The time would be the same even if just one of us did the work.”
And she’s right. Multithreading might be more helpful in scenarios where there’s a mix of work. For example, if we had to fetch documents from outside our apartment (CPU), which is an I/O (Input/Output) task, we could save time: one person calculates while the other waits for documents to arrive.
“It’s confusing, though,” she said. “Doesn’t that GIL thing you mentioned stop that from happening, like running two threads at the same time?”
So, counting words is a CPU-bound task, but reading files is an I/O-bound task. The GIL only limits Python code that is actively running on the CPU. When a thread is just waiting for something, like file or network access, it releases the GIL, and other threads can work in the meantime. So, when both types of work are involved, threads can still help, since waiting time overlaps with useful work.
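Here is a minimal sketch of that overlap, using a hypothetical fetch_document stand-in of my own; time.sleep simulates the I/O wait and releases the GIL just like a real file or network operation would –

import time
from concurrent.futures import ThreadPoolExecutor

def fetch_document(name):
    # Hypothetical stand-in for a network/disk read; sleeping releases the GIL
    time.sleep(1.0)
    return f"{name} fetched"

start = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    # All four waits overlap, so this takes about 1 second, not 4
    results = list(executor.map(fetch_document, [f"doc{i}" for i in range(4)]))
print(f"Fetched {len(results)} documents in {time.time() - start:.2f}s")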
She understood but wasn’t fully convinced about using multithreading for this problem, and once again, she’s right.
How we optimized our time and what’s next?
Before I could tell her about the solution that might actually speed things up, our Uber arrived. So, while we were busy discussing multithreading in our apartment (CPU), she had already booked the cab (I/O) in parallel. Our conversation and the I/O task overlapped, thus saving time. So, I guess, in the end, the two of us threads did manage to optimize our time a bit.
Let’s end with a scenario where multithreading might be helpful. Say we are web scraping and organizing metadata for a database. While organizing data is a CPU-bound operation and cannot be sped up using multiple threads (due to Python’s GIL), web scraping is an I/O-bound operation and can benefit from multithreading. So, while some threads are busy waiting for web responses, others can work on organizing data, effectively overlapping waiting time with useful work, as in the sketch below.
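Here is a rough sketch of that setup; scrape and organize are hypothetical stand-ins (time.sleep mimics the network wait), not code from a real scraper –

import time
import concurrent.futures

def scrape(url):
    # Hypothetical stand-in for an HTTP request; the wait releases the GIL
    time.sleep(0.5)
    return f"<html>metadata for {url}</html>"

def organize(html):
    # CPU-bound parsing/cleanup runs one thread at a time under the GIL
    return html.upper()

urls = [f"https://example.com/page{i}" for i in range(8)]
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(scrape, u) for u in urls]
    # Organize each page as soon as its fetch finishes, overlapping waits with work
    records = [organize(f.result()) for f in concurrent.futures.as_completed(futures)]
print(f"Organized {len(records)} pages in {time.time() - start:.2f}s")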
But what can speed up CPU-bound tasks? Multiprocessing, maybe. We’ll save that for another random Friday, and you can find out in the next article if she’s convinced. Fingers crossed!!