This content originally appeared on DEV Community and was authored by Budi Widhiyanto
I’m working on a national project to collect health data from legacy systems in two pilot districts in Indonesia. The goal is to create interoperability between different healthcare systems so we can make better healthcare decisions based on complete patient data. It’s an important project, and it’s been a challenging one.
One of the biggest challenges has been patient deduplication. We’re collecting data from multiple legacy systems, and each one has its own way of storing patient information. When we convert all this data to FHIR R4 format, we end up with duplicate patient records—the same person appearing multiple times in our system because they exist in multiple source systems.
This is the story of how I built a patient deduplication system that processes thousands of records in minutes instead of hours, and the lessons I learned from approaches that didn’t work. If you’re working with FHIR data from multiple sources or dealing with patient deduplication in healthcare systems, I hope my experience helps you.
The Beginning: Starting with a Partner’s Approach
We’re working with a technology partner on this national interoperability project. They had already built systems for patient matching in their own applications, and they shared their approach with us. Their method seemed reasonable: when creating a new patient, first search for existing patients using gender and birthdate filters, then apply fuzzy matching on the patient’s name. If you find a good match, use that patient ID. Otherwise, create a new patient. If gender or birthdate are missing, fall back to using NIK (Indonesian National Identity Number) for exact matching.
I implemented their approach in our FHIR converter. Here’s what it looked like:
```python
from rapidfuzz import fuzz  # or: from thefuzz import fuzz

def getPatientIdWithFuzzyLogic(internal_id, nik, name, birthdate, gender, parent_name):
    # Note: internal_id and parent_name are part of the partner's signature
    # but unused in this strategy
    # Strategy 1: Use gender and birthdate if available
    if gender and birthdate:
        # Get patients matching gender and birthdate
        params = {'gender': gender, 'birthdate': birthdate}
        candidates = get_patients_with_params(params)
        if len(candidates) > 0:
            # Apply fuzzy matching on names
            for patient in candidates:
                score = fuzz.token_sort_ratio(name, patient["name"])
                if score >= FUZZY_THRESHOLD:
                    return [patient["id"]]

    # Strategy 2: Fall back to NIK if gender/birthdate didn't work
    if nik:
        nik_patients = get_patients_with_params({'identifier': nik})
        if len(nik_patients) > 0:
            return [nik_patients[0]["id"]]

    # No match found, create new patient
    return None
```
The logic seemed solid. Search by demographics first, verify with name matching, fall back to NIK if needed. I deployed it and started converting patient data from our legacy systems.
That’s when the problems started appearing.
The Failed Solution: Why the Partner’s Method Didn’t Work
The partner’s method worked well in their own internal systems, but it didn’t work for our interoperability project. I discovered two fundamental problems that made their approach unsuitable for our needs.
Problem 1: Missing Data in Legacy Systems
The partner’s method relied heavily on having gender and birthdate for every patient. But we had data quality issues. Many patient records were missing gender or birthdate fields. When that happened, the search by demographics would fail, and we’d fall back to NIK matching. But if NIK was also missing or inconsistent, we’d create a duplicate patient.
I started seeing duplicate patients in our FHIR server. The same person would appear multiple times because the legacy data from different sources had different levels of completeness. One source might have gender and birthdate, another might only have NIK, and a third might have partial information. The fuzzy matching couldn’t handle this inconsistency reliably.
Problem 2: Pagination Limits
The bigger problem was the FHIR API’s pagination limit. Our FHIR server returns a maximum of 100 records per search request. When I searched for patients by gender and birthdate, I’d get the first 100 results. If there were more than 100 patients matching those criteria (which is common for popular birthdates), I’d need to paginate through all the results to find the right patient.
But the partner’s code didn’t handle pagination. It only looked at the first page of results. If the patient I was looking for was on page 2 or page 3, the search would miss them, and the converter would create a duplicate.
I could have fixed the pagination issue by implementing proper page-through logic, but that would make every patient search much slower—potentially making multiple API calls just to check if a patient exists. For batch conversion of thousands of patients, this would be too slow.
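For context, the page-through logic I decided against looks roughly like this — a sketch in which `fetch_bundle` is a hypothetical helper that GETs a URL and returns the parsed Bundle as a dict:

```python
def iterate_search_results(fetch_bundle, initial_url):
    """Yield every matching resource, following the Bundle's 'next' links.

    `fetch_bundle` is a hypothetical helper that GETs a URL and returns
    the parsed FHIR Bundle as a dict.
    """
    url = initial_url
    while url:
        bundle = fetch_bundle(url)
        for entry in bundle.get('entry', []):
            yield entry['resource']
        # Follow the 'next' link; stop when the server omits it
        url = next(
            (link['url'] for link in bundle.get('link', [])
             if link.get('relation') == 'next'),
            None,
        )
```

Every extra page is an extra round trip, which is exactly why this approach was too slow for checking thousands of patients one at a time.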
The Real Problem
The partner’s method was built for their internal systems, where they controlled the data quality and had different constraints. Our situation was different. We were collecting data from multiple independent legacy systems, each with its own data quality issues, and we needed to process it efficiently at scale.
I needed a different approach—one that worked with the data we actually had, not the data we wished we had.
Finding a Better Way
I went back to analyze what reliable data we did have. The answer was NIK—the Indonesian National Identity Number. Almost every patient in our system had a NIK, and it was consistent across different legacy systems. It’s a 16-digit number, always formatted the same way, and it uniquely identifies a person.
Why was I treating NIK as a fallback? It should be the primary method. NIK is more reliable than gender or birthdate for identifying patients in Indonesia. Gender and birthdate can be missing or inconsistent, but NIK is designed to be unique.
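Because NIK has a fixed shape — 16 digits, always formatted the same way — even a minimal sanity check catches many entry errors before matching. A sketch (simplified: real NIKs also encode region and birthdate, which this does not validate):

```python
import re

NIK_PATTERN = re.compile(r'\d{16}')  # exactly 16 digits, nothing else

def is_plausible_nik(value):
    """Cheap sanity check before using a NIK for matching.

    Only validates length and digit-ness; the internal structure of a
    real NIK (region codes, encoded birthdate) is not checked here.
    """
    return bool(value) and bool(NIK_PATTERN.fullmatch(str(value).strip()))
```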
Building the Reference System: NIK-First Strategy
I redesigned the patient matching system to use NIK as the primary identifier, with demographic matching as a fallback only when necessary. Here’s the new approach:
```python
from rapidfuzz import fuzz  # or: from thefuzz import fuzz

nik_cache = {}  # NIK -> patient ID, reused across the whole conversion run

def getPatientIdByNIK(nik):
    """Get patient ID by NIK with caching"""
    if nik in nik_cache:
        return nik_cache[nik]
    params = {'identifier': nik}
    patients = get_patients_with_params(params)
    if len(patients) > 0:
        patient_id = patients[0]["id"]
        nik_cache[nik] = patient_id  # Cache for future lookups
        return patient_id
    return None

def getPatientIdWithFuzzyLogic(nik, name, birthdate, gender, parent_name):
    # Strategy 1: Try NIK exact match first (most reliable)
    if nik:
        nik_patients = get_patients_with_params({'identifier': nik, 'active': True})
        if len(nik_patients) == 1:
            # Single match - verify with name fuzzy matching
            patient = nik_patients[0]
            patient_name = get_full_name(patient)
            score = fuzz.token_sort_ratio(name, patient_name)
            if score >= FUZZY_THRESHOLD:
                return [patient["id"]]
        elif len(nik_patients) > 1:
            # Multiple NIK matches - use fuzzy matching to find the best
            best_match = None
            best_score = 0
            for patient in nik_patients:
                patient_name = get_full_name(patient)
                score = fuzz.token_sort_ratio(name, patient_name)
                if score > best_score:
                    best_score = score
                    best_match = patient
            if best_score >= FUZZY_THRESHOLD:
                return [best_match["id"]]

    # Strategy 2: Fall back to demographic matching if NIK fails
    if gender and birthdate:
        params = {'gender': gender, 'birthdate': birthdate}
        candidates = get_patients_with_params(params)
        for patient in candidates:
            score = fuzz.token_sort_ratio(name, get_full_name(patient))
            if score >= FUZZY_THRESHOLD:
                return [patient["id"]]

    # No match found
    return None
```
This new system inverts the partner’s approach. Instead of searching by demographics first and falling back to NIK, I search by NIK first and fall back to demographics. This solves both problems:
- Missing data: NIK is more consistently available than gender/birthdate in our legacy systems
- Pagination: Searching by NIK returns far fewer results (usually just one), so pagination isn’t an issue
I also added a caching mechanism with getPatientIdByNIK. When converting thousands of patient records, many of them might be the same person (repeat visits, multiple encounters, etc.). By caching the NIK-to-patient-ID mapping, I avoid making redundant API calls for patients I’ve already looked up.
The fuzzy matching on names is still there as a safety check. Even when I find a patient by NIK, I verify that the name matches using fuzzy string comparison. This catches cases where NIK might have been entered incorrectly or where there might be data quality issues.
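To illustrate why a token-sorted comparison suits names (word order often differs between source systems), here is a minimal stand-in built only on the standard library — the converter itself uses a fuzzy-matching library's `token_sort_ratio`, which is more robust than this sketch:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    """Compare two strings after lowercasing and sorting their tokens,
    so 'Budi Santoso' and 'SANTOSO BUDI' score as identical (100.0)."""
    norm_a = ' '.join(sorted(a.lower().split()))
    norm_b = ' '.join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio() * 100
```

Reordered but otherwise identical names score 100, while genuinely different names fall below whatever threshold you choose.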
Testing the New System
When I tested the new NIK-first system with real data from our legacy systems, it worked much better. The converter found existing patients reliably, even when demographic data was missing. The pagination problem disappeared because NIK searches rarely return more than 100 results. And the caching made bulk conversion much faster.
I watched the logs during a test run converting 1,000 patient records: “Found patient by NIK… Found patient by NIK… Created new patient (no NIK match)… Found patient by NIK (cached)…” The system was working.
But I still had a problem. Before implementing this fix, the old system had already created duplicate patients in our FHIR server. Some NIKs had 2, 3, or even 10 duplicate patient records. While the new reference system prevented future duplicates, I needed to clean up the existing ones.
I needed a deduplication process.
The Deduplication Challenge: Two Versions
Building a system to deduplicate existing patients was a different challenge entirely. With the reference system, I was preventing new duplicates—a relatively simple task of checking before creating. With deduplication, I needed to find all existing duplicates, choose which one should be the “master,” and then update potentially thousands of medical records to point to that master instead of the duplicates.
This was going to touch a lot of data. I needed to be careful.
Version 1: Sequential Processing – The Safe, Slow Way
For my first implementation, I chose the safest possible approach: sequential processing. I would handle one NIK at a time, processing each step completely before moving to the next. No parallelization, no batch operations, just simple, linear execution.
The algorithm was straightforward:
- Find all patient IDs with the same NIK
- Select which one should be the master (I chose the most recently updated)
- Find all resources (observations, encounters, etc.) referencing the duplicate patient IDs
- Update each resource to reference the master patient ID instead
- Mark the duplicate patients as inactive
I wrote it as a simple loop:
```python
for nik in nik_list:
    # Find duplicate patients
    patient_ids = find_patients_by_nik(nik)
    if len(patient_ids) <= 1:
        continue  # No duplicates, skip

    # Select master patient (most recently updated)
    master_id = select_master_patient(patient_ids)
    duplicate_ids = [pid for pid in patient_ids if pid != master_id]

    # Find all resources referencing duplicates
    for resource_type in RESOURCE_TYPES:
        for patient_id in duplicate_ids:
            resources = fetch_resources(resource_type, patient_id)
            # Update each resource
            for resource in resources:
                update_patient_reference(resource, master_id)
                put_resource(resource)

    # Mark duplicates inactive
    for dup_id in duplicate_ids:
        mark_patient_inactive(dup_id, master_id)
```
I tested it with a single NIK first. It worked. I checked the data afterward—all the medical records now pointed to the master patient, the duplicates were marked inactive with a replaced-by link to the master. Perfect.
Then I tried it with ten NIKs. It worked, but it took 15 minutes. Okay, that’s not great, but acceptable for a cleanup operation, right?
Then I ran it on our actual list: 68 NIKs with known duplicates. I started the script, watched the logs for a few minutes, then went to get coffee. When I came back 30 minutes later, it had processed 3 NIKs. I did the math. 68 NIKs at 10 minutes each… over 11 hours.
I let it run overnight. The next morning, it had finished successfully. All the duplicates were cleaned up. But 11 hours was not acceptable. We had hundreds more NIKs to process in other datasets. At this rate, a full deduplication would take days, maybe weeks. And during that time, the script would be constantly hammering our FHIR server with API calls.
The problem was obvious: I was making way too many individual API calls. For each patient ID, I was searching for observations one at a time, then encounters one at a time, then medications, procedures, diagnostic reports—the list went on. And FHIR has a lot of resource types that can reference patients. Even though I filtered it down to the most common ones, I was still checking 27 different resource types. For each duplicate patient. Sequentially.
If a single NIK had 3 duplicate patients and each patient had 20 observations, that’s 60 individual GET requests just for observations, plus 60 individual PUT requests to update them. Multiply that by all the other resource types, and you’re talking about hundreds of API calls per NIK. No wonder it was slow.
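The arithmetic in that paragraph can be sanity-checked directly; using its illustrative figures and assuming roughly 300 ms per sequential call:

```python
# Back-of-envelope cost of sequential processing for one NIK,
# using the example figures from the paragraph above
duplicates = 3                # duplicate patients sharing the NIK
observations_each = 20        # observations per duplicate patient

gets = duplicates * observations_each   # 60 individual GET requests
puts = duplicates * observations_each   # 60 individual PUT requests
latency_s = 0.3                         # "a few hundred milliseconds" per call

time_for_observations = (gets + puts) * latency_s
print(f"{time_for_observations:.0f} s")  # ~36 s for ONE resource type of ONE NIK
```

Multiply that by 27 resource types and dozens of NIKs, and the 11-hour runtime stops being surprising.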
I watched the script run for a while, looking at the logs. The server was responding quickly—each API call only took a few hundred milliseconds. But I was only making one call at a time. The network latency, the sequential execution, it all added up. I was wasting so much time just waiting.
That’s when I remembered something. When I built the original FHIR converter, I had faced a similar problem. Converting thousands of patient records one at a time was slow. I had solved it by using batch operations and parallel processing. I could apply the same techniques here.
Version 2: Batch Processing & Parallelization – The Fast Way
The key insight was this: most of the steps in deduplication don’t depend on each other. When I’m fetching observations for a patient, I don’t need to wait for the encounters to be fetched first. When I’m updating resources, I don’t need to update them one at a time—I can batch them together.
I redesigned the system with two major optimizations: parallel resource fetching and batch updates.
For parallel fetching, I used Python’s ThreadPoolExecutor:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List

def find_all_references(self, patient_ids: List[str], master_id: str = None):
    """Find ALL resources that reference duplicate patient IDs (parallel)"""
    # Only search for duplicates, not master
    search_patient_ids = [pid for pid in patient_ids if pid != master_id]
    all_references = {}

    # Use ThreadPoolExecutor for parallel fetching
    with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
        # Submit all resource type fetches in parallel
        future_to_type = {
            executor.submit(self._fetch_resources_for_type,
                            resource_type, search_patient_ids): resource_type
            for resource_type in PATIENT_REFERENCING_RESOURCES
        }
        # Collect results as they complete
        for future in as_completed(future_to_type):
            resource_type = future_to_type[future]
            try:
                resource_type, resources = future.result()
                if resources:
                    all_references[resource_type] = resources
            except Exception as e:
                logger.error(f"Error fetching {resource_type}: {e}")

    return all_references
```
Instead of fetching observations, then encounters, then medications sequentially, I now launch all those searches in parallel. Five worker threads (configurable via MAX_WORKERS) simultaneously fetch different resource types. This alone cut the fetching time by about 80%.
But the real performance gain came from batch updates. FHIR supports bundle operations—instead of sending one resource update at a time, you can send a bundle of up to hundreds of updates in a single API call. I implemented this using FHIR batch bundles:
```python
import math
from typing import List

def _create_batch_bundle(self, resources_to_update: List[tuple]) -> dict:
    """Create a FHIR batch bundle for updating multiple resources"""
    bundle = {
        "resourceType": "Bundle",
        "type": "batch",
        "entry": []
    }
    for resource_type, resource in resources_to_update:
        bundle["entry"].append({
            "request": {
                "method": "PUT",
                "url": f"{resource_type}/{resource['id']}"
            },
            "resource": resource
        })
    return bundle

def update_all_references(self, all_references, duplicate_ids, master_id):
    """Update all references using batch bundles"""
    resources_to_update = []
    # Collect all resources that need updating
    for resource_type, resources in all_references.items():
        for resource in resources:
            if self.update_patient_references(resource, duplicate_ids, master_id):
                resources_to_update.append((resource_type, resource))

    # Split into batches of self.batch_size (100)
    num_batches = math.ceil(len(resources_to_update) / self.batch_size)
    for i in range(num_batches):
        batch_resources = resources_to_update[i * self.batch_size:(i + 1) * self.batch_size]
        # Create and execute batch bundle
        bundle = self._create_batch_bundle(batch_resources)
        result_bundle = self._execute_batch_bundle(bundle)
        # Check results
        for entry in result_bundle.get('entry', []):
            response = entry.get('response', {})
            if response.get('status', '').startswith('2'):
                self.stats['resources_updated'] += 1
```
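The `_execute_batch_bundle` helper isn't shown above; it POSTs the bundle to the server base URL and returns the response Bundle. What matters for bookkeeping is tallying the per-entry statuses, which can be factored into a small pure function — a hypothetical name, sketched on the assumption that each response entry carries an HTTP status string like `"200 OK"`:

```python
def summarize_batch_response(result_bundle: dict) -> dict:
    """Tally per-entry outcomes of a FHIR batch response Bundle.

    Each entry's response.status is an HTTP status string such as
    '200 OK' or '404 Not Found'; any 2xx counts as a success.
    """
    summary = {'succeeded': 0, 'failed': 0, 'failed_statuses': []}
    for entry in result_bundle.get('entry', []):
        status = entry.get('response', {}).get('status', '')
        if status.startswith('2'):
            summary['succeeded'] += 1
        else:
            summary['failed'] += 1
            summary['failed_statuses'].append(status)
    return summary
```

Keeping the tallying separate from the HTTP call also makes this part of the pipeline trivially unit-testable.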
I also added a smart optimization: only fetch resources for duplicate patients, not the master. Resources already pointing to the master don’t need to be fetched or updated. This simple check cut the amount of data I needed to process roughly in half:
```python
# Only search for duplicates, not master (master resources don't need updating)
search_patient_ids = [pid for pid in patient_ids if pid != master_id]
```
I ran the new version on the same 68 NIKs that had taken 11 hours before. This time, I watched the progress in real-time. The parallel fetching worked beautifully—I could see all 27 resource types being queried simultaneously. The batch updates were lightning fast—bundles of 100 resources updated in seconds.
Twenty-three minutes later, it was done. The same operation that took 11 hours now took 23 minutes. That’s roughly 30 times faster.
I ran it again on a larger dataset just to be sure. Same results. The system was consistently fast. The optimization worked.
Technical Deep Dives: Solving Real Problems
While the main architecture was solid, I ran into several challenges that required specific solutions. Let me share three that taught me the most.
Challenge 1: Pagination and URL Handling
One issue that cost me two hours of debugging was pagination. When fetching resources, FHIR servers often return results in pages. You get the first 100 results, plus a “next” link to get the next page. Simple enough, right?
Except the “next” link returned by our FHIR server had a trailing slash before the query parameters: /Patient/?_count=100&_page_token=abc. When I tried to fetch that URL, I got 404 errors. The server expected /Patient?_count=100 (no slash before the question mark).
I spent way too long staring at logs before I noticed that subtle difference. Once I saw it, the fix was simple:
```python
def _get_next_link(self, bundle: dict) -> str:
    """Get next page URL from bundle"""
    for link in bundle.get('link', []):
        if link.get('relation') == 'next':
            next_url = link.get('url')
            if next_url and '/fhir/' in next_url:
                path_and_query = next_url.split('/fhir/', 1)[1]
                # Remove trailing slash before query string
                if '/?' in path_and_query:
                    path_and_query = path_and_query.replace('/?', '?')
                return f"{self.base_url}/{path_and_query}"
    return None
```
This taught me to always validate assumptions about external APIs. Just because something looks like a standard URL doesn’t mean it will work exactly as you expect.
Challenge 2: Choosing the Right Master Patient
Initially, I just picked the most recently updated patient as the master. Simple logic: the newest record is probably the most complete. But then I realized this created a problem. If I ran the deduplication twice, I might choose a different master the second time (if one of the duplicates had been updated in between). This would cause unnecessary churn—moving all those resource references back and forth.
The solution was to implement stability: once a patient has been designated as master, it should stay the master. I did this using FHIR meta tags. When a patient becomes the master, I tag it with a “golden resource” tag that points to itself:
```python
from typing import List

def _has_self_referencing_golden_tag(self, patient: dict) -> bool:
    """Check if patient has golden resource tag pointing to itself"""
    patient_id = patient.get('id')
    if not patient_id:
        return False
    tags = patient.get('meta', {}).get('tag', [])
    for tag in tags:
        if tag.get('system') == 'http://terminology.kemkes.go.id/sp-replaced-by':
            if tag.get('code') == patient_id:
                return True
    return False

def select_master_patient(self, patient_ids: List[str]) -> str:
    """Select which patient should be the master.

    Priority:
    1. Active patient with golden resource tag (existing master)
    2. Most recently updated active patient
    """
    patients = [fetch_patient(pid) for pid in patient_ids]
    active_patients = [p for p in patients if p.get('active', True)]

    # Check for existing master first
    existing_masters = [p for p in active_patients
                        if self._has_self_referencing_golden_tag(p)]
    if existing_masters:
        return max(existing_masters,
                   key=lambda p: p.get('meta', {}).get('lastUpdated', ''))['id']
    return max(active_patients,
               key=lambda p: p.get('meta', {}).get('lastUpdated', ''))['id']
```
Now when I run deduplication, it first checks if one of the patients is already marked as master. If so, use that one. If not, pick the most recently updated and mark it as master. This ensures consistency across multiple runs.
Challenge 3: Handling Batch Failures Gracefully
When I first implemented batch updates, I used FHIR transaction bundles (type: “transaction”). These are atomic—either all updates succeed, or they all fail. This seemed safe, but it had a major problem: if even one resource in the batch had an issue, the entire batch would fail, and none of the updates would be applied.
During testing, I had a batch of 100 observations to update. One of them had a validation issue (a missing required field from old data). The entire batch failed, and I had to figure out which one was problematic. This was frustrating and slow.
The solution was to switch to batch bundles (type: “batch”) instead of transaction bundles. With batch bundles, each operation in the bundle succeeds or fails independently:
```python
bundle = {
    "resourceType": "Bundle",
    "type": "batch",  # Independent operations, not atomic
    "entry": [...]
}
```
Now if one resource in a batch of 100 fails, the other 99 still get updated successfully. I log the failure, track it in my stats, but don’t let it block the entire operation:
```python
for idx, entry in enumerate(result_bundle.get('entry', [])):
    response = entry.get('response', {})
    status = response.get('status', '')
    if status.startswith('2'):  # Success (2xx status code)
        self.stats['resources_updated'] += 1
        self.stats['batch_successes'] += 1
    else:
        # Log failure but continue
        resource_type, resource = batch_resources[idx]
        error_msg = f"Failed to update {resource_type}/{resource['id']}: {status}"
        logger.warning(error_msg)
        self.stats['errors'].append(error_msg)
        self.stats['batch_failures'] += 1
```
This makes the system much more robust. Even with messy real-world data, the deduplication completes successfully for the vast majority of resources, and I have a clear log of anything that failed.
Making It Production-Ready: The API Layer
The command-line script worked great for batch deduplication, but for ongoing operations, I needed something more accessible. I built a FastAPI wrapper that exposes the deduplication functionality as a REST API.
The API has two main endpoints:
```python
import time
from datetime import datetime

@app.post("/deduplicate")
async def deduplicate_single_nik(request: SingleNIKRequest):
    """Deduplicate patients for a single NIK"""
    start_time = time.time()
    deduplicator = FHIRPatientDeduplicator(
        base_url=FHIR_BASE_URL,
        nik_system=NIK_SYSTEM,
        api_key=API_KEY,
        batch_size=BATCH_SIZE,
        max_workers=MAX_WORKERS
    )
    patient_ids = deduplicator.find_patients_by_nik(request.nik)
    if len(patient_ids) > 1:
        deduplicator.deduplicate_by_nik(
            nik=request.nik,
            patient_ids=patient_ids,
            delete_duplicates=request.delete_duplicates
        )
    duration = time.time() - start_time
    return DeduplicationResponse(
        success=True,
        nik=request.nik,
        resources_found=deduplicator.stats['resources_found'],
        resources_updated=deduplicator.stats['resources_updated'],
        duration_seconds=round(duration, 2),
        timestamp=datetime.utcnow().isoformat()
    )
```
I added timing information so we can track how long each deduplication takes. This is useful for monitoring and capacity planning. I also added a batch endpoint that processes multiple NIKs in sequence, with per-NIK timing and summary statistics.
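The batch endpoint's core is just a sequential loop that reuses the single-NIK path and aggregates per-NIK timings. A sketch of that loop, framework-free so the shape is clear — `process_one` stands in for the single-NIK deduplication call and is an assumption of this example:

```python
import time

def deduplicate_batch(niks, process_one):
    """Run `process_one(nik)` for each NIK in order, collecting timings.

    `process_one` is a stand-in for the single-NIK deduplication call.
    NIKs are processed sequentially on purpose: two NIKs could touch
    the same resources, so parallelizing across NIKs risks conflicts.
    """
    results, total_start = [], time.time()
    for nik in niks:
        start = time.time()
        stats = process_one(nik)
        results.append({
            'nik': nik,
            'stats': stats,
            'duration_seconds': round(time.time() - start, 2),
        })
    return {
        'processed': len(results),
        'results': results,
        'total_seconds': round(time.time() - total_start, 2),
    }
```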
The API is deployed on Google Cloud Run, which handles scaling automatically. If we need to process a large batch of NIKs, we can send them to the batch endpoint and it processes them sequentially (to maintain data integrity) while still being fast thanks to the parallelization and batch updates happening under the hood.
The API also makes it easy for other teams to integrate deduplication into their workflows. They can call the endpoint whenever they import new data, and any duplicates get cleaned up automatically.
Reflection & Lessons Learned
Looking back on this project, I’m proud of what I built, but I’m also very aware of what I did wrong and what I’d do differently next time.
What Went Well
The NIK-based reference system is simple and reliable. By choosing the right unique identifier from the start, I avoided all the complexity of demographic matching. The system hasn’t created a single duplicate patient since I deployed it.
The optimization from sequential to batch/parallel processing was a huge win. Going from 11 hours to 23 minutes isn’t just about speed—it’s about practicality. At 11 hours, running deduplication was something you’d do rarely, maybe once a month, as a special operation. At 23 minutes, it’s something you can run weekly, or even daily if needed. That changes how useful the tool is.
The architectural decisions around resilience—using batch bundles instead of transactions, tracking errors but continuing, logging everything—have proven their value. The system handles real-world messy data gracefully. It doesn’t fail catastrophically because one record has a problem.
What I’d Do Differently
I should have thought about deduplication from day one. If I had implemented the NIK check in the original converter, I wouldn’t have created thousands of duplicates that needed cleaning up. This is a classic example of a small amount of foresight preventing a large amount of pain later.
I wasted a month trying to adapt the partner’s solution. I should have analyzed our specific problem more carefully first. Their demographic matching system was sophisticated and well-built, but it was solving a different problem than ours. Understanding the problem deeply before jumping to solutions would have saved a lot of time.
I should have built the parallel/batch version first, or at least earlier. I learned more from building it than I would have from just thinking about it, but if I had started with “how do I make this fast?” instead of “how do I make this work?”, I would have gotten to the good solution faster.
Technical Learnings
Batch operations are powerful. Reducing API calls from hundreds to dozens makes a massive difference. Whenever you’re doing lots of similar operations, look for a way to batch them.
Parallelization works best when operations are independent. Fetching different resource types in parallel is perfect because they don’t depend on each other. But I couldn’t parallelize the actual deduplication of different NIKs because they might reference the same resources. Understanding these dependencies is crucial.
The FHIR standard is well-designed but implementations vary. Features like batch bundles, search parameters, and pagination work slightly differently on different servers. Always test against your actual FHIR server, not just against the spec.
Real-world data is messy. Invalid formats, missing fields, duplicate identifiers—they’re all going to happen. Build your system to handle errors gracefully rather than assuming perfect data.
Future Improvements
If I were to continue improving this system, here’s what I’d add:
More sophisticated master selection. Currently I use “most recently updated” as a tiebreaker. But there are other factors that could matter—which patient has the most complete data, which one has the most recent medical records, which one was verified most recently. A scoring system could help.
Automated detection of new duplicates. Right now someone has to identify that duplicates exist and call the API. I could build a background job that periodically scans for NIKs with multiple active patients and flags them for review or automatic deduplication.
Intelligent merging of patient demographic data. When deduplicating patients, I currently just pick one master patient and mark the others inactive. But sometimes the duplicate records have complementary information—one might have a phone number, another might have an address. I could merge the best available data from all duplicates into the master patient record before marking duplicates inactive. This would ensure no valuable information is lost during deduplication.
Conclusion
Building this patient deduplication system taught me that good software engineering isn’t just about making things work—it’s about making them work well, reliably, and efficiently. It’s about thinking ahead, but also about being willing to rework things when your first approach doesn’t scale.
I made mistakes. I spent time on solutions that didn’t fit my problem. I built a slow version first when I could have built a fast one. But each of those mistakes taught me something valuable. Now I know to identify the right unique identifier before building a system around it. I know to batch operations whenever possible. I know to design for resilience, not just for the happy path.
Most importantly, I learned that performance optimization isn’t just about making things faster—it’s about making them useful. A tool that takes 11 hours to run gets used rarely. A tool that takes 23 minutes gets used regularly. Speed enables usefulness.
If you’re building something similar—whether it’s deduplication, data migration, or any kind of batch processing—I hope my journey helps you avoid some of the wrong turns I took. Think about deduplication early. Choose the right unique identifier. Build for resilience. Batch and parallelize when you can. And don’t be afraid to throw away your first version if it doesn’t scale.
The code is running in production now, quietly cleaning up duplicate patient records every week. It works. It’s fast. And most importantly, it helps make sure that when a doctor looks up a patient’s medical history, they see the complete picture. That’s what matters.