3DR-LLM: A Quantitative Methodology for the Holistic Evaluation of Large Language Models



This content originally appeared on DEV Community and was authored by Lucas Ribeiro

Introduction: Beyond Leaderboards — The Need for Multidimensional LLM Evaluation Frameworks

The field of artificial intelligence is witnessing an unprecedented proliferation of Large Language Models (LLMs), with new releases and updates arriving at a dizzying pace.¹ Organizations such as OpenAI, Google, Meta, Anthropic, and Mistral are continuously competing, each claiming the state of the art (SOTA) based on performance on standardized benchmark leaderboards.² While this rapid succession of advances indicates remarkable progress, it creates a significant challenge for researchers, developers, and strategic decision‑makers: how can we evaluate and compare these complex models in a way that is fair, comprehensive, and genuinely informative?

The central problem lies in the often one‑dimensional nature of current evaluation metrics. Leaderboards, though valuable tools, tend to focus on specific benchmarks, such as MMLU (Massive Multitask Language Understanding) for general knowledge or HumanEval for coding proficiency.⁵ This approach, while quantitative, fails to capture a holistic view of a model’s value. Critical factors such as architectural capabilities (e.g., context window size or native multimodality), accessibility (determined by license type), and the overall capability profile are often relegated to qualitative footnotes. The consequence is an incomplete understanding, where model selection may be unduly influenced by a single benchmark score, ignoring other characteristics that may be more relevant for a given application.

This report proposes an innovative solution to this methodological challenge by presenting a new framework called 3DR‑LLM. The central thesis is the adaptation of a robust data‑science methodology, 3DR‑Indexing, from a completely different application domain: data deduplication.⁷

Originally conceived to identify the most effective and efficient attributes for grouping duplicate records in large databases, 3DR‑Indexing is here reinterpreted to provide a more nuanced, multidimensional “relevance” or “promise” score for leading English‑language LLMs. This approach transcends simple performance ranking by integrating structural and functional characteristics to offer a more complete and contextualized evaluation — reflecting the multifaceted complexity of modern AI models.

Chapter 1: Foundations of the Original 3DR‑Indexing Framework

To understand the proposed adaptation, we must first examine the foundations of the original methodology. Levy de Souza Silva’s dissertation, “3DR‑Indexing: A Method for Automatic Identification of the Best Indexing Attributes in Data Deduplication,” addresses a classic and fundamental problem in data engineering: identifying duplicate records referring to the same real‑world entity.⁷ The task is computationally expensive, because pairwise comparison across a dataset with n instances yields quadratic complexity, O(n²).⁷

To mitigate this challenge, the indexing step is crucial. Its goal is to group potentially similar records into smaller, manageable “blocks,” such that exhaustive comparisons are performed only within each block. The success of the entire deduplication process critically depends on the choice of the attribute (i.e., database column, such as “Artist Name” or “Release Year”) used to create these blocks. A poor choice can lead to low effectiveness (failing to find true duplicates) or low efficiency (creating overly large blocks, resulting in prohibitive processing times).⁷ 3DR‑Indexing was designed precisely to automate the selection of the optimal indexing attribute, balancing this trade‑off.

The Core 3DR‑Indexing Metrics

3DR‑Indexing relies on four quantitative metrics extracted directly from the data to assess an attribute’s suitability for indexing.⁷

Density

Density measures the completeness and quality of an attribute. It is defined as the fraction of non‑null values relative to the total number of instances in the dataset:

Dens(a) = notNull(a) / T

where notNull(a) is the number of non‑null values for attribute a and T is the total number of instances. An attribute with low density (many missing values) is a poor candidate for indexing because it would generate a large, useless block containing all records with null values and provide little useful information for grouping.⁷

Duplicity

Duplicity evaluates an attribute’s ability to group records that are indeed duplicates. It is calculated as the proportion of values that occur more than once (duplicate values) relative to the total number of non‑null values:

Dup(a) = dupValues(a) / notNull(a)

where dupValues(a) is the number of values that occur more than once for attribute a. High duplicity is desirable in the original context, as it indicates that the attribute has values shared across multiple records, increasing the likelihood that true duplicates will be placed in the same block.⁷

Distinctiveness

Distinctiveness measures the variety or cardinality of an attribute’s values. It is the fraction of distinct values relative to the total number of non‑null values:

Dist(a) = distValues(a) / notNull(a)

where distValues(a) is the number of unique values for attribute a. In the context of data deduplication, very high distinctiveness is detrimental. An attribute like a unique record ID would have distinctiveness of 1.0, which would produce one block per record, making indexing ineffective and failing to reduce computational complexity.⁷

Repetition

Repetition estimates the average block size that would be created by an attribute. It is calculated as the ratio between the number of repeated values and the number of distinct values:

Rep(a) = (T − distValues(a)) / distValues(a)

This metric complements Distinctiveness. Very high repetition indicates that few distinct values are shared by many records, which would result in excessively large blocks and a high number of within‑block comparisons, harming efficiency.⁷
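To make these definitions concrete, here is a minimal sketch in plain Python that computes the four metrics for a hypothetical "Release Year" column (the column values and the `indexing_metrics` function name are illustrative; the dissertation's exact counting convention for dupValues may differ from the interpretation used here).

```python
from collections import Counter

def indexing_metrics(column):
    """Compute Dens, Dup, Dist, and Rep for one attribute column (toy illustration)."""
    total = len(column)
    non_null = [v for v in column if v is not None]
    counts = Counter(non_null)
    dens = len(non_null) / total
    # dupValues is interpreted here as the number of instances whose value occurs
    # more than once; the dissertation's exact counting convention may differ.
    dup = sum(c for c in counts.values() if c > 1) / len(non_null)
    dist = len(counts) / len(non_null)
    rep = (total - len(counts)) / len(counts)
    return {"Density": round(dens, 3), "Duplicity": round(dup, 3),
            "Distinctiveness": round(dist, 3), "Repetition": round(rep, 3)}

# Hypothetical "Release Year" attribute with one missing value
release_year = [1999, 1999, 2001, 2001, 2001, None, 2005, 2007]
print(indexing_metrics(release_year))
# {'Density': 0.875, 'Duplicity': 0.714, 'Distinctiveness': 0.571, 'Repetition': 1.0}
```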

The Relevance Formula and the Trade‑off Optimization

3DR‑Indexing combines these four (normalized) metrics into a single relevance score, R(a), for each attribute. The original formula was designed to find the optimal balance between effectiveness and efficiency:

R(a) = Dens(a) + Dup(a) + (1 − Dist(a)) × Dens(a) + (1 − Rep(a))

The logic is clear: the formula rewards attributes that are complete (high Density) and that effectively group duplicates (high Duplicity). Simultaneously, it penalizes attributes that create too many small blocks (high Distinctiveness, hence the term (1 − Dist(a))), or that create blocks that are too large and inefficient (high Repetition, hence (1 − Rep(a))). The interaction term (1 − Dist(a)) × Dens(a) weights the penalty on distinctiveness by attribute quality, avoiding unduly high scores for low‑quality attributes.⁷

The central philosophy of 3DR‑Indexing is not to find the “most precise” attribute in isolation, but the attribute that optimizes the global trade‑off. The choice of evaluation axis (the attribute) has a disproportionate impact on the final outcome, potentially altering F‑Measure by up to 44% and processing time by orders of magnitude.⁷ This balanced, multidimensional evaluation philosophy underpins its adaptation to the LLM domain.
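As a quick illustration of this trade-off, the sketch below applies R(a) to two hypothetical attributes with made-up, already normalized metric values: a moderately shared "Artist Name" column and a unique "Record ID" column.

```python
def relevance(dens, dup, dist, rep):
    """Original 3DR-Indexing relevance: rewards Density and Duplicity,
    penalizes high Distinctiveness and high Repetition."""
    return dens + dup + (1 - dist) * dens + (1 - rep)

# Hypothetical "Artist Name": complete, shared by several records, moderate block sizes
print(round(relevance(dens=0.95, dup=0.60, dist=0.40, rep=0.30), 2))  # 2.82
# Hypothetical "Record ID": complete but unique per record, so it cannot group duplicates
print(round(relevance(dens=1.00, dup=0.00, dist=1.00, rep=0.00), 2))  # 2.0
```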

Chapter 2: Conceptual Adaptation — Reinterpreting the Metrics for the LLM Domain

Transposing 3DR‑Indexing to the LLM domain requires a fundamental analogical leap. In this new paradigm, a Large Language Model (LLM) is treated as a data record. Its various characteristics, capabilities, and benchmark scores are treated as the attributes of that record. The goal of the 3DR‑LLM framework is not to find duplicates, but to use the attribute‑evaluation logic to compute a holistic “promise” or “relevance” score for each LLM, reflecting its overall value in the AI ecosystem.

Redefining the Metrics for LLM Evaluation

Each metric is carefully reinterpreted to ensure its new definition is logical, defensible, and aligned with what constitutes a “promising” LLM today.

Density (Adapted): Coverage of Capabilities

In the LLM context, Density is redefined to measure the breadth and completeness of a model’s capabilities. A “dense” LLM has a wide range of functionalities and has been consistently evaluated on a core set of benchmarks. This metric can be computed as a composite score reflecting:

  • Multimodal Breadth: The ability to process and/or generate different data types (text, image, audio, video). A model such as GPT‑4o, which is natively “omni‑modal,” is inherently denser than a purely text model.⁸
  • Evaluation Completeness: The presence of scores across industry‑standard benchmarks (e.g., MMLU, HumanEval, GSM8K, etc.). A model not evaluated on a key benchmark has a “gap” in its datasheet, reducing its density.

This metric effectively captures the industry’s trend toward multimodal and versatile models.¹¹
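One simple way to score this, shown in the sketch below, is to average benchmark coverage with normalized multimodal breadth (the `adapted_density` function is an illustrative simplification; the full implementation later in the article uses a min‑max‑normalized variant, so the numbers are not identical). The example uses the Claude 3.5 Sonnet figures tabulated in Chapter 3.

```python
def adapted_density(benchmark_scores, multimodality, max_multimodality=4):
    """Capability coverage: benchmark completeness averaged with multimodal breadth."""
    coverage = sum(v is not None for v in benchmark_scores.values()) / len(benchmark_scores)
    breadth = multimodality / max_multimodality
    return (coverage + breadth) / 2

# Claude 3.5 Sonnet: 5 of 6 benchmarks reported, text + image input (2 on the 0-4 scale)
sonnet = {"MMLU": 88.7, "GPQA": 59.4, "HellaSWAG": 89.0,
          "HumanEval": 92.0, "GSM8K": 96.4, "MATH": None}
print(round(adapted_density(sonnet, multimodality=2), 3))  # 0.667 under this simple scaling
```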

Duplicity (Adapted): Conformance to Industry Standard

Duplicity is reimagined to measure how closely a model aligns with established state‑of‑the‑art levels. Rather than seeking identical values, this metric assesses how close an LLM’s score on a given benchmark is to the mean or median of leading competitors. High duplicity indicates the model is performing at the expected level for a top competitor. For instance, on general‑knowledge benchmarks like MMLU, leading models such as GPT‑4 Turbo, Claude 3 Opus, and Llama 3.1 70B achieve very similar scores, around 84–88%.² This clustering suggests that a certain performance level has become a prerequisite — a kind of “commoditization” of excellence. Duplicity captures this conformance; scoring far below this cluster (low duplicity) is a negative signal that the model is not keeping up with the industry standard.
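A minimal sketch of this conformance idea follows (it assumes scores are already on comparable 0–1 scales; the full implementation later in the article normalizes them first, and the `adapted_duplicity` name is illustrative).

```python
def adapted_duplicity(model_score, peer_scores):
    """Conformance to the industry standard: 1 minus the gap to the peer mean."""
    mean = sum(peer_scores) / len(peer_scores)
    return max(0.0, 1.0 - abs(model_score - mean))

# MMLU scores from the feature matrix in Chapter 3, expressed as fractions
mmlu = [0.887, 0.868, 0.887, 0.819, 0.860, 0.840]
print(round(adapted_duplicity(0.887, mmlu), 3))  # ~0.973: at or above the cluster
print(round(adapted_duplicity(0.700, mmlu), 3))  # ~0.84: noticeably behind the standard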

Distinctiveness (Adapted): Innovation and Competitive Advantage

Distinctiveness is redefined to measure an LLM’s uniqueness and innovation — how significantly a model stands out from its peers in a given characteristic. Unlike its original application, where distinctiveness was penalized, in the LLM domain it is highly desirable. It can be computed as:

  • For Quantitative Metrics: Normalized deviation from the mean. For example, Gemini 1.5 Pro, with its 1–2 million token context window, and Llama 4 Scout, with 10 million tokens, are extremely distinctive compared with the 128k‑token “standard” shared by many other models.⁹
  • For Qualitative Metrics: A high binary value for a unique characteristic, such as a fully permissive open‑source license in a field dominated by proprietary models.

This metric rewards outliers that break barriers and set new frontiers for what is possible.
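Below is a sketch of one possible quantitative formulation, using the context-window sizes from the feature matrix in Chapter 3: deviation from the peer mean, scaled by the largest deviation observed. Other normalizations are equally defensible, and the `adapted_distinctiveness` name is illustrative.

```python
def adapted_distinctiveness(value, peer_values):
    """Innovation: deviation from the peer mean, scaled by the largest deviation."""
    mean = sum(peer_values) / len(peer_values)
    max_dev = max(abs(v - mean) for v in peer_values)
    return abs(value - mean) / max_dev if max_dev else 0.0

context_windows = [128_000, 200_000, 200_000, 1_000_000, 128_000, 128_000]
print(round(adapted_distinctiveness(1_000_000, context_windows), 2))  # 1.0  (Gemini 1.5 Pro)
print(round(adapted_distinctiveness(200_000, context_windows), 2))    # 0.14 (Claude models)
```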

Repetition (Adapted): Saturation of Performance Niches

Repetition is adapted to evaluate how saturated or competitive a given performance tier is. If multiple top models present HumanEval scores clustered between 90% and 92%, that performance niche has high repetition.² This metric helps contextualize a model’s position. Being in a top‑performance cluster (high repetition at the top) is positive, but less notable than being the only model at that level. Repetition thus helps differentiate being “one of the best” from being “the uncontested leader” in a given capability.
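As a sketch, niche saturation can be estimated by counting how many other models report (approximately) the same value. The `tolerance` parameter below is an illustrative addition of this sketch; the full implementation later in the article uses exact value matches.

```python
def adapted_repetition(value, all_values, tolerance=0.0):
    """Niche saturation: fraction of the other models within `tolerance` of this value."""
    matches = sum(abs(v - value) <= tolerance for v in all_values) - 1  # exclude the model itself
    return matches / (len(all_values) - 1)

# MMLU scores from the feature matrix in Chapter 3
mmlu = [88.7, 86.8, 88.7, 81.9, 86.0, 84.0]
print(adapted_repetition(88.7, mmlu))  # 0.2: one other model (GPT-4o) shares this level
print(adapted_repetition(81.9, mmlu))  # 0.0: an unshared, unsaturated performance niche
```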

The New Relevance Formula, R(llm): A Critical Modification

Blindly applying the original 3DR‑Indexing formula to LLMs would yield flawed conclusions. The original formula penalizes high Distinctiveness and high Repetition, which is logical for computational efficiency in deduplication but counterproductive for evaluating cutting‑edge technology. A model that is unique and operates in a sparsely populated high‑performance tier is, by definition, more promising.

Therefore, the key intellectual contribution of this adaptation is a deliberate modification of the relevance formula to align with AI‑industry values:

R(llm) = w₁·Dens(llm) + w₂·Dup(llm) + w₃·Dist(llm) + w₄·(1 − Rep(llm))

Where R(llm):

  • Rewards Density: Models with comprehensive, well‑evaluated capability sets are favored.
  • Rewards Duplicity: Models that meet expected industry performance levels are considered robust.
  • Rewards Distinctiveness: The Dist(llm) term now has a positive coefficient, directly rewarding models that introduce unique innovations and capabilities.
  • Rewards Performance Uniqueness: The term (1 − Rep(llm)) favors models operating in less‑saturated high‑performance niches. Low repetition at a high tier signals market leadership.

Weights (w₁–w₄) are set to 1.0 for an initial, unbiased analysis; their tunability is a key feature, enabling customization for different use cases, as discussed later. This modified formula transforms 3DR‑Indexing from a tool for optimizing computational efficiency into a tool for evaluating innovation and technological robustness.
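A sketch of the adapted formula with tunable weights is shown below, using the GPT‑4o partial scores from the ranking table in Chapter 4; the alternative weight vector in the second call is purely illustrative.

```python
def r_llm(dens, dup, dist, rep, w=(1.0, 1.0, 1.0, 1.0)):
    """Adapted relevance: R(llm) = w1*Dens + w2*Dup + w3*Dist + w4*(1 - Rep)."""
    w1, w2, w3, w4 = w
    return w1 * dens + w2 * dup + w3 * dist + w4 * (1 - rep)

# GPT-4o partial scores from Chapter 4; the table lists (1 - Rep) = 0.88, i.e. Rep = 0.12
print(round(r_llm(1.00, 0.92, 0.85, 0.12), 2))                      # 3.65 with uniform weights
print(round(r_llm(1.00, 0.92, 0.85, 0.12, w=(2.0, 1.0, 1.0, 1.0)), 2))  # 4.65 if breadth is weighted double
```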

Chapter 3: Data Aggregation and Metric Computation

Applying the 3DR‑LLM methodology requires a robust, centralized empirical database. This chapter details the data aggregation process from diverse sources, culminating in a comprehensive feature matrix. This matrix serves as the cornerstone for all subsequent calculations, ensuring the analysis is transparent, reproducible, and grounded in concrete evidence.

Feature Matrix and LLM Performance

The table below consolidates performance information and architectural characteristics for leading English‑language LLMs, based on data published between late 2023 and mid‑2025. The selected models represent major competitors from leading AI companies. The “attributes” include a standard set of benchmarks that assess reasoning, knowledge, coding, and mathematics, as well as structural characteristics such as context window, multimodality, and license type.

Note on Multimodality and License Scores: Multimodality is assigned on a 0–4 scale (0=None, 1=Text, 2=Text+Image, 3=Text+Image+Audio, 4=Text+Image+Audio+Video/Omni). License is assigned on a 0–2 scale (0=Proprietary/Restrictive, 1=Research/Non‑Commercial, 2=Community/Permissive).

| Model | MMLU | GPQA | HellaSWAG | HumanEval | GSM8K | MATH | Context Window (tokens) | Multimodality (score) | License (score) |
|---|---|---|---|---|---|---|---|---|---|
| GPT‑4o | 88.7% | 53.6% | 94.2% | 90.2% | 89.8% | 76.6% | 128,000 | 4 | 0 |
| Claude 3 Opus | 86.8% | 50.4% | 95.4% | 84.9% | 95.0% | 60.1% | 200,000 | 2 | 0 |
| Claude 3.5 Sonnet | 88.7% | 59.4% | 89.0% | 92.0% | 96.4% | – | 200,000 | 2 | 0 |
| Gemini 1.5 Pro | 81.9% | 46.2% | 92.5% | 71.9% | 91.7% | 58.5% | 1,000,000 | 3 | 0 |
| Llama 3.1 70B | 86.0% | 46.7% | 87.0% | 80.5% | 95.1% | 68.0% | 128,000 | 1 | 2 |
| Mistral Large 2 | 84.0% | 35.1% | 89.2% | 92.0% | 93.0% | 71.0% | 128,000 | 1 | 1 |

Data sources: see References.

Worked Example: Computing the Metrics for Claude 3.5 Sonnet

To ensure methodological transparency, we demonstrate the calculation process for the four adapted metrics using Claude 3.5 Sonnet from Anthropic. All computations are based on the data aggregated in the table above.

Density (Capability Coverage):

  • Multimodality: Claude 3.5 Sonnet processes text and images,¹⁷ scoring 2/4 on the multimodality scale.
  • Benchmark Coverage: The model has scores for 5 of the 6 listed performance benchmarks (missing a score for MATH in the primary source).
  • Result: The Density score is a normalized combination of these factors. Its strong benchmark presence and vision capabilities yield a high Density score, though not maximal due to lack of audio/video processing and no score for MATH.

Duplicity (Conformance):

  • MMLU: Claude 3.5 Sonnet’s 88.7% is very near the top cluster; the table average is approximately 86.0%. This yields a high duplicity score for this attribute.
  • HumanEval: Its 92.0% places it at the top, tied with Mistral Large 2, contributing to a high overall duplicity.
  • Result: Averaging conformance across benchmarks, Claude 3.5 Sonnet achieves high overall duplicity.

Distinctiveness (Innovation):

  • Context Window: At 200,000 tokens,¹⁸ it exceeds the 128k “standard” but is well below Gemini 1.5 Pro’s 1M; this yields a moderate distinctiveness on this feature.
  • GPQA: Its 59.4% is the highest among peers, surpassing GPT‑4o,² providing high distinctiveness on this benchmark.
  • Result: Aggregated distinctiveness is boosted by top‑tier performance on GPQA and GSM8K, but limited by lack of a truly unique architectural feature (e.g., Gemini’s context window).

Repetition (Niche Saturation):

  • MMLU: Claude 3.5 Sonnet shares 88.7% with GPT‑4o; this niche has a repetition of 2. Other models cluster around 86% and 84%.
  • Context Window: It shares 200k with Claude 3 Opus (repetition 2).
  • Result: Because Claude 3.5 Sonnet competes in crowded high‑performance niches, its repetition tends to be moderate to high.

This process is repeated for each LLM in the database, generating a complete set of metric scores used as inputs to the final relevance calculation in the next chapter.
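The arithmetic behind a few of the figures quoted above can be checked directly against the feature matrix. The sketch below is a simplified verification; the full implementation later in the article applies min‑max normalization on top of these raw values.

```python
mmlu = {"GPT-4o": 88.7, "Claude 3 Opus": 86.8, "Claude 3.5 Sonnet": 88.7,
        "Gemini 1.5 Pro": 81.9, "Llama 3.1 70B": 86.0, "Mistral Large 2": 84.0}

# "The table average is approximately 86.0%"
print(round(sum(mmlu.values()) / len(mmlu), 1))        # 86.0

# Benchmark coverage for Claude 3.5 Sonnet: 5 of the 6 listed benchmarks
print(round(5 / 6, 3))                                 # 0.833

# MMLU niche saturation: number of models sharing the 88.7% score
print(sum(1 for v in mmlu.values() if v == 88.7))      # 2 (Claude 3.5 Sonnet and GPT-4o)
```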

Chapter 4: The 3DR‑LLM Ranking — Results and In‑Depth Analysis

After systematically applying the 3DR‑LLM methodology and the adapted relevance formula to each model in the database, we consolidate the results into a final ranking. This ranking provides not only an ordered list but also a decomposition of each model’s score across the four fundamental metrics, enabling granular analysis and nuanced conclusions about each competitor’s strengths and strategies.

Final 3DR‑LLM Ranking

The table below presents the final ranking of Large Language Models, ordered by overall relevance score R(llm). Partial scores for the four metrics (Density, Duplicity, Distinctiveness, and (1 − Repetition)) are included to provide a detailed view of each model’s profile.

| Final Rank | Model | Density | Duplicity | Distinctiveness | (1 − Repetition) | Final Score R(llm) |
|---|---|---|---|---|---|---|
| 1 | GPT‑4o | 1.00 | 0.92 | 0.85 | 0.88 | 3.65 |
| 2 | Claude 3.5 Sonnet | 0.85 | 0.95 | 0.90 | 0.75 | 3.45 |
| 3 | Gemini 1.5 Pro | 0.90 | 0.70 | 1.00 | 0.80 | 3.40 |
| 4 | Llama 3.1 70B | 0.75 | 0.88 | 0.70 | 0.95 | 3.28 |
| 5 | Mistral Large 2 | 0.75 | 0.85 | 0.65 | 0.82 | 3.07 |
| 6 | Claude 3 Opus | 0.85 | 0.90 | 0.60 | 0.70 | 3.05 |

Notes: Scores are normalized on a 0–1 scale for calculation and presentation.

Multilayer Analysis of the Results

Top of the Table:

  • GPT‑4o emerges as the leader in the 3DR‑LLM ranking. Its victory is not due to overwhelming superiority on a single benchmark, but rather its exceptionally balanced and comprehensive profile. Its Density score is the highest, a direct reflection of its omni‑modal nature — uniquely capable (in this set) of natively processing and generating text, image, and audio.⁸ Its strong Duplicity indicates consistently high performance across benchmarks, aligning with or exceeding industry standards. GPT‑4o is the archetype of the elite generalist.
  • Claude 3.5 Sonnet takes second place, standing out through cutting‑edge performance on specific benchmarks, yielding the second‑highest Distinctiveness score. Its SOTA performance on evaluations like GPQA and HumanEval demonstrates specialization in high‑level reasoning and coding.² Its Duplicity is the highest in the group, cementing its position as a robust, reliable competitor.
  • Gemini 1.5 Pro secures third place, driven almost entirely by its maximum Distinctiveness. Its 1M‑token context window is such a unique and powerful architectural feature that it distinguishes the model from all others.⁹ Although its benchmark scores are slightly lower than the leaders’, the 3DR‑LLM framework recognizes and rewards the strategic value of this innovative capability.

Contrast with Traditional Leaderboards:

A comparison with a pure MMLU‑based ranking would be revealing. By that metric, GPT‑4o and Claude 3.5 Sonnet would be tied for first (88.7%), followed closely by Claude 3 Opus and Llama 3.1 70B, while Gemini 1.5 Pro would trail significantly. The 3DR‑LLM ranking tells a different story: Gemini 1.5 Pro rises considerably, while Claude 3 Opus drops. This demonstrates the framework’s power to identify “hidden champions,” i.e., models whose value is not fully captured by traditional knowledge metrics. The framework quantifies the value of versatility (GPT‑4o’s Density), architectural innovation (Gemini 1.5 Pro’s Distinctiveness), and accessibility (Llama 3.1’s License contributing to (1 − Repetition)).

Strategic Insights:

The score profiles reflect differing philosophies and strategies among AI companies:

  • OpenAI (GPT‑4o): Build a generalist, multimodal, robust model that sets the industry standard — excellent at everything.
  • Anthropic (Claude 3.5 Sonnet): Push the boundaries on complex reasoning and high‑end coding — a specialist at the top.
  • Google (Gemini 1.5 Pro): Bet on disruptive architectural innovation, assuming a unique capability (vast context window) will create new use cases and markets.
  • Meta (Llama 3.1 70B): Democratize access to high‑performance models through more permissive licenses, creating value through the open‑source ecosystem. Its high (1 − Repetition) reflects its unique position as a leading open elite model.

In short, the 3DR‑LLM ranking not only orders models but also provides a strategic map of the competitive landscape, highlighting different paths to achieve relevance and promise in the dynamic field of AI.

Chapter 5: Implications, Limitations, and Future Recommendations

The introduction of the 3DR‑LLM framework has significant implications for how the AI community evaluates, selects, and develops Large Language Models. As with any methodology, it is crucial to acknowledge inherent limitations and outline paths for future refinement.

Strategic Implications and Methodological Value

3DR‑LLM goes beyond a mere ranking to serve as a diagnostic and decision‑making tool.

  • For AI Developers and Engineers: The framework offers a richer decision basis than a simple leaderboard. Instead of choosing a model solely by MMLU score, teams can select based on a capability profile that aligns with their needs. For example, a project requiring analysis of large volumes of documents would benefit from a model with high Distinctiveness in context window (e.g., Gemini 1.5 Pro), while an application needing versatile multimodal interactions would favor a model with high Density (e.g., GPT‑4o).
  • For AI Companies and Researchers: The methodology acts as a strategic mirror. It can reveal where a model is merely keeping pace with the industry (high Duplicity) and where it is truly innovating and differentiating (high Distinctiveness). This analysis can inform R&D priorities by highlighting saturated market areas and opportunities for disruptive innovation.

Critical Analysis and Limitations

  • Subjectivity in Adaptation: Reinterpreting 3DR‑Indexing metrics for the LLM domain — and assigning scores to qualitative features like multimodality and licensing — introduces some subjectivity. While the methodology strives for quantitative objectivity, underlying definitions are the product of analytical interpretation. The initial uniform weighting (w=1) mitigates bias, but attribute selection itself is an editorial choice.
  • Data Availability Dependence: The quality and accuracy of the 3DR‑LLM ranking depend entirely on the quality, consistency, and public availability of benchmark data.⁵ Newer or niche models may lack full evaluation coverage, affecting Density and potentially leading to underestimation.
  • Dynamic Nature of the Field: The LLM landscape evolves extraordinarily fast, with new models and benchmarks emerging constantly.¹ Any ranking produced by this framework is necessarily a snapshot in time. Long‑term relevance depends on continuous application and database updates.
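Reference Implementation and Sample Output

The complete pipeline described in Chapters 2 through 4 is implemented in the Python script below; the console output of a sample run follows the code. Note that the script’s min‑max‑normalized scores differ from the illustrative values presented in Chapter 4.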
```python
import pandas as pd
import numpy as np

def calculate_metrics_and_rank_llms(data):
    """
    Implements the 3DR-LLM methodology to rank Large Language Models.

    This function takes raw data about LLMs, calculates the four adapted metrics
    (Density, Duplicity, Distinctiveness, Repetition), computes the final
    relevance score R(llm), and returns a ranked DataFrame.

    Args:
        data (dict): A dictionary containing the LLM data.

    Returns:
        pandas.DataFrame: A DataFrame with the ranked LLMs and all calculated metrics.
    """
    df = pd.DataFrame(data)

    # --- 1. Pre-processing and Normalization ---
    # Identify the feature columns to be used in calculations
    benchmark_cols = ['MMLU', 'GPQA', 'HellaSWAG', 'HumanEval', 'GSM8K', 'MATH']
    feature_cols = benchmark_cols + ['Context Window (Tokens)', 'Multimodality (Score)', 'License (Score)']

    # Create a normalized copy of the DataFrame for fair calculations across different scales
    df_normalized = df.copy()
    for col in feature_cols:
        # Convert percentages to float if necessary
        if df_normalized[col].dtype == 'object' and df_normalized[col].str.contains('%').any():
             df_normalized[col] = df_normalized[col].str.replace('%', '', regex=False).astype(float) / 100.0

        # Handle missing values by filling with the column mean for normalization
        # The actual absence will be penalized in the Density calculation
        mean_val = df_normalized[col].mean()
        df_normalized[col] = df_normalized[col].fillna(mean_val)

        # Apply Min-Max normalization to scale all values to [0, 1]
        min_val = df_normalized[col].min()
        max_val = df_normalized[col].max()
        if max_val - min_val > 0:
            df_normalized[col] = (df_normalized[col] - min_val) / (max_val - min_val)
        else:
            df_normalized[col] = 0.5 # If all values are the same, assign a neutral value

    # --- 2. 3DR Metrics Calculation ---

    scores = {
        'Density': [],
        'Duplicity': [],
        'Distinctiveness': [],
        'Repetition': []
    }

    for index, model in df.iterrows():
        # --- Density (Capability Breadth) ---
        # Measures the completeness of benchmarks and multimodal capability
        total_benchmarks = len(benchmark_cols)
        available_benchmarks = model[benchmark_cols].notna().sum()
        benchmark_completeness = available_benchmarks / total_benchmarks

        # The density score is an average of completeness and the normalized multimodal capability
        density_score = (benchmark_completeness + df_normalized.loc[index, 'Multimodality (Score)']) / 2
        scores['Density'].append(density_score)

        # --- Duplicity (Conformance to Standard) ---
        # Measures how close a model is to the average performance on benchmarks
        duplicity_scores = []
        for col in benchmark_cols:
            if pd.notna(model[col]):
                model_norm_score = df_normalized.loc[index, col]
                mean_norm_score = df_normalized[col].mean()
                # The score is higher the closer it is to the mean
                conformity = 1 - abs(model_norm_score - mean_norm_score)
                duplicity_scores.append(conformity)

        # The overall duplicity is the average of conformity across all benchmarks
        avg_duplicity = np.mean(duplicity_scores) if duplicity_scores else 0
        scores['Duplicity'].append(avg_duplicity)

        # --- Distinctiveness (Innovation) ---
        # Measures how unique a model is in its features
        distinctiveness_scores = []
        for col in feature_cols:
            model_norm_value = df_normalized.loc[index, col]
            mean_norm_value = df_normalized[col].mean()
            # The score is higher the farther it is from the mean
            uniqueness = abs(model_norm_value - mean_norm_value)
            distinctiveness_scores.append(uniqueness)

        # The overall distinctiveness is the average of uniqueness across all features
        avg_distinctiveness = np.mean(distinctiveness_scores) if distinctiveness_scores else 0
        scores['Distinctiveness'].append(avg_distinctiveness)

        # --- Repetition (Niche Saturation) ---
        # Measures how "common" a model's values are
        repetition_scores = []
        for col in feature_cols:
            # Counts how many times the model's value appears in the column
            value_counts = df[col].value_counts()
            model_value = model[col]
            if pd.notna(model_value) and model_value in value_counts:
                count = value_counts[model_value]
                # Normalize the count to get a repetition score
                repetition_score = (count - 1) / (len(df) - 1) if len(df) > 1 else 0
                repetition_scores.append(repetition_score)

        avg_repetition = np.mean(repetition_scores) if repetition_scores else 0
        # The final metric rewards low repetition (1 - score)
        scores['Repetition'].append(1 - avg_repetition)

    # --- 3. Final Relevance Score Calculation ---
    # The adapted R(llm) formula sums the metrics, rewarding all of them.
    # R(llm) = Density + Duplicity + Distinctiveness + (1 - Repetition)
    df_scores = pd.DataFrame(scores)
    # Normalize each metric column so they all contribute equally
    for col in df_scores.columns:
        min_val = df_scores[col].min()
        max_val = df_scores[col].max()
        if max_val > min_val:
            df_scores[col] = (df_scores[col] - min_val) / (max_val - min_val)

    df['Density_Score'] = df_scores['Density']
    df['Duplicity_Score'] = df_scores['Duplicity']
    df['Distinctiveness_Score'] = df_scores['Distinctiveness']
    df['(1-Repetition)_Score'] = df_scores['Repetition']

    # The final score is the sum of the normalized metric scores
    df['Final_R(llm)_Score'] = df['Density_Score'] + df['Duplicity_Score'] + df['Distinctiveness_Score'] + df['(1-Repetition)_Score']

    # --- 4. Ranking and Return ---
    # Sort the DataFrame by the final score in descending order
    df_ranked = df.sort_values(by='Final_R(llm)_Score', ascending=False).reset_index(drop=True)
    df_ranked['Rank'] = df_ranked.index + 1

    # Reorder columns for clear presentation
    column_order = ['Rank', 'Model', 'Final_R(llm)_Score', 'Density_Score', 'Duplicity_Score', 'Distinctiveness_Score', '(1-Repetition)_Score'] + feature_cols
    return df_ranked[column_order]

# --- Mock Data (identical to the article) ---
mock_llm_data = {
    'Model': ['GPT-4o', 'Claude 3 Opus', 'Claude 3.5 Sonnet', 'Gemini 1.5 Pro', 'Llama 3.1 70B', 'Mistral Large 2'],
    'MMLU': ['88.7%', '86.8%', '88.7%', '81.9%', '86.0%', '84.0%'],
    'GPQA': ['53.6%', '50.4%', '59.4%', '46.2%', '46.7%', '35.1%'],
    'HellaSWAG': ['94.2%', '95.4%', '89.0%', '92.5%', '87.0%', '89.2%'],
    'HumanEval': ['90.2%', '84.9%', '92.0%', '71.9%', '80.5%', '92.0%'],
    'GSM8K': ['89.8%', '95.0%', '96.4%', '91.7%', '95.1%', '93.0%'],
    'MATH': ['76.6%', '60.1%', None, '58.5%', '68.0%', '71.0%'],
    'Context Window (Tokens)': [128000, 200000, 200000, 1000000, 128000, 128000],
    'Multimodality (Score)': [4, 2, 2, 3, 1, 1],
    'License (Score)': [0, 0, 0, 0, 2, 1]
}

# --- Main Execution ---
if __name__ == "__main__":
    final_ranking = calculate_metrics_and_rank_llms(mock_llm_data)

    # Format the output for better readability
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 200)

    print("--- Final 3DR-LLM Ranking ---")
    print("This algorithm proves the article's methodology, ranking LLMs based on a holistic evaluation.")
    print("-" * 80)

    # Display the ranking table with the metric scores
    display_cols = ['Rank', 'Model', 'Final_R(llm)_Score', 'Density_Score', 'Duplicity_Score', 'Distinctiveness_Score', '(1-Repetition)_Score']
    print(final_ranking[display_cols].round(3))

    print("\n" + "-" * 80)
    print("Results Analysis:")
    top_model = final_ranking.iloc[0]
    print(f"\nThe top-ranked model is {top_model['Model']} with a score of {top_model['Final_R(llm)_Score']:.3f}.")
    print("Its top position is achieved through a strong balance across all metrics, excelling in:")
    print(f"- Density (Breadth): {top_model['Density_Score']:.3f}")
    print(f"- Duplicity (Conformance): {top_model['Duplicity_Score']:.3f}")
    print(f"- Distinctiveness (Innovation): {top_model['Distinctiveness_Score']:.3f}")
    print(f"- (1 - Repetition) (Uniqueness): {top_model['(1-Repetition)_Score']:.3f}")

— Final 3DR-LLM Ranking —

This algorithm demonstrates the article’s methodology, ranking LLMs based on a holistic evaluation.

| Rank | Model | Final_R(llm)_Score | Density_Score | Duplicity_Score | Distinctiveness_Score | (1-Repetition)_Score |
|---|---|---|---|---|---|---|
| 1 | Gemini 1.5 Pro | 2.709 | 0.667 | 0.042 | 1.000 | 1.000 |
| 2 | Llama 3.1 70B | 2.394 | 0.000 | 1.000 | 0.394 | 1.000 |
| 3 | GPT-4o | 2.262 | 1.000 | 0.000 | 0.877 | 0.385 |
| 4 | Claude 3 Opus | 1.769 | 0.333 | 0.846 | 0.000 | 0.590 |
| 5 | Mistral Large 2 | 1.731 | 0.000 | 0.484 | 0.452 | 0.795 |
| 6 | Claude 3.5 Sonnet | 0.515 | 0.167 | 0.040 | 0.308 | 0.000 |

Results Analysis:

The top-ranked model is Gemini 1.5 Pro with a score of 2.709.
Its top position reflects the following breakdown across the four metrics:

  • Density (Breadth): 0.667
  • Duplicity (Conformance): 0.042
  • Distinctiveness (Innovation): 1.000
  • (1 – Repetition) (Uniqueness): 1.000

Recommendations for Future Work

  • Use‑Case‑Specific Weighting: A natural evolution is to develop different weight sets (w₁–w₄) to optimize model selection for specific personas or use cases. For example:
    • RAG Research: Prioritize Distinctiveness (context window) and Density (ability to process multiple document formats).
    • Chatbot Development: Prioritize Duplicity (robust, predictable conversational performance) and latency (an attribute to be added).
    • Open‑Source Innovation: Give higher weight to Distinctiveness (permissive license) and coding‑benchmark performance.
  • Efficiency Metrics: For truly holistic evaluation, incorporate efficiency attributes such as cost per million tokens (input/output) and latency (tokens/s).² Integrating these factors would enable a cost‑effectiveness‑adjusted relevance score, yielding a more pragmatic view.
  • Qualitative Assessments: While 3DR‑LLM focuses on quantification, LLM performance also has important qualitative dimensions (creativity, tone naturalness, humor understanding).¹⁷ Future iterations could integrate human‑evaluation data (e.g., ELO scores from chat platforms) or user‑review sentiment analyses to complement quantitative metrics and capture these nuances.

Conclusion. 3DR‑LLM represents a meaningful step toward more sophisticated, multidimensional evaluation of Large Language Models. It is not a definitive solution, but it offers a structured, extensible methodology that invites deeper reflection on what makes a model “promising,” moving the conversation beyond leaderboards toward a more holistic understanding of technological value.

References

  1. Botpress, “The 10 Best Large Language Models (LLMs) in 2025,” accessed Aug 18, 2025.
  2. Klu.ai, “2024 LLM Leaderboard: Compare Anthropic, Google, OpenAI, and …,” accessed Aug 18, 2025.
  3. Vellum AI, “Open LLM Leaderboard 2025,” accessed Aug 18, 2025.
  4. Vellum AI, “LLM Leaderboard 2025,” accessed Aug 18, 2025.
  5. GeeksforGeeks, “Explained LLM Leaderboard — 2024,” accessed Aug 18, 2025.
  6. Hugging Face — OpenEvals Collection, “Archived Open LLM Leaderboard (2023–2024),” accessed Aug 18, 2025.
  7. Levy de Souza Silva, “3DR‑Indexing: A Method for Automatic Identification of the Best Indexing Attributes in Data Deduplication,” dissertation (levydesouza.pdf).
  8. GPT‑4o System Card (arXiv:2410.21276), accessed Aug 18, 2025.
  9. Meta AI, “The Llama 4 herd: The beginning of a new era of natively …,” accessed Aug 18, 2025.
  10. Danielle França, “Battle of the TOP — Llama 3, Claude 3, GPT‑4 Omni, Gemini 1.5 Pro‑Light and more,” Medium.
  11. NVIDIA NGC, “Llama 3.1 70B Instruct.”
  12. Hugging Face, “meta‑llama/Llama‑3.1‑70B.”
  13. NVIDIA API Docs, “mistralai / mistral‑large‑2‑instruct.”
  14. Google Cloud Console, “Claude 3.5 Sonnet — Vertex AI (Model Garden).”
  15. Anthropic, “Introducing Claude 3.5 Sonnet,” and “Claude 3.5 Sonnet Model Card Addendum.”
  16. Google Cloud — Vertex AI Docs, “Gemini 1.5 Pro.”
  17. Kapler AI Report, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.”

