Root Cause Analysis: OutOfMemoryError in Forked JVM Due to TestNG SuiteRunner Memory Retention



This content originally appeared on DEV Community and was authored by gaurbprajapati

🛠 JVM Crash During TestNG Suite Execution – Root Cause & Fix

Running large-scale UI automation suites can be tricky, especially when TestNG and Maven Surefire are involved. Recently, a JVM crash during execution took down our entire test suite. We dug into heap dumps, GC logs, and TestNG internals; here is the full Root Cause Analysis (RCA) and how we solved it.

⚡ The Error

[ERROR] org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: 
The forked VM terminated without properly saying goodbye. 
VM crash or System.exit called?

At first glance, this looks like a random JVM crash. But the heap dump revealed ~6GB of memory retained by org.testng.SuiteRunner, holding thousands of TestRunner instances — each keeping entire test classes, WebDrivers, and PageObjects alive.

🔍 What is a Forked JVM?

Maven Surefire runs tests in a forked JVM — a separate Java process.

Why?

  • Isolates tests from the main build
  • Allows custom JVM args (-Xmx, GC options, heap dump on OOM, etc.)
  • Enables parallel test execution

Flow:

  1. Maven spawns a new JVM
  2. Tests run inside this forked process
  3. JVM args are applied via <argLine> in pom.xml
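
To confirm that the <argLine> settings actually reached the forked process, it can help to print the JVM's startup arguments and maximum heap from inside the fork. The sketch below uses only the standard java.lang.management API. The class name ForkedJvmInfo and its methods are illustrative assumptions, not part of Surefire or TestNG.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not part of Surefire or TestNG): reports how the current JVM was started.
final class ForkedJvmInfo {

    // Arguments passed to this JVM, e.g. the values supplied through <argLine>.
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Upper bound on heap size in megabytes (reflects -Xmx when it is set).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // True if any startup argument begins with the given prefix, e.g. "-Xmx".
    static boolean hasArgument(String prefix) {
        return jvmArguments().stream().anyMatch(arg -> arg.startsWith(prefix));
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: " + jvmArguments());
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("Heap dump on OOM enabled: "
                + hasArgument("-XX:+HeapDumpOnOutOfMemoryError"));
    }
}
```

Logging the same values from a @BeforeSuite method would show whether the fork picked up the intended heap size and heap-dump flags.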

🔎 TestNG’s SuiteRunner Explained

org.testng.SuiteRunner is the heart of TestNG suite execution.

Responsibilities:

  • Parse testng.xml
  • Manage <test> blocks via TestRunner
  • Track all test classes & methods executed
  • Aggregate results (pass/fail/skip)
  • Feed data to reporters/listeners

Structure:

SuiteRunner
   └── List<TestRunner>
           └── Test Class Instance
                 ├── WebDriver
                 ├── Page Objects
                 ├── Test Data
                 └── Utilities

💾 Why Memory Leaks Happen

  • SuiteRunner → keeps strong refs to all TestRunners
  • TestRunner → holds ITestContext, ITestResult, and test class instance
  • Test Class → holds WebDriver, Page Objects, Data Models

👉 Until the suite ends, nothing reachable from SuiteRunner is eligible for garbage collection.

Result:

  • 17,553 TestRunner objects alive
  • Selenium WebDriver objects + DOM snapshots consume huge memory
  • GC can’t reclaim → JVM crashes
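
The retention chain described above can be modeled with a few plain classes. This is a simplified sketch: FakeSuiteRunner, FakeTestRunner, and FakeTestClass are stand-ins invented for illustration, not TestNG types, and a 1 MB byte array plays the role of the driver and page objects. It shows that the payload stays reachable while the test instance holds it, and that clearing the field releases the payload even though the suite still references the instance.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real types; none of these are TestNG classes.
final class RetentionDemo {

    // Plays the role of a test class holding heavy state (driver, page objects, data).
    static final class FakeTestClass {
        byte[] heavyState = new byte[1024 * 1024]; // 1 MB placeholder

        void releaseHeavyState() {
            heavyState = null; // the instance stays reachable, its payload does not
        }
    }

    // Plays the role of TestRunner: strongly references its test class instance.
    static final class FakeTestRunner {
        final FakeTestClass testInstance = new FakeTestClass();
    }

    // Plays the role of SuiteRunner: keeps every runner until the suite ends.
    static final class FakeSuiteRunner {
        final List<FakeTestRunner> runners = new ArrayList<>();

        // Sums the payload bytes still reachable through the suite.
        long reachablePayloadBytes() {
            long total = 0;
            for (FakeTestRunner runner : runners) {
                byte[] state = runner.testInstance.heavyState;
                if (state != null) {
                    total += state.length;
                }
            }
            return total;
        }
    }

    public static void main(String[] args) {
        FakeSuiteRunner suite = new FakeSuiteRunner();
        for (int i = 0; i < 10; i++) {
            suite.runners.add(new FakeTestRunner());
        }
        System.out.println("Reachable before cleanup: " + suite.reachablePayloadBytes());

        for (FakeTestRunner runner : suite.runners) {
            runner.testInstance.releaseHeavyState();
        }
        System.out.println("Reachable after cleanup: " + suite.reachablePayloadBytes());
    }
}
```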

📉 RCA Summary

Factor      | Detail
Error       | Forked VM crash (Surefire goodbye error)
Cause       | JVM ran out of memory due to retained references in SuiteRunner
Trigger     | Large number of tests in a single suite
Leak Source | Strong references: SuiteRunner → TestRunner → Test Class
GC Impact   | Objects never eligible for GC until JVM exits
Result      | Heap bloat, OutOfMemoryError, JVM crash

🛠 Fixes & Mitigation

1. Move Heavy Fields to Method Scope

Instead of keeping page objects at class level:

// ❌ Problematic
ExternalJobPage externalJobPage;

@BeforeMethod
public void setup() {
    externalJobPage = new ExternalJobPage(getDriver());
}

Use method-level objects:

// ✅ GC-friendly
@Test
public void testSomething() {
    ExternalJobPage page = new ExternalJobPage(getDriver());
    page.verifyJobDetails();
}

2. Nullify References in Cleanup Hooks

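@AfterMethod
public void clean() {
    // Quit the driver before this point; nulling the field does not close the browser.
    driver = null;
    pageObject = null;
    System.gc(); // Only a hint; the JVM may ignore it
}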

3. Aggressive Field Nullification (Final Solution)

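// Requires: import java.lang.reflect.Field; import java.lang.reflect.Modifier;
public void tearDown() {
    // getDeclaredFields() covers only the runtime class, not its superclasses.
    for (Field field : this.getClass().getDeclaredFields()) {
        // Skip AspectJ-generated fields, primitives (cannot hold null), and statics.
        if (field.getName().startsWith("ajc$")
                || field.getType().isPrimitive()
                || Modifier.isStatic(field.getModifiers())) {
            continue;
        }
        try {
            field.setAccessible(true);
            field.set(this, null);
        } catch (Exception e) {
            // Keep going so one inaccessible field does not block the rest.
            log.error("Failed to null field {}: {}", field.getName(), e.getMessage());
        }
    }
    log.info("Cleaned up instance for class {}", this.getClass().getName());

    System.gc(); // Only a hint; the JVM may ignore it
}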

And ensure cleanup of test data:

@AfterTest
public void clearTestData() {
    try {
        if (TestDataContext.globalTestDataMapSize() > 1) {
            TestDataContext.clearData(testCasePath);
        }
    } catch (Exception e) {
        log.error("Exception while clearing test data: {}", e.getMessage());
    } finally {
        // Null out instance fields even if clearing the test data failed.
        tearDown();
    }
}
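
One limitation of the reflective tearDown() in this step is that getDeclaredFields() only returns fields declared on the runtime class, so fields inherited from a base test class are left untouched. A possible variant, sketched below under that assumption, walks up the class hierarchy. FieldCleaner, BaseTest, and JobTest are hypothetical names used only for this demonstration; they are not from the original framework.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical utility; the name FieldCleaner is not from the original framework.
final class FieldCleaner {

    // Nulls every non-static, non-primitive, non-synthetic instance field declared on the
    // target's class and its superclasses; returns how many fields were cleared.
    static int clearInstanceFields(Object target) {
        int cleared = 0;
        for (Class<?> type = target.getClass();
             type != null && type != Object.class;
             type = type.getSuperclass()) {
            for (Field field : type.getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers())
                        || field.getType().isPrimitive()
                        || field.isSynthetic()
                        || field.getName().startsWith("ajc$")) {
                    continue;
                }
                try {
                    field.setAccessible(true);
                    if (field.get(target) != null) {
                        field.set(target, null);
                        cleared++;
                    }
                } catch (ReflectiveOperationException | RuntimeException e) {
                    // Skip fields that cannot be modified and keep going.
                }
            }
        }
        return cleared;
    }

    // Small demo types standing in for a base test class and a concrete test.
    static class BaseTest {
        static final String SHARED = "kept";   // static: must survive cleanup
        protected Object driver = new Object();
    }

    static class JobTest extends BaseTest {
        Object page = new Object();
        int retries = 3;                        // primitive: must survive cleanup
    }

    public static void main(String[] args) {
        JobTest test = new JobTest();
        System.out.println("Fields cleared: " + clearInstanceFields(test));
    }
}
```

A base class could call FieldCleaner.clearInstanceFields(this) from its cleanup hook. Statics and primitives are skipped for the same reasons as in the original method.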

4. Split Large Suites

  • Don’t run thousands of tests in one suite
  • Break into smaller testng.xml files
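
As a rough illustration of the splitting step, the helper below groups a list of test class names into fixed-size batches, each of which could become its own testng.xml file. SuiteSplitter is a hypothetical name, the class names are placeholders, and the batch size is an arbitrary choice; nothing here is a TestNG API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for planning smaller suites; not a TestNG API.
final class SuiteSplitter {

    // Splits the class names into consecutive batches of at most batchSize entries.
    static List<List<String>> partition(List<String> testClasses, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < testClasses.size(); start += batchSize) {
            int end = Math.min(start + batchSize, testClasses.size());
            batches.add(new ArrayList<>(testClasses.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Placeholder class names for illustration only.
        List<String> classes = List.of(
                "com.example.LoginTest", "com.example.SearchTest",
                "com.example.JobTest", "com.example.ProfileTest",
                "com.example.ReportTest");
        List<List<String>> batches = partition(classes, 2);
        for (int i = 0; i < batches.size(); i++) {
            System.out.println("suite-" + (i + 1) + ": " + batches.get(i));
        }
    }
}
```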

5. Upgrade Tooling

  • Upgrade Maven Surefire from 2.19.1 (the version in the error above) to 3.1.2+ (better fork handling)
  • Use TestNG 7.x+ (memory fixes included)

6. Explore Dependency Injection (POC Needed)

Using DI (like Guice or Spring) could give test objects controlled lifecycles, so they are created and released at defined points. We have not validated this yet.

✅ Key Takeaways

  • SuiteRunner holds everything until suite ends — design your framework to release memory early.
  • Avoid class-level heavy fields — use method scope.
  • Nullify aggressively in @AfterMethod / @AfterTest.
  • Split test suites — don’t overload a single JVM.
  • Upgrade Surefire + TestNG — newer versions manage memory better.

With these changes, our suite stopped crashing and memory usage dropped drastically. 🚀

💡 If you’re running large-scale TestNG suites with Selenium, check your heap dump once in a while. You might be surprised how much SuiteRunner is holding on to.

