Revival Hijacking: How Deleted PyPI Packages Become Threats



This content originally appeared on DEV Community and was authored by Dmitry Protsenko

You can find the original version of this article on my website, protsenko.dev. 

Hi guys, recently I continued researching malicious dependencies and dependency confusion.

This article is inspired by an article from 2024 and again raises concerns about the safety of downloading packages from public repositories without additional controls.

PyPI.org allows authors to remove their packages, after which the package names become available for registration by anyone. By registering the names of formerly popular packages, attackers could gain many downloads of malicious libraries.

This technique is called revival hijacking. In this article, I look for popular removed packages and exploit this vector to see how many downloads stub packages receive.

Information about deleted and revived packages can be found in the dedicated GitHub project.

It is updated on a daily schedule to provide fresh lists of these packages and is a result of work on this article.

Harvesting the data

How many packages were deleted from PyPI.org? To answer that question, we need to find out what package names are available currently and what was removed, analyze how many deleted packages were downloaded, and identify who depends on them.

You can use the official index API to harvest the package names that are currently available. The API returns all package names and supports more than one response format; I consumed it as JSON for further processing. This method doesn’t guarantee 100 percent valid data, but every “removed” package can be checked afterwards for presence in the public registry.
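As a minimal sketch of this step, the snippet below requests the PyPI Simple index in its JSON form and collects the project names. The helper names (normalize, parse_index, fetch_current_names) are my own and not part of any library, and normalizing names the PEP 503 way is an assumption on my side so that later comparisons between data sources are not thrown off by case or separators.

```python
import json
import re
import urllib.request

SIMPLE_INDEX_URL = "https://pypi.org/simple/"
# Media type that asks the Simple API for JSON instead of HTML.
JSON_ACCEPT = "application/vnd.pypi.simple.v1+json"


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_index(payload: str) -> set[str]:
    """Extract normalized project names from a Simple API JSON document."""
    data = json.loads(payload)
    return {normalize(project["name"]) for project in data["projects"]}


def fetch_current_names() -> set[str]:
    """Download the full index (a large response) and return the names."""
    request = urllib.request.Request(
        SIMPLE_INDEX_URL, headers={"Accept": JSON_ACCEPT}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_index(response.read().decode("utf-8"))
```

Keeping the parsing separate from the download makes it easy to re-run the analysis on a saved copy of the index instead of fetching it again.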

The next step is harvesting every package name that has ever appeared on PyPI. For this, I used the ClickPy service, which stores download counts and other valuable analytics about Python packages.

ClickPy allows you to run custom SQL queries on its ClickHouse instance. With that, we can dump the table with PyPI downloads to get every package name together with its download count. The download count is valuable information because it lets us decide whether a package was popular.

By default, ClickHouse is limited to returning 100k rows per query, so you need pagination to retrieve them all. It takes around eight queries to dump the whole table. You need to install the ClickHouse CLI or use another connection method to save the results locally.
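The pagination could look roughly like the sketch below, which only builds the query strings. The table and column names (pypi.pypi_downloads, project, count) are assumptions about the ClickPy schema and should be checked against the real instance before use.

```python
from typing import Iterator

PAGE_SIZE = 100_000  # per-query row cap mentioned above

# Assumption: placeholder table name, verify it against the ClickPy schema.
TABLE = "pypi.pypi_downloads"


def paged_queries(total_rows: int, page_size: int = PAGE_SIZE) -> Iterator[str]:
    """Yield one SELECT per page; the ORDER BY keeps pages from overlapping."""
    for offset in range(0, total_rows, page_size):
        yield (
            "SELECT project, sum(count) AS downloads "
            f"FROM {TABLE} GROUP BY project ORDER BY project "
            f"LIMIT {page_size} OFFSET {offset}"
        )
```

Each generated string can then be run with whichever ClickHouse client you use, with the output appended to a local file.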

After a closer look at data quality, I found that some package names were invalid or trimmed. This means that my results are not 100 percent clean. To get better data, you could retrieve package names from the deps.dev BigQuery tables.

Analyzing collected data

We have all the necessary data for this step. Let’s figure out how many packages were removed and could be squatted.

The formula is simple: deleted packages = packages from ClickPy - packages from PyPI. The result is 91752. That is a very large number of packages that were removed and could be squatted. Let’s aim for only the popular ones.

It’s better to skip packages that have fewer than a thousand downloads, which leaves 15467 packages with 233 million downloads in total. That number is huge, though it’s worth mentioning that download counters can be inflated artificially.
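The subtraction and the popularity filter can be sketched in a few lines. The function names and the toy data in the usage example are hypothetical; the real inputs would be the download table dump and the set of names from the index API.

```python
def deleted_packages(historical: dict[str, int], current: set[str]) -> dict[str, int]:
    """Names present in the download statistics but absent from the live index."""
    return {name: count for name, count in historical.items() if name not in current}


def popular(deleted: dict[str, int], min_downloads: int = 1000) -> dict[str, int]:
    """Drop deleted packages that have fewer than `min_downloads` downloads."""
    return {name: count for name, count in deleted.items() if count >= min_downloads}


if __name__ == "__main__":
    # Toy data, not real PyPI numbers.
    historical = {"alive-pkg": 50_000, "gone-popular": 12_000, "gone-tiny": 40}
    current = {"alive-pkg"}
    candidates = popular(deleted_packages(historical, current))
    print(len(candidates), sum(candidates.values()))
```

Both sides of the subtraction need the same name normalization, otherwise a package that still exists under a differently spelled name shows up as deleted.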

During data analysis, I had the idea of building graphs of removed packages and their dependents to find something interesting. Looking ahead: I did find such packages. Some of them were valid cases, some of them were not.

You could collect all the necessary data using the following ideas:

  • Find information on who depends on removed packages using ClickPy.
  • Find information about every published package and its dependencies to find removed dependencies in it. For that, you could use deps.dev or the PyPI API.
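For the second idea, the PyPI JSON API exposes a package's declared dependencies in the info.requires_dist field, which is a list of requirement strings or null. The sketch below checks such a list against a set of removed names. The function names are my own, and the regular expression is a simplification that reads only the leading project name instead of parsing the full requirement syntax.

```python
import re
from typing import Iterable, Optional

# A requirement string starts with the project name; version specifiers,
# extras and environment markers follow and are ignored here.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def removed_dependencies(
    requires_dist: Optional[Iterable[str]], removed: set[str]
) -> set[str]:
    """Return the declared dependencies whose names are in `removed`."""
    found = set()
    for requirement in requires_dist or []:
        match = _NAME.match(requirement)
        if match and normalize(match.group(1)) in removed:
            found.add(normalize(match.group(1)))
    return found
```

Running this over every published package gives the edges of the dependency graph that point at removed names.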

At first, I tried using ClickPy, as it provides data about packages that depend on the removed ones. However, after dumping the necessary information, I found that it was not reliable because it contained data about non-existent dependencies, so the next numbers will be a bit dirty.

Based on ClickPy data, only 520 of the removed packages have dependents, 1405 in total, which gives us 228 billion downloads. That means some published packages already carry a dependency, possibly a transitive one, that could be hijacked to deliver malicious code. Some of these removed packages contain prohibited keywords or refer to a Python built-in function and can’t be registered.

Exploitation

The next stage of my research was publishing stub packages on PyPI. PyPI.org doesn’t allow security research on its platform. During my research, the accounts I used for publishing were disabled, but the packages remained accessible. This gap between disabling the accounts and removing the packages gave me time to collect analytics to present to the public.

Please do not repeat this activity after me, as it burdens support with detecting and removing stub packages. As a conscientious security researcher, I contacted support about the future of my accounts and asked them to let me remove all the published packages myself. I promised them not to repeat this activity.

Okay, let’s go back to the exploitation. I took the most downloaded deleted packages. Some of them I validated manually via GitHub search, and I even found possibilities for dependency confusion in popular projects, which were fixed after I messaged the maintainers. Others I picked only by their download count, without validation.

You can publish only 20 packages per hour, so with this limit, I was able to publish 168 packages from two accounts. Those squatted packages had 45 million downloads in the past; that is a huge number, so someone will probably download the squatted versions.

Publishing was iterative and not consistent. Research continued for one week after the first batch of packages was published and ended with the removal of packages. During the research, squatted packages were downloaded 32,036 times.

The leaders in downloads were:

  • febolt — 3056
  • flatpack — 2456
  • chiquito — 1097
  • spl-transpiler — 750
  • chalk-harness — 605

On installation, every published package raised the following error:

Installation terminated!

This is a stub package intended to mitigate the risks of dependency confusion.

It holds a once-popular package name removed by its author (or for other reasons, such as security).

This is package not intended to be installed and highlight problems in your setup.

Read more: https://protsenko.dev/dependency-confusion

The link at the end points to the article that describes dependency confusion attacks and ways to mitigate them. This page received eight times more views than before. Real users came to the page to read more about this problem.

The number of downloads was insane. Given the lack of moderation and the possibility of squatting removed package names, every user who installed a stub package could have been infected if the package had been malicious.

Mitigating the problem

How can you avoid this? Dependencies should be proactively checked for presence in public repositories. If your projects use internal package repositories with mirrored packages, those packages can be validated against PyPI.org or deps.dev for existence.

If you find a package that no longer exists publicly, either replace it or make sure it resolves only from your internal repositories, so that a higher version is never downloaded from the public index.
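One way to sketch the existence check is to ask the PyPI JSON endpoint for each package name and treat an HTTP 404 as "nobody holds this name publicly right now". The function names are hypothetical, and the status lookup is passed in as a parameter so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def pypi_status(name: str) -> int:
    """Return the HTTP status of the project's PyPI JSON endpoint."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def unclaimed_on_pypi(
    names: Iterable[str], status_of: Callable[[str], int] = pypi_status
) -> list[str]:
    """Names that return 404 on the public index and could be registered."""
    return sorted(name for name in names if status_of(name) == 404)
```

A name that is absent today can be registered by anyone tomorrow, so a check like this is more useful on a schedule than as a one-off run.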

As a source of information about deleted or revived packages, you can use the package name lists from the Deleted & Revived PyPI Package Indexes.

Packages removed from PyPI.org are a critical problem because their names can be used to deliver malicious code directly into a closed environment.
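A scan of a requirements file against such a list might look like the sketch below. It assumes the list is plain text with one package name per line, which is my assumption about the format and not something confirmed here, and it handles only simple requirement lines (no URLs or editable installs).

```python
import re

_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def load_names(text: str) -> set[str]:
    """Parse a list with one package name per line (assumed format)."""
    names = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            names.add(normalize(stripped))
    return names


def flag_requirements(requirements_text: str, risky: set[str]) -> list[str]:
    """Return the requirement names that appear in the risky set."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip options such as -r, -e
            continue
        match = _NAME.match(line)
        if match and normalize(match.group(0)) in risky:
            flagged.append(normalize(match.group(0)))
    return flagged
```

A non-empty result is a reason to look at where that dependency actually resolves from before the next install.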

At the end

It was a pleasant journey to play with data from ClickHouse and BigQuery instances, along with the PyPI and deps.dev APIs. There are more attack vectors that should be investigated, but for now, the article is ending.

Follow me on LinkedIn to learn about new articles. Links are in the footer of the page.

