Dark Web Scraping

On June 18, an anonymous threat actor posted on a popular underground forum what he called the “Global Decision Makers Database”: an inventory of 10 million records collected from LinkedIn, including users’ names, email addresses, and personal profile information.

Figure 1: An actor shares a database of 10 million records collected from LinkedIn

A statement issued by LinkedIn confirmed the initial reports by PrivacyShark researchers, clarifying that the data published on the underground hacking forum did not come from a data breach but was harvested by ‘scraping’ publicly available member profile data from the site. This assessment seems reasonable, as the records in the data dump do not appear to include private information such as credit card details, personal communications, or passwords. Even so, the fact that a threat actor was able to abuse the LinkedIn API to aggregate this data at scale is a serious cause for concern.

Scraped Databases

‘Scraping’ is a popular method of automated data extraction and collection, used to retrieve publicly available data and ‘dump’ it into a large, structured, and usable database. It is simple to execute: instead of breaking into a server or database, the threat actor abuses platform features or weaknesses, such as public pages and loosely rate-limited APIs, to gather data that is already publicly available.
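
To illustrate how publicly visible markup becomes a structured record set, here is a minimal sketch using only Python's standard library. It parses a hard-coded sample page rather than a live site, and the markup, class names, and the `extract_records` helper are assumptions made for illustration; they do not reflect any real platform or the tool used in the LinkedIn case.

```python
from html.parser import HTMLParser

# Hypothetical public profile markup; class names are illustrative assumptions.
SAMPLE_PAGE = """
<div class="profile"><span class="name">Alice Example</span><span class="title">CFO</span></div>
<div class="profile"><span class="name">Bob Example</span><span class="title">CTO</span></div>
"""


class ProfileParser(HTMLParser):
    """Collects name/title pairs from profile blocks into a list of dicts."""

    FIELDS = {"name", "title"}

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per profile block
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "profile":
            self.records.append({})
        elif tag == "span" and cls in self.FIELDS and self.records:
            self._field = cls

    def handle_data(self, data):
        # Store text only while inside a recognized field span.
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_records(html):
    """Turn page markup into a list of structured records."""
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_records(SAMPLE_PAGE)
```

Repeating this kind of extraction across millions of public pages is what produces the large, uniform databases described in this post.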

The deep and dark web are replete with compromised data, sometimes shared for free and sometimes sold. In many cases, actors note that the data they offer was scraped.

The most popular targets of scraping are social media sites. For example, the posts below include data scraped from Facebook, Instagram, and Clubhouse:

Figure 2: An actor shares a database of 50 million Facebook users, noting that this was scraped in 2020-2021

Figure 3: An actor notes that he found a database of nearly 500 million Instagram users

Figure 4: An actor shares a scraped database of Clubhouse users

Scraped data offers many opportunities for threat actors. The leaked personal and professional records may be used to send spam or to automate targeted phishing attacks, in which emails cite a user’s own personal details to “prove” that the message is official. Actors can also comb the data for appealing targets for more advanced social engineering attacks or identity theft.

Scraping Tools

Actors seeking to scrape and extract data on their own need not look very far. Scraping tools are something of a commodity item on the deep and dark web, often included in larger hacking or cracking “packs.” For example, this collection of cracking tools contains Lol.Boosted Scraper, SLC Scraper v2, and Steam Username Scraper.

Figure 5: A ‘cracking pack’ containing three scraping tools

Similarly, this cracking pack contains several scraping tools:

Figure 6: A cracking pack containing several scraping tools

There are also scraping tools built for specific account providers. This Universal Reddit Scraper v3.3 can scrape Redditors and subreddits:

Figure 7: A Reddit scraping tool

Likewise, this Telegram scraper collects from Telegram groups:

Figure 8: A Telegram scraping tool

Finally, this actor posted that he wants to buy a Facebook scraping tool that can also automatically report groups as abusive until the platform flags them.

Figure 9: An actor seeks to buy a Facebook scraping tool

Conclusion

While it is unclear which tool the actor used to collect the LinkedIn data, scraping tools are widely available to deep and dark web actors looking to assemble a massive data dump, whether for personal use or for sale.

The output of these tools, data dumps containing hundreds of millions of entries, is also shared in considerable quantity. The scraped data can prove highly valuable to threat actors, who can use the collected personal information as a launchpad for more sophisticated and damaging attacks.

We recommend that social media users exercise caution about sharing too much personal information, especially in a publicly viewable format, as this data might appear in the next big data dump. Account providers, meanwhile, ought to adopt more stringent methods to prevent large-scale data scraping.
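
As one example of what such a method could look like, the sketch below implements a per-client sliding-window rate limiter in Python. It is a simplified illustration, not a description of any provider's actual defenses: the `SlidingWindowLimiter` class, the threshold of three requests per 60 seconds, and the client identifiers are all assumptions chosen for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request timestamps

    def allow(self, client_id, now=None):
        """Return True if the request may proceed, False if it should be throttled."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical policy: three profile views per client per 60 seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
decisions = [limiter.allow("client-a", now=t) for t in (0.0, 1.0, 2.0, 3.0)]
```

In practice, a limit like this would be only one layer. Determined scrapers rotate accounts and IP addresses, so providers typically combine throttling with behavioral detection and restrictions on what unauthenticated visitors can see.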