Collection #1 Data Breach Analysis – Part 1

Cybersecurity expert Marco Ramilli has analyzed the huge trove of data, called Collection #1, that was first disclosed by Troy Hunt.

A few weeks ago I wrote about “How Data Breaches Happen”, where I shared some publicly available “pasties” related to apparently (untested) SQLi-vulnerable websites. One of the most famous data breaches of the past few years is unfolding these days. I am not saying that the two events are linked, but I enjoy thinking that events happen in bursts. Many magazines around the world wrote about the data breach (Collection #1) disclosed by Troy Hunt, which involves 773 million new records. Today I’d like to share a quick, partial analysis of what I have been able to extract from those records (I grabbed the data from publicly available paste websites).

First of all, let me say that the work was very difficult (at least for me), since the sheer size of the collected data required a huge amount of computational power and very high-speed internet access. To analyze such a large data breach, I used a powerful Elastic Search Cloud instance and wrote a tiny Python script to import the very dirty data into a common format. Some records could not be loaded because of their format, charset or other problems, so please assume a relative error of roughly 4 to 5% in the following analyses.
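The import script itself is not part of this post. As a rough illustration only, a normalization step of this kind might look like the sketch below. The separators, the output field names and the helper names parse_record and load_records are assumptions made for the example, not the format actually used in the analysis.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed separators between e-mail and password; real dumps vary widely.
_SEPARATORS = re.compile(r"[:;|\t]")
# Deliberately loose e-mail check: something@something.something
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def parse_record(raw: bytes) -> Optional[dict]:
    """Turn one raw dump line into a common dict format, or None if unusable."""
    # Decode leniently: undecodable bytes become U+FFFD instead of raising.
    line = raw.decode("utf-8", errors="replace").strip()
    # Split on the first separator only, so passwords may contain separators.
    parts = _SEPARATORS.split(line, maxsplit=1)
    if len(parts) != 2:
        return None
    email, password = parts[0].strip().lower(), parts[1]
    if not _EMAIL.match(email) or not password:
        return None
    return {
        "email": email,
        "domain": email.rsplit("@", 1)[1],
        "password": password,
    }


def load_records(lines: Iterable[bytes]) -> Tuple[List[dict], int]:
    """Parse every line and also count the ones that had to be skipped."""
    records, skipped = [], 0
    for raw in lines:
        record = parse_record(raw)
        if record is None:
            skipped += 1
        else:
            records.append(record)
    return records, skipped
```

Counting the skipped lines alongside the parsed ones is what makes it possible to quote a relative error for everything computed afterwards.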

One of the first questions I wanted to answer was: “What are the most used passwords?” I am aware that many researchers have written about the most used passwords, but now I have the opportunity to measure them myself, on passwords that were really used. So let’s see what the most used passwords out there are!

Collection #1 PARTIAL Analysis on used passwords

So far the most used passwords are “123456”, “q1w2e3r4t5y6”, “123456789” and “1qaz2wsx3edc”, followed by common passwords such as “12345678” and “qwerty”. Comparing the graph above with earlier research on frequently used passwords, we can see a significant difference: pattern complexity! While years ago the most used passwords were names, dates or simple patterns such as “qwerty”, today we observe a significant increase in pattern complexity, although these passwords are still too easy to brute-force.
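For readers who want to reproduce this kind of ranking on a small sample, here is a minimal sketch using only the Python standard library. It is not the Elastic Search query behind the graph. The function names are made up for the example, and the character-class count is only a crude, assumed proxy for “pattern complexity”.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_passwords(passwords: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent passwords with their counts."""
    return Counter(passwords).most_common(n)


def char_classes(password: str) -> int:
    """Count how many character classes (digit, lower, upper, other) appear."""
    checks = (str.isdigit, str.islower, str.isupper)
    classes = sum(any(check(ch) for ch in password) for check in checks)
    if any(not ch.isalnum() for ch in password):
        classes += 1
    return classes


sample = ["123456", "qwerty", "123456", "q1w2e3r4t5y6", "123456", "qwerty"]
ranking = top_passwords(sample, 2)   # [("123456", 3), ("qwerty", 2)]
simple = char_classes("123456")      # 1: digits only
mixed = char_classes("q1w2e3r4t5y6") # 2: digits and lowercase letters
```

A password such as “q1w2e3r4t5y6” scores higher than “123456” on this measure, yet it is still a plain keyboard walk, which is consistent with the point that more complex patterns are not necessarily hard to guess.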

A second question arose while looking at the leaked emails: “Which domain names appear most often among the leaked emails?” These domains are not the most vulnerable ones but rather the most used ones. I am not saying that those domains are or have been vulnerable or pwned; I am trying to find out which email providers are leaked most often. In other words, if you receive an email from “@gmail.com”, what is the probability that the account has been leaked and potentially compromised? I cannot answer that question, since I do not know the total number of “@gmail.com” accounts around the world, but I think the list of the most leaked email domain names is still a nice indicator.

The most leaked emails come from “yahoo.com”, “gmail.com”, “aol.com” and “hotmail.com”. This is quite interesting, since we are mostly looking at personal email providers (domains) rather than professional ones (such as company.com). So apparently attackers are mostly focused on targeting people rather than companies (maybe by attacking non-professional websites and/or distributing malware to people rather than to company domains). Another interesting figure is the number of unique leaked email domain names: 4426 so far!
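The same kind of count can be sketched in a few lines of Python. The snippet below is an illustration on made-up addresses, not the query used for the numbers above, and the function name domain_counts is hypothetical. Note that it yields absolute counts only: turning them into a probability would require dividing by the total number of accounts per provider, which is unknown.

```python
from collections import Counter
from typing import Iterable


def domain_counts(emails: Iterable[str]) -> Counter:
    """Count leaked addresses per e-mail domain, ignoring malformed entries."""
    counts: Counter = Counter()
    for email in emails:
        # rpartition splits on the last "@", so odd local parts stay intact.
        local, sep, domain = email.strip().lower().rpartition("@")
        if sep and local and domain:
            counts[domain] += 1
    return counts


emails = ["a@yahoo.com", "b@Gmail.com", "c@yahoo.com", "broken", "d@aol.com "]
counts = domain_counts(emails)
most_leaked = counts.most_common(1)  # [("yahoo.com", 2)]
unique_domains = len(counts)         # 3
```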

Finally, it would be great to know where the data comes from! At this point I have no evidence for what I am about to write, but I made some deductions from the structure of the leaked data. The following image shows the collection-1 structure.

Each folder holds .TXT files whose names look like domain names. Some of them really are domain names (tested), some others are on sale right now, and many others just look like domains, but I have no evidence about them. Anyway, I decided to assume that the file names that look like domain names are the domains from which the attacker leaked information. With that in mind, we might deduce where the attacker extracted the data (usernames and passwords) from and make a personal evaluation of the leaked information.
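A purely syntactic version of the “looks like a domain name” test could be sketched as follows. The regular expression and the helper name candidate_domain are assumptions made for illustration, and the file names in the example are invented. A match only says the name has the shape of a domain; whether it is registered or reachable would need a separate check (for instance a DNS lookup), which is left out here.

```python
import re
from pathlib import PurePath
from typing import Optional

# Shape of a host name: dot-separated labels followed by an alphabetic TLD.
_DOMAIN = re.compile(
    r"^(?=.{1,253}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def candidate_domain(file_name: str) -> Optional[str]:
    """Return the file name without its extension if it is shaped like a domain."""
    stem = PurePath(file_name).stem.lower()  # "example.com.TXT" -> "example.com"
    return stem if _DOMAIN.match(stem) else None


# Invented file names, only to show the behaviour.
names = ["example.com.txt", "My-Shop.co.uk.TXT", "dump_2018.txt", "readme.txt"]
candidates = [candidate_domain(name) for name in names]
# candidates == ["example.com", "my-shop.co.uk", None, None]
```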

Are you interested in Marco Ramilli’s conclusions? Take a look at his post:
https://marcoramilli.com/2019/01/19/collection-i-data-breach-analysis-part-1/

This post Collection #1 Data Breach Analysis – Part 1 originally appeared on Security Affairs.