Introduction / Goals / Scope:
This is a follow-up to my previous blog post looking at how to install/run the new John the Ripper Tokenizer attack [Link]. The focus of this post will be on performing a first-pass analysis of how the Tokenizer attack actually performs.

Before I dive into the tests, I want to take a moment to describe the goals of this testing. My independent research schedule is largely driven by what brings me joy. Because of that, I’m trying to get better at scoping efforts to something I can finish in a couple of days. It’s easy to be interested in something for a couple of days! Therefore, my current plan is to run a couple of tests to get a high-level view of how the Tokenizer attack performs and then see where things go. To that end, this particular blog post will focus on three main “tests” to answer a couple of targeted questions.

Test 1: Analyze how sensitive Tokenizer is to the size of the training data
Question: How sensitive is the Tokenizer attack to being trained on 1 million vs. 30+ million passwords?
Impact: Knowing this is important since it determines if the Tokenizer attack can be effective when trained on smaller datasets. This could be a community or language specific target, or a dataset targeting a specific password creation policy.
Secondary Reason: Identifying early on how sensitive Tokenizer is to the training size will help inform the other testing options I have available to me. For example, can I train it on a subset of RockYou passwords, and then test it against a different subset from that same breach? Also, full disclosure, I made a mistake somewhere along the line of training the Tokenizer in my previous blog post that led me to think it was more sensitive to the training data size than it actually was.

Test 2: Compare a short (5 billion guess) Tokenizer attack against Incremental and OMEN.
Tests:
Note on Testing Tools:
Test 1: Analyze how sensitive Tokenizer is to the size of the training data
Training: RockYou
Origin: There are several different LinkedIn datasets from the 2012 LinkedIn data breach [Link]. For this test, I’m going to use the original dump that only included around 6.4 million hashes. This dump also had malformed hashes where the first five hex characters of some hashes were replaced by 0’s. I’m using this dataset vs. some of the later (and larger) datasets since it’s been analyzed in many different academic papers.

Obtaining the List: You can download the list from skullsecurity [Link]. I probably should compare my copy of the list to that one, so there might be some differences, but I figure it’s important to point out where other researchers can get a copy.

Cracking the List: You can crack the list using the default Hashcat raw-sha1 format since by default Hashcat ignores the first five hex characters of the hash. I wrote about that more [here]. If you are cracking these hashes in John the Ripper, you need to use the format “raw-sha1-linkedin”.

Obtaining plains: For this attack I was curious how effective the Hashmob plains list would be. Hashmob is a collaborative password cracking site that has some very skilled members (they won this year’s CMIYC competition). So I decided to try it out and promptly fell down a rabbit hole. Before I detour into that research, let me finish up the dataset description.

Size of Dataset vs. Cracks: 6,458,020 passwords / 5,980,436 cracked. 92% success rate.

Total Side Tangent on LinkedIn List + Hashmob Wordlists:
Test 1 Results:
Test 1 Analysis:
The two tokenizer attacks trained on 1 million passwords performed very similarly (you almost can’t see the second line on the graph). This is a good result since it points to the attack being somewhat resilient to minor differences in the training data. You will notice though that the tokenizer attack trained on the full 32 million RockYou passwords does perform noticeably better. There are a lot of additional questions that come to mind about this, but I’m going to let these results stand alone for your interpretation and move on to the next set of planned tests.
Bonus Analysis and Correction:
In my previous post I posted the first 25 guesses my training of tokenizer produced, and they looked “weird”. SolarDesigner replied with what they were seeing when running their own copy, which was very different (and looked more like what I originally expected) [Link]. I reran all my training, and then started getting similar results to Solar. Long story short, somewhere along the way with my troubleshooting and figuring out this attack, I made a mistake. Here are the updated results of the first 25 guesses generated by tokenizer with the RockYou training data above, along with the results Solar provided:
Test 2: Compare a Tokenizer attack against Incremental and OMEN
Training:
Test 2a Results:
This was interesting, but you really can’t see what’s going on at the start of the password cracking session. So the next graph is the same test/data, but just zoomed in to the first 20 million guesses.
Test 2b Results:
Test 2(AB) Analysis:
Not a lot of surprises here, which is good. OMEN is a very effective attack mode, so that was always a tough one to beat. The challenge with OMEN is the lack of an indexing function (aka being able to tell it “generate password at position 2941932”), which leads to complications with pausing/restarting cracking sessions. So I generally use Incremental mode in my real password cracking sessions. It’s just easier. Which means that having the Tokenizer attack improve upon standard Incremental mode is a big deal.

Side note: I try to point this out whenever talking about OMEN, but you’ll notice the sawtooth success rate, as it tends to crack more passwords at the start of each OMEN “level”. This highlights significant room for improvement if any researchers want to look into this. Ideally you’d like to have a smoother graph that frontloads all your effective guesses near the beginning of your cracking session.
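To show what I mean by an indexing function, here is a minimal Python sketch for the simplest possible case: a fixed-length brute-force keyspace. This is not how Incremental mode or OMEN actually order their guesses, and the function name and parameters are made up for illustration. The point is only that when a guess can be computed directly from its position, resuming a session is just a matter of remembering one number.

```python
def candidate_at(index, charset, length):
    """Return the guess at position `index` in a fixed-length brute-force keyspace.

    Illustrative only: this treats the index as a base-len(charset) number.
    Real attack modes order their guesses very differently.
    """
    if not 0 <= index < len(charset) ** length:
        raise IndexError("index is outside the keyspace")
    chars = []
    for _ in range(length):
        # Peel off one "digit" at a time, least significant first.
        index, digit = divmod(index, len(charset))
        chars.append(charset[digit])
    return "".join(reversed(chars))


def resume(start, stop, charset, length):
    """Generate guesses from position `start` up to (not including) `stop`."""
    for position in range(start, stop):
        yield candidate_at(position, charset, length)


if __name__ == "__main__":
    # Pick a session back up at position 3 of the two-character "abc" keyspace.
    print(list(resume(3, 6, "abc", 2)))
```

Without something like candidate_at(), the only way to get back to a given position is to regenerate (and throw away) every guess before it.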
Test 3: Compare Tokenizer and CutB as Part of a Larger Password Cracking Session
For this last test I wanted to simulate a larger cracking session. For this I’m loosely going to base my attacks on EvilMog’s “Random AD Methodology” described [Here]. By loosely I mean I’m just going to simulate the first three steps:
- run rockyou with -g 100000 or all the rulesets combined
- (Comparison point) run expander (modified to max at 8 or 10), and then run -a1
- (Comparison point) run cutb with -a1
For the first step, I’m going to use the full RockYou wordlist (only unique words) and the “Hashcat” ruleset in John the Ripper. I figure that gets close to the intention of step #1 without having to resort to making 100k random rules up on the spot. The John the Ripper “Hashcat” ruleset is actually a collection of rules from the Hashcat repo modified to work with JtR:
[List.Rules:hashcat]
.include [List.Rules:best64]
.include [List.Rules:d3ad0ne]
.include [List.Rules:dive]
.include [List.Rules:InsidePro]
.include [List.Rules:T0XlC]
.include [List.Rules:rockyou-30000]
The challenge from an analysis perspective is that these attacks generate an absolute ton of guesses! The main reason for the large number of guesses is that there are a lot of rules in all of these rule files, and the RockYou input wordlist at 14 million+ words is pretty hefty. There is room for improvement though, since this combined mangling rule list isn’t optimized. For example, all of these rule files are designed to be run individually, so there is significant overlap in mangling rules between them, which generates a large number of duplicate guesses. A smaller nitpicky point is that none of these rulesets have “reject” functions built into them, so every mangling rule is applied to every input word regardless of whether the mangling rule would actually change that word. The reason I’m highlighting this isn’t to criticize the rules. I simply want to point out there are areas to improve if anyone wants to dive into that (spoiler: I do not).

Ignoring that digression, I guess what I’m trying to say is that if I ran this attack with the RockYou wordlist on my research laptop and piped it into checkpass.py (which itself can be a bit slow), the attack would take me around two weeks to complete. To that end, I ran a “quick” attack of just 5 billion guesses, which gets through the best64 ruleset and into the d3ad0ne ruleset, using checkpass.py simply because I wanted to compare that to my previous graphs. I then launched all these attacks for real on a different computer to create a potfile of all the passwords cracked using these attacks.

(Future Improvement): Hashcat supports the ability to record “guess position” in the outfiles (potfiles) it generates. I’ve never really used that, but I plan on looking into that feature in a future “improve my testing process” research sprint. For now though, it’s just easier to launch JtR and let it run while I do other things.
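Going back to the duplicate-guess point above, here is a toy Python sketch of why a stack of unoptimized rules wastes guesses. The four “rules” and the three-word wordlist are invented for illustration (they are not taken from best64 or any of the other rule files), and real rule engines are far more involved. The total number of guesses is always words times rules, but the number of distinct guesses can be quite a bit lower when rules overlap or don’t change a word.

```python
def keep(word):
    """Try the word unchanged."""
    return word


def lowercase(word):
    """Lowercase the whole word (a no-op on already-lowercase words)."""
    return word.lower()


def capitalize(word):
    """Uppercase the first letter, lowercase the rest."""
    return word.capitalize()


def append_one(word):
    """Append the digit 1."""
    return word + "1"


def count_guesses(words, rules):
    """Return (total guesses made, distinct guesses made)."""
    total = 0
    distinct = set()
    for rule in rules:
        for word in words:
            total += 1  # every rule is applied to every word, no rejects
            distinct.add(rule(word))
    return total, len(distinct)


if __name__ == "__main__":
    words = ["password", "Password", "monkey"]
    rules = [keep, lowercase, capitalize, append_one]
    total, distinct = count_guesses(words, rules)
    print(f"{total} guesses made, {distinct} of them distinct")
```

The same multiplication is why the real session gets so big: a wordlist of 14 million+ words times every rule in six combined rule files adds up very quickly.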
While I could be more scientific about it, given the 14 million+ word wordlist (Rockyou-Unique) and the Best64 ruleset (which has slightly more than 64 rules), the Best64 ruleset finishes up somewhere around 1 billion guesses, which is pretty evident from the graph above. The other Hashcat rulesets are not nearly as optimized. This does highlight though that starting a password cracking session off with a “smart” dictionary attack is still one of the best ways to crack passwords quickly. As I mentioned, I then ran the full cracking session to completion using John the Ripper against the hashed LinkedIn passwords. I’ll be using the found/non-found lists from that full run in the following tests. The results of running the full Hashcat rules attack vs. LinkedIn can be seen below.
Introduction to Hashcat Utils:
For this test, steps #2 and #3 involve using expander and cutb. If you are not familiar with these tools, they are part of Hashcat Utilities [Link]. While you can build the tools in Hashcat Utilities from source [Link], the latest release binaries are available [Here]. As to what Hashcat Utilities are, you can get more detailed information from the first link above, but at a high level they are a set of tools that each perform one specific task. Many of them can be chained together (or used stand-alone) to create targeted wordlists, which is how we’ll be using them in this experiment.

Expander: This tool mangles and creates new combinations of words from individual characters found in each word in the input dictionary. The actual operation is a bit weird, but imagine you wrote the input word on a piece of paper and then folded the paper into a circle so the word is like a bracelet. Expander then creates new words by taking cuts out of that bracelet. So “password123” can generate the guess “3pas” as it wraps around. By default it will generate all 1-4 letter combinations from the input wordlist that is piped to it. Here is an example of me running expander with one input “word”.
echo password123 | ./expander.bin
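To make the bracelet picture concrete, here is a minimal Python sketch of that idealized mental model. To be clear, this is not a port of the real expander source: the function name and defaults are my own, and the sketch emits every wrap-around cut, which the real tool does not do, so its output will not match expander’s exactly.

```python
def bracelet_cuts(word, min_len=1, max_len=4):
    """Return every substring of `word` read around a loop.

    Idealized model only: the real expander skips some of these cuts.
    """
    n = len(word)
    doubled = word + word  # lets a slice run past the end and wrap around
    cuts = set()
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n):
            cuts.add(doubled[start:start + length])
    return cuts


if __name__ == "__main__":
    cuts = bracelet_cuts("password123")
    print("3pas" in cuts)  # a cut that wraps from the end back to the start
    print(len(cuts), "distinct cuts from one input word")
```

Using a set here plays the same role as running “sort -u” on the real tool’s output, since repeated letters (like the double “s” in “password”) produce repeated cuts.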
Side note: I was really surprised by guesses Expander didn’t make. For example “23pa” was not generated. So it’s not an exhaustive list and there are some exceptions in the substrings it generates.

While Expander will by default only generate 1-4 letter guesses, you can increase this by changing a macro variable in the source and recompiling it. Some people will have multiple versions of expander built with the length of guesses they generate appended to the filename. For example “expander8.bin”. Another approach to make longer guesses without having to recompile the code is to combine multiple runs of “length 4” expander using Hashcat’s combinator mode (attack mode “-a 1”) to generate longer password guesses.

Expander is the basis of what’s been called a “Fingerprint” attack. This was first described by pure_hate in the following blogpost where they used it as part of the 2010 CMIYC competition [Link]. A more modern take and example of using a Fingerprint attack can be found [Here].

Now, you generally need to be selective in the input wordlists you feed to Expander since this attack can very quickly get to the point where it’s almost equivalent to a full dumb brute-force attack. You also need to make sure you “sort -u” the outputs of Expander since it often generates a ton of duplicate guesses. Because of this, I generally wouldn’t recommend using Expander on normal password cracking wordlists. Instead, people will often use Expander on previously cracked passwords to get new cracks. For example:
- Remove the hashes from a standard hashcat potfile and save the results in plains.txt. Note: Unlike John the Ripper’s “--show” command, this will output everything in the potfile vs. generating individual lines for each target hash.
cat hashcat.potfile | cut -d: -f2- | sort -u > plains.txt
- Pipe the plains into expander to create the “base” wordlist.
cat plains.txt | expander | sort -u > plains_expanded.txt
- Run a basic hashcat combinator attack (-a 1) using the plains_expanded.txt wordlist.
hashcat -m HASH_MODE -a 1 TARGET_HASHES.hash plains_expanded.txt plains_expanded.txt

To continue to build this out and target passwords greater than 8 characters long, you can re-run variations of the above commands as follows:
Description of Test 3 Attacks:
Tokenizer_RockyouFull: I’m going to use the version of Tokenizer trained on the full list of 32 million+ RockYou passwords.
Tokenizer_LinkedinPot: This version of Tokenizer is going to be trained on the LinkedIn passwords cracked during the Hashcat rules wordlist attack using the Rockyou_Unique wordlist. Aka I’m training it on the potfile from a previous attack. I’m including duplicated guesses in the training set by generating a list using “./john --show --format=raw-sha1-linkedin --pot=TESTING_POTFILE”. The goal of this attack is to try and make a direct comparison of Tokenizer to CutB and Expander.
Description of Test 3 Target:
All attacks will be run against the remaining uncracked passwords from the 2012 LinkedIn password list after the JtR Hashcat rules with Rockyou-Unique wordlist have been run against it. Each attack will be run for 5 billion password guesses. This is a very short runtime for these attacks. Normally these attacks will generate trillions of password guesses. Future testing might include Hashcat’s outfile debugging formats or running the attacks for a set time (days/weeks), but I figure 5 billion guesses can start to indicate how these attacks will compare to each other.
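For anyone who wants to reproduce this kind of “cracks vs. guesses” measurement, here is a minimal Python sketch of the idea. It is not checkpass.py, and the function name, parameters, and sample data are all made up for illustration: it walks a stream of guesses, stops at a guess limit, and records how many target passwords have been cracked at regular checkpoints.

```python
from itertools import islice


def crack_curve(guesses, targets, max_guesses, step):
    """Return [(guess_number, cracked_so_far), ...] checkpoints.

    guesses: any iterable of candidate passwords, in the order they are tried
    targets: the plaintext passwords we are trying to crack
    """
    remaining = set(targets)
    cracked = 0
    curve = []
    n = 0
    for n, guess in enumerate(islice(guesses, max_guesses), start=1):
        if guess in remaining:
            remaining.remove(guess)  # each target only counts once
            cracked += 1
        if n % step == 0:
            curve.append((n, cracked))
    if not curve or curve[-1][0] != n:
        curve.append((n, cracked))  # always record the final position
    return curve


if __name__ == "__main__":
    guesses = ["123456", "password", "123456", "letmein", "monkey", "qwerty"]
    targets = {"password", "monkey", "dragon"}
    for guess_number, cracked in crack_curve(guesses, targets, max_guesses=5, step=2):
        print(guess_number, cracked)
```

In a real run the guesses would come from standard input (the attack tool’s stdout), max_guesses would be 5 billion, and the checkpoints are what get plotted.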
Test 3 Results:
As for the graph of the results, see below. As a disclaimer, due to the small number of cracks vs. the total size of the list, don’t read too much into it:
Analysis of Test 3 Results:
While it’s never fun to say that the biggest finding is that your test setup is flawed, that’s my main takeaway from these tests. When looking at the results, 5 billion guesses is way too low a number to analyze these attacks after trillions of guesses have been made running wordlist attacks. Going back to Test 2, Tokenizer cracked over 1 million passwords when it was run as the first attack (quick disclaimer: this is not a direct comparison due to the different training sets for Tokenizer). So when it cracks just 14k unique passwords more than the Hashcat Rules based attacks, that shows a strong overlap in the guesses that these two attacks are making.

This is a long way of saying that, after an initial very long run using the Hashcat Rules attack against LinkedIn, I don’t expect any non-wordlist based attack to do very well in just 5 billion guesses. So while it’s easy for me to make fun of Expander, I really can’t make any definitive statement about how these attacks perform in real life unless I run a cracking session that represents several days with a GPU.

Looking at the bright side, I’m glad I ran this test. It forced me to better understand some of the tools in Hashcat Utilities, and to start identifying what future tests should look like as well as the gaps in my testing strategies.
Future Research Ideas:
First seen on securityboulevard.com
Jump to article: securityboulevard.com/2024/11/analyzing-jtrs-tokenizer-attack-round-1/