Detecting encrypted traffic with frequency analysis
Let’s start with a little disclaimer:
I am not a cryptanalyst. I am not a mathematician. It is quite possible that I am a complete idiot. You decide.
With that out of the way, let’s begin.
NSM advocates the capture of, amongst other things, full-content data. It is often said that there’s no point in performing full-content capture of encrypted data that you can’t decrypt – why take up disk space with stuff you’ll never be able to read? It’s quite a valid point – one of the networks I look after carries quite a bit of IPSec traffic (tens of gigabytes per day), and I exclude it from my full content capture. I consider it enough, in this instance, to have accurate session information from SANCP or Netflow which is far more economical on disk space.
That said, you can still learn quite a bit from inspecting full-content captures of encrypted data – there is often useful information in the session setup phase that you can read in clear text (e.g., a list of ciphers supported, or SSH version strings, or site certificates, etc.). It still won’t be feasible to decrypt the traffic, but at least you’ll have some clues about its nature.
While we’re talking about full content, I suppose I should briefly address the issue of encryption. Yes, encryption is a problem. Shoot, even binary protocols, obscure protocols, and the like make understanding full content difficult and maybe impossible. Yes, intruders use encryption, and those that don’t are fools. The point is that even if you find an encrypted channel when inspecting full content, the fact that it is encrypted has value.
That sounds reasonable to me. If you see some encrypted stuff and you can’t account for it as legitimate (run of the mill HTTPS, expected SSH sessions, etc.) then what you’re looking at is a definite Indicator, worthy of investigation.
So, let’s just ask our capture-wotsits for all the encrypted traffic they’ve got, then, shall we? Hmm. I’m not sure of a good way to do that (if you do, you can stop reading now and please let me know what it is!).
…I’ve got an idea.
Frequency analysis is a useful way to detect the presence of a substitution cipher. You take your ciphertext and draw a nice histogram showing the frequency of all the characters you encounter. Then you can make some assumptions (like the most frequent character was actually an ‘e’ in the plaintext) and proceed from there.
However, the encryption protocols you’re likely to encounter on a network aren’t going to be susceptible to this kind of codebreaking. The ciphertext produced by a decent algorithm will be jolly random in nature, and a frequency analysis will show you a “flat” histogram.
So why am I talking about frequency analysis? Because this post is about detecting encrypted traffic, not decrypting it.
Over at Security Ripcord, there’s a really nifty tool for drawing file histograms. Take a look at the example images – the profile of the histograms is pretty “rough” in nature until you get down to the Truecrypt example – it’s dead flat, because a decent encryption algorithm has produced lots and lots of nice randomness (great terminology, huh? Like I said, I’m not a cryptanalyst or a mathematician!)
So, here’s the Crazy Plan for detecting encypted traffic:
- Sample X contiguous bytes of a given session (maybe twice, once for src->dst and once for dst->src). A few kilobytes ought to be enough to get an idea of the level of randomness we’re looking at.
- Make your X-byte block start a little way into the session, so that we don’t include any plaintext in the session startup.
- Strip off the frame/packet headers (ethernet, IP, TCP, UDP, ESP, whatever) so that you’re only looking at the packet payload.
- Perform your frequency analysis of your chunk of payload, and “measure the resultant flatness”.
- Your “measure of flatness” equates to the “potential likelihood that this is encrypted”.
Perhaps one could assess the measure of flatness by calculating the standard deviation of the character frequencies? Taking the Truecrypt example, this is going to be pretty close to zero; the TIFF example is going to yield a much higher standard deviation.
Assuming what I’ve babbled on about here is valid, wouldn’t it be great to get this into Sguil? If SANCP or a Snort pre-processor could perform this kind of sampling, you’d be able to execute some SQL like this:
select [columns] from sancp where src_randomness < 1 or dst_randomness < 1
…and you’d have a list of possibly encrypted sessions.
How’s that sound?
This post has been updated here.
Check out InfoSec Institute for IT courses
including computer forensics boot camp training.
Alec Waters is responsible for all things security at Dataline Software, and can be emailed at alec.waters(at)dataline.co.uk