Detecting encrypted traffic with frequency analysis

Let’s start with a little disclaimer:

I am not a cryptanalyst. I am not a mathematician. It is quite possible that I am a complete idiot. You decide.

With that out of the way, let’s begin.

NSM advocates the capture of, amongst other things, full-content data. It is often said that there’s no point in performing full-content capture of encrypted data that you can’t decrypt – why take up disk space with stuff you’ll never be able to read? It’s quite a valid point – one of the networks I look after carries quite a bit of IPSec traffic (tens of gigabytes per day), and I exclude it from my full content capture. I consider it enough, in this instance, to have accurate session information from SANCP or Netflow which is far more economical on disk space.

That said, you can still learn quite a bit from inspecting full-content captures of encrypted data – there is often useful information in the session setup phase that you can read in clear text (e.g., a list of ciphers supported, or SSH version strings, or site certificates, etc.). It still won’t be feasible to decrypt the traffic, but at least you’ll have some clues about its nature.

A while ago, Richard wrote a post called “Is it NSM if…” where he says:

While we’re talking about full content, I suppose I should briefly address the issue of encryption. Yes, encryption is a problem. Shoot, even binary protocols, obscure protocols, and the like make understanding full content difficult and maybe impossible. Yes, intruders use encryption, and those that don’t are fools. The point is that even if you find an encrypted channel when inspecting full content, the fact that it is encrypted has value.

That sounds reasonable to me. If you see some encrypted stuff and you can’t account for it as legitimate (run of the mill HTTPS, expected SSH sessions, etc.) then what you’re looking at is a definite Indicator, worthy of investigation.

So, let’s just ask our capture-wotsits for all the encrypted traffic they’ve got, then, shall we? Hmm. I’m not sure of a good way to do that (if you do, you can stop reading now and please let me know what it is!).

But…

…I’ve got an idea.

Frequency analysis is a useful way to detect the presence of a substitution cipher. You take your ciphertext and draw a nice histogram showing the frequency of all the characters you encounter. Then you can make some assumptions (like the most frequent character was actually an ‘e’ in the plaintext) and proceed from there.
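As a concrete (and purely illustrative) sketch of the counting step, here's how a byte-frequency histogram might be built in a few lines of Python – the function name and sample string are my own inventions, not part of any real tool:

```python
from collections import Counter

def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-entry list: how often each byte value occurs in data."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]

# English plaintext is heavily skewed: a handful of characters dominate.
hist = byte_histogram(b"the quick brown fox jumps over the lazy dog")
most_common = max(range(256), key=lambda value: hist[value])
print(chr(most_common))  # the space character dominates this sample
```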

However, the encryption protocols you’re likely to encounter on a network aren’t going to be susceptible to this kind of codebreaking. The ciphertext produced by a decent algorithm will be jolly random in nature, and a frequency analysis will show you a “flat” histogram.

So why am I talking about frequency analysis? Because this post is about detecting encrypted traffic, not decrypting it.

Over at Security Ripcord, there’s a really nifty tool for drawing file histograms. Take a look at the example images – the profile of the histograms is pretty “rough” in nature until you get down to the Truecrypt example – it’s dead flat, because a decent encryption algorithm has produced lots and lots of nice randomness (great terminology, huh? Like I said, I’m not a cryptanalyst or a mathematician!)

So, here’s the Crazy Plan for detecting encrypted traffic:

  1. Sample X contiguous bytes of a given session (maybe twice, once for src->dst and once for dst->src). A few kilobytes ought to be enough to get an idea of the level of randomness we’re looking at.
  2. Make your X-byte block start a little way into the session, so that we don’t include any plaintext in the session startup.
  3. Strip off the frame/packet headers (ethernet, IP, TCP, UDP, ESP, whatever) so that you’re only looking at the packet payload.
  4. Perform your frequency analysis of your chunk of payload, and “measure the resultant flatness”.
  5. Your “measure of flatness” equates to the “potential likelihood that this is encrypted”.

Perhaps one could assess the measure of flatness by calculating the standard deviation of the character frequencies? Taking the Truecrypt example, this is going to be pretty close to zero; the TIFF example is going to yield a much higher standard deviation.
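To make that idea concrete, here's a toy Python sketch of the standard-deviation measure (my own code, with os.urandom standing in for ciphertext and a repeated string standing in for plaintext):

```python
import os
import statistics

def flatness(data: bytes) -> float:
    """Population standard deviation of byte frequencies; 0.0 is perfectly flat."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    frequencies = [count / len(data) for count in counts]
    return statistics.pstdev(frequencies)

random_sample = os.urandom(64 * 1024)          # stand-in for ciphertext
text_sample = b"the quick brown fox " * 3277   # stand-in for plaintext
print(flatness(random_sample) < flatness(text_sample))  # True: random data is "flatter"
```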

Assuming what I’ve babbled on about here is valid, wouldn’t it be great to get this into Sguil? If SANCP or a Snort pre-processor could perform this kind of sampling, you’d be able to execute some SQL like this:

select [columns] from sancp where src_randomness < 1 or dst_randomness < 1

…and you’d have a list of possibly encrypted sessions.

How’s that sound?

This post has been updated here.



Alec Waters is responsible for all things security at Dataline Software, and can be emailed at alec.waters(at)dataline.co.uk


4 Responses to “Detecting encrypted traffic with frequency analysis”

  1. vmforno Says:

    It sounds nifty.

    But now I’m thinking about the payload, and about which protocols usually allow data transfer inbound or outbound through a firewall – the ones I need to be concerned about – and all my thoughts point to HTTP, FTP, SMB and similar (the unencrypted ones found in enterprise environments).

    Then suddenly it all collapses: you get the same randomness with encrypted files, or even with steganography.

    I hope you understand what I mean – it was harder to explain this time. Just my 2c.

  2. Alec,

    unfortunately I think that Don Weber’s original article shows a slick histogram tool, but to my unscientific eye it has a pretty giant mathematical flaw.

    I believe the distilled point he’s making is that encrypted data is random. No argument there, but my first problem is that he’s also making the claim that it’s more random than compressed data. If that were true, the compressed data would still be compressible further (more on that in a sec).

    In order to support the claim, he’s comparing what appears to be an 800-900KB zip file to a ~200MB Truecrypt volume (estimates from taking the estimated mean Y-axis value and multiplying it by 256, for all 2^8 characters). I think what we’re actually seeing is the effect of a sample size that is 500x larger than the .zip, .gz, or .bz2 files.

    Now I should probably have tested this before I spoke out, but I’d wager that if you looked at histograms of a 200MB gzip file and a 200MB truecrypt volume, you would not be able to tell the difference. If you could detect a discernible pattern, in theory you’d be able to compress the file further.

    Assuming it was a valid method, if you had to wait until tens or hundreds of megabytes of encrypted traffic went across the wire before the pattern went “flat” enough, you’d be shutting the barn door well after the cows got out.

    If I’m missing something here, I’d definitely like to know, because a good deal of my current assumptions on crypto and compression are tied up in the previous two paragraphs.

    • Hi Dave, Victor,

      Thanks for the comments!

      I totally take your point about randomness in compressed vs encrypted files. I tried Don’s tool on a 200MB zip, but it choked :)

      Instead, a 50MB zip looks like this:

      If I AES encrypt that same file it looks like this:

      Perhaps the “flatness” of the histogram of a compressed file is a function of the compression utility used?

      In terms of volume needed to get meaningful results, you’re right – from my testing, you seem to need a few megabytes of sample data to get something “flat enough”.

      However, I wasn’t thinking of using this as a preventative measure (i.e., to stop stuff I thought might be encrypted) – instead, I wanted another way to express my network in a report or other visualisation. I tend to do a lot of retrospective security analysis, like looking through reports that describe my network in terms of ports, protocols, IP addresses, volumes, event types, etc.; they’re usually pretty good at showing up interesting stuff. Some kind of report showing me when, where, and how much “possibly encrypted” traffic exists on my network would definitely be interesting to me.

      I’m apparently not the first to look at this. Julien Olivain has a package called net-entropy, although it’s not available for download at the moment. I’ll try to contact him; I’d love to experiment with it.
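      For what it’s worth, the usual way to quantify this kind of randomness is Shannon entropy rather than histogram flatness – presumably something like what net-entropy measures, though I haven’t seen its internals. A toy Python version (my own sketch, not net-entropy’s code):

```python
import math
import os

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 means maximally random."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(round(shannon_entropy(os.urandom(1 << 20)), 1))  # ~8.0 for random data
print(shannon_entropy(b"aaaabbbb" * 1000))             # 1.0: only two symbols
```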

      alec

      • Alec, Thanks for compiling the new graphs. The AES is clearly flatter at 50MB than the ZIP, so this is quite interesting. I may run a quick bzip2 and gzip to experiment some with this myself. net-entropy also looks quite interesting. I agree with you that identifying unexpected encryption channels in network traffic definitely has value, whether proactive or retroactive.
