It’s not always possible or feasible to collect the four types of information useful for conducting NSM, for the usual reasons (“cost of software/hardware/people/time” being near the top of the list). However, this doesn’t mean that the game is lost before it’s even begun – Sguil, for example, doesn’t have any facility for statistical alerts, but that doesn’t mean that it’s not a powerful tool.
The following tale took place where only session and alert data were available. Despite this apparent lack of information, we were able to solve the mystery without the intervention of Scooby and the gang, and we were able to dodge the temptation to take an IPS alert at face value (a clear case of defensive avoidance!).
The network in question was purely a client site; there were no public servers to worry about. Network security was pretty formulaic:
There’s a PIX doing the standard firewall/NAT job, and an inline IPS scrutinising everything that goes in or out. The logging level on the PIX is turned all the way up to “debugging”, so we get an export of session data in the form of messages like PIX-6-302013/PIX-6-302014 etc. Both the IPS and the PIX are reporting to a central log collector, a Cisco CS-MARS in this case.
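To give a feel for what "session data from syslog" means in practice, here is a minimal Python sketch that pairs up connection-built (302013) and teardown (302014) messages by connection ID. The sample lines are made up for illustration and only follow the general shape of these messages; the regular expressions, field names and the `parse_sessions` helper are my own assumptions, handle outbound connections only, and would need adjusting against real logs.

```python
import re

# Illustrative lines in the general shape of PIX 302013/302014 messages.
# All addresses, ports and connection IDs are invented for this example.
SAMPLE_LOG = """\
%PIX-6-302013: Built outbound TCP connection 1001 for outside:198.51.100.7/34567 (198.51.100.7/34567) to inside:10.1.1.20/2211 (203.0.113.5/40211)
%PIX-6-302014: Teardown TCP connection 1001 for outside:198.51.100.7/34567 to inside:10.1.1.20/2211 duration 0:00:30 bytes 0 SYN Timeout
"""

# For an outbound connection, the "for" address is the remote end and the
# "to" address is the internal client.
BUILT = re.compile(
    r"%PIX-6-302013: Built outbound TCP connection (?P<conn_id>\d+) "
    r"for [^:\s]+:(?P<dst_ip>[\d.]+)/(?P<dst_port>\d+) \([^)]*\) "
    r"to [^:\s]+:(?P<src_ip>[\d.]+)/(?P<src_port>\d+)"
)
TEARDOWN = re.compile(
    r"%PIX-6-302014: Teardown TCP connection (?P<conn_id>\d+) .*?"
    r"duration (?P<duration>[\d:]+) bytes (?P<bytes>\d+)"
)


def parse_sessions(log_text):
    """Return one dict per connection, with byte count if a teardown was seen."""
    sessions = {}
    for line in log_text.splitlines():
        built = BUILT.search(line)
        if built:
            sessions[built["conn_id"]] = {
                "src_ip": built["src_ip"],
                "dst_ip": built["dst_ip"],
                "dst_port": int(built["dst_port"]),
                "bytes": None,  # unknown until the teardown message arrives
            }
            continue
        teardown = TEARDOWN.search(line)
        if teardown and teardown["conn_id"] in sessions:
            sessions[teardown["conn_id"]]["bytes"] = int(teardown["bytes"])
    return list(sessions.values())


for session in parse_sessions(SAMPLE_LOG):
    print(session)
```

Each resulting record carries source, destination, port and byte count, which is all the later analysis in this tale relies on.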
The trigger for this investigation was an alert from the IPS. Lots of them, in fact. The signature that fired was one we’d never seen before, which either means another class of false positive to tune out or that something interesting is actually happening.
Even more interesting was the fact that the signature wasn’t just your typical brute-force pattern matching job – it was one of Cisco’s “anomaly detection” signatures that fires on behaviour observed over time. The signature denotes a TCP scanner hard at work scanning external IP addresses. The signature writeup is frustratingly lacking in detail; what it means when it says “scanning” would be a useful thing to know, for starters.
Never mind. NSM Ninjas don’t need vendor writeups. We can reverse engineer a signature’s firing conditions ourselves.
Looking at the alerts we’d got, we can see:
- There were zillions of alerts over a five-ish minute period.
- The alerts cite five distinct internal IP addresses as being those doing the “scanning”.
- At the end of the five-ish minutes, the alerts stop as abruptly as they started.
Hmm. Let me see if I’ve got this straight. Five of my hosts all start “scanning” at the same time, they carry on scanning for five minutes, and then they all stop at the same time?
Maybe we really do have a worm outbreak here. But why only five hosts? Why did they stop at the same time? Is there a command and control element at work here? Are my hosts pwned? Do I trust the IPS alerts and start rebuilding the “compromised” hosts? Questions pour down like rain, and we’re in for some serious flooding unless we wheel out the umbrella-and-wellies combo that is NSM and Vigilance to Detail.
First, let’s see exactly what these hosts were doing during this five-minute window. We’ve got no full-content capture here, remember, so we’re going to have to hit the session data from the PIX pretty hard. Using this, we can see that each of the five hosts tried to contact between two and three hundred non-local IP addresses in our five-minute material time frame (MTF). Taken on its own, this looks like worm behaviour. There’s a small degree of crossover between the pools of target IP addresses, but there’s no one address that they all have in common (i.e., there’s no single command and control channel).
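The two checks just described (how many distinct destinations each host contacted, and whether any destination is common to all of them) boil down to a little set arithmetic. This is only a sketch of the idea, not the query we actually ran on the CS-MARS; the function names and the tiny sample data set are invented for illustration.

```python
from collections import defaultdict


def fan_out(sessions):
    """Map each source IP to the set of distinct destination IPs it contacted."""
    targets = defaultdict(set)
    for src, dst in sessions:
        targets[src].add(dst)
    return dict(targets)


def common_destinations(targets):
    """Destinations contacted by every source: candidate C&C addresses."""
    pools = list(targets.values())
    return set.intersection(*pools) if pools else set()


# Made-up (source, destination) pairs; real data would come from session logs.
sessions = [
    ("10.1.1.20", "198.51.100.7"),
    ("10.1.1.20", "203.0.113.9"),
    ("10.1.1.21", "203.0.113.9"),
    ("10.1.1.21", "192.0.2.44"),
    ("10.1.1.22", "192.0.2.44"),
    ("10.1.1.22", "198.51.100.80"),
]

targets = fan_out(sessions)
for src, dsts in sorted(targets.items()):
    print(src, "contacted", len(dsts), "distinct destinations")
print("common to all sources:", common_destinations(targets))
```

In the sample data there is pairwise crossover between the pools but an empty intersection overall, which mirrors what we saw: some shared targets, but no single address that every host talked to.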
Next, we can check the destination port – if we’re dealing with a worm, this will be a good clue to which one it is. All the ports were TCP, but the port numbers were random. All over the place. This doesn’t seem like worm behaviour to me – random IP addresses I can understand, but random ports make little sense.
Now we can look at data volumes – how much data did our “scanners” actually send? We get another interesting answer – not a single byte of payload was carried. This could possibly be explained by the random nature of the destination ports – given the utter shotgun nature of the “scanning”, I guess it’s not too likely that we’re going to hit an open port.
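The port and byte-count checks can be summarised in a few numbers per host. The sketch below is a rough illustration only: `port_profile`, its field names and both sample data sets are assumptions of mine. It contrasts a worm-like profile (most traffic aimed at one port) with the shotgun pattern described above.

```python
from collections import Counter


def port_profile(sessions):
    """Summarise (dst_port, bytes) pairs: port spread and total payload."""
    if not sessions:
        raise ValueError("no sessions to profile")
    ports = Counter(dst_port for dst_port, _ in sessions)
    _, top_count = ports.most_common(1)[0]
    return {
        "distinct_ports": len(ports),
        # Share of sessions aimed at the single most popular port.
        "top_port_share": top_count / len(sessions),
        "total_bytes": sum(nbytes for _, nbytes in sessions),
    }


# Invented data: a worm hammering one service versus our "scanners".
worm_like = [(445, 0)] * 8 + [(139, 0)] * 2
observed = [(port, 0) for port in
            (34567, 1821, 50233, 9044, 27015, 61002, 4411, 38999, 12007, 55310)]

print("worm-like:", port_profile(worm_like))
print("observed: ", port_profile(observed))
```

A high `top_port_share` points towards something hunting for a particular service; a share close to one divided by the session count, combined with zero bytes, is the odd pattern we were looking at.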
So we have a frenzy of totally ineffective scanning, with the attackers apparently synchronised somehow. There’s not too much more we can learn from the session data at this point, so we have to look for other clues. The plan is to see what kinds of events the PIX was splurting out in the thirty seconds before and after the first IPS alarm – we’re after the catalyst for the scanning, if there is one.
All the while, I can’t help but think I’ve seen these five source IP addresses together before, but I can’t quite put my finger on it…
Anyway, back to the catalyst seeking. The ad-hoc query interface on the CS-MARS is pretty reasonable, and it’s really easy to ask it for a list of event types seen from a particular device for a particular MTF. Taking the start of the scanning as the start point and working from T-30 seconds to T+330, we notice a few things:
- There seems to be a big gap in the events output by the PIX – it’s been totally silent during the initial period of scanning.
- During the latter phases of scanning, there were loads of these messages logged: “%PIX-3-305006: outbound portmap translation creation failed”. These are raised when the PIX can’t create a NAT translation, due to lack of resources, or a TCP protocol violation, etc.
- We also see a single instance of this: “%PIX-6-199002: Startup completed. Beginning operation”. This means that the PIX rebooted for some reason.
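A silent period like the one in the first observation is easy to miss by eye but simple to find mechanically: sort the event timestamps and look for consecutive events that are further apart than some threshold. This is a minimal sketch; the `find_gaps` helper, the 30-second threshold and the sample timestamps are all assumptions for illustration rather than anything taken from the CS-MARS.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=30)):
    """Return (before, after) pairs where consecutive events exceed threshold."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


# Made-up event times: steady chatter, a two-minute silence, then more events.
base = datetime(2000, 1, 1, 12, 0, 0)
events = [base + timedelta(seconds=s) for s in (0, 5, 10, 130, 135)]

for before, after in find_gaps(events):
    print("silent from", before.time(), "to", after.time(),
          "for", (after - before).total_seconds(), "seconds")
```

For a device logging at “debugging” level, even a short silence is suspicious, so a fairly tight threshold is reasonable.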
We can express this as a timeline:
Finally, I remember where I’ve seen the five IP addresses before, and all the pieces fall into place.
The five IP addresses are those of people who use Skype. Whilst it obviously has great merit as a piece of communications software, its use of apparently random destination IP addresses and ports plays merry hell with NSM reports based upon session data. For this reason, I run a daily report of Skype users so that I can exclude them from these reports if I need to (it’s easy to spot a Skype client starting up because it checks to see if it’s running the latest version – I look for which IP addresses are making the check).
After piecing together all the evidence, we come up with this:
- Five Skype clients start up. They connect to many, many destination IP addresses on random ports.
- For whatever reason, the PIX crashes and reloads.
- The Skype clients don’t know this, and try to maintain their existing TCP connections (they must do some kind of keepalive).
- After a minute or two, the PIX has finished reloading.
- Whilst this is going on, the Skype clients are still trying their keepalives. Once the PIX is working again, the keepalives still fail because the PIX is a stateful firewall. Each keepalive only has the ACK flag set because it’s part of an existing session as far as Skype is concerned. However, the PIX hasn’t seen the start of the TCP session and therefore has no “state container” for it. This is the reason for all the “outbound portmap translation creation failed” messages, and also the reason why we didn’t see any actual payload transferred – the PIX dropped all of the keepalives.
- Meanwhile, the IPS (sitting in between the Skype clients and the PIX) is seeing all of this and is merrily firing its “External Scanner” signature.
- Eventually, the session timeout on all the Skype clients fires, and they all declare their existing sessions dead and re-establish them from scratch with SYN.
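The heart of this sequence is that a stateful firewall only passes mid-session packets for connections it has seen being opened, and a reload wipes that memory. The toy model below illustrates just that one idea. It is emphatically not how a PIX is implemented (a real firewall also tracks NAT translations, sequence numbers, timeouts and much more), and the class, method names and addresses are invented.

```python
class ToyStatefulFirewall:
    """Toy model: permit a packet only if it opens or belongs to a known flow."""

    def __init__(self):
        self.state = set()  # flows whose opening SYN has been seen

    def reload(self):
        """Simulate a crash and reboot: all connection state is lost."""
        self.state.clear()

    def handle(self, flow, flags):
        if flags == {"SYN"}:
            self.state.add(flow)  # a new session creates state
            return "permit"
        # Anything else must belong to a session we already know about.
        return "permit" if flow in self.state else "drop"


fw = ToyStatefulFirewall()
flow = ("10.1.1.20", 2211, "198.51.100.7", 34567)  # made-up 4-tuple

print("SYN before reload:", fw.handle(flow, {"SYN"}))
print("ACK before reload:", fw.handle(flow, {"ACK"}))
fw.reload()
print("ACK after reload: ", fw.handle(flow, {"ACK"}))  # the failed keepalive
print("SYN after timeout:", fw.handle(flow, {"SYN"}))  # client starts afresh
print("ACK afterwards:   ", fw.handle(flow, {"ACK"}))
```

The ACK-only keepalive sent after the reload is dropped because the firewall no longer has any record of the session; only a fresh SYN gets things moving again, which is what the Skype clients eventually did.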
So, there we have it. The IPS alerts were false positives in this instance, caused by a tenacious piece of software and a flaky piece of hardware. Our lack of full-content capture wasn’t a problem – we solved the mystery without it, and even if we’d had it there wouldn’t have been anything to see in this case. Another victory for the umbrella-and-wellies combo!
Alec Waters is responsible for all things security at Dataline Software, and can be emailed at alec.waters(at)dataline.co.uk