Adventures in PIX Land

One of my job functions is supporting a mesh of VPN tunnels that crisscross the country for various clients using various types of network hardware.  Most commonly it involves a Cisco PIX or ASA in a central office and a Cisco SOHO router at a user’s house.  The Internet connection at the user’s house is generally residential DSL or cable and can be slightly unreliable.  Dropped connections can leave the SOHO router in various states of confusion, but running a “clear crypto”-type command on both ends to delete the security association and re-establish the tunnel usually does the trick.  It’s a minor annoyance for the users and for me, but they’re willing to put up with it for the opportunity to work from home.  This week I came across what I thought was another one of these issues, but it turned out to be a lot more.  It ended up testing my troubleshooting skills, patience, and sanity.

The most common application running across the VPN tunnels is voice.  Users working remotely generally have IP phones connected back to the central office over the tunnel, which lets them take calls coming in to the central office as if they were there.  Voice is by nature one of the most connection-sensitive apps to run across a network, so if there’s something wrong with a user’s Internet connection, their phone is usually the first indicator, and that’s exactly how this issue was reported to me.  When I spoke to the user they said that their phone wasn’t working but their internal instant messaging client was.  That was odd, considering both are reached across the same tunnel.  I pinged the phone system from the user’s desktop and got no replies.  I pinged the IM server and got replies.  I thought, “OK, something’s up with the tunnel.  Clear both ends and let it re-establish.”  I did that and had the same problem.  I checked the configs on both ends and nothing had changed over the past few weeks.  I rebooted the user’s router, still nothing.  Time to break out my troubleshooting setup.

For troubleshooting firewall issues I VPN in to the PIX in question, turn on syslog debugging, and point the syslog messages out the “outside” interface of the firewall to the IP my VPN client has.  I use tftpd32 as my syslog server on my machines, which nicely dumps the syslogs out to a text file for me.  I then use cygwin to “tail -f” the syslog output file and pipe that through grep to grab the relevant info I’m looking for (big thanks to Peet G. for showing me how to do this when I was a young, inexperienced network analyst).

I SSH’d into both the firewall and the router and turned up some debugging on the router, specifically “debug crypto isakmp” and “debug crypto ipsec”.  I then sent a round of continuous pings from the client to the subnet that wasn’t working and followed the traffic hop by hop.  On the router I saw it match the access list configured for the tunnel, bypass NAT, and get encrypted (by looking at the counters in “show crypto ipsec sa detail”).  On the PIX I saw it get decrypted (using the same command as on the router).  I saw the return traffic come back to the PIX from the server, match the access list for the tunnel going back out to the user, and bypass NAT.  But the detail for the IPsec SA for this specific access-list line showed:

#pkts encaps: 0, #pkts encrypt: 0, #pkts digest: 0

The detail for the access list line that was working showed packets being decrypted and encrypted.
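If you have a lot of tunnels to eyeball, the comparison above can be scripted.  Here is a minimal Python sketch that scans saved SA output for entries that are decrypting traffic but never encrypting any in return, which is the symptom I was looking at.  The sample text and its addresses are made up for illustration, modeled on the counter line quoted above, and the helper names (parse_sas, one_way_sas) are my own.  Real output has more fields and the layout can vary between software versions, so treat the patterns as assumptions to adjust.

```python
import re

# Hypothetical, trimmed-down SA output; the addresses and counts are made up.
SAMPLE = """\
local ident (addr/mask/prot/port): (10.1.1.0/255.255.255.0/0/0)
remote ident (addr/mask/prot/port): (192.168.50.0/255.255.255.0/0/0)
#pkts encaps: 0, #pkts encrypt: 0, #pkts digest: 0
#pkts decaps: 412, #pkts decrypt: 412, #pkts verify: 412
local ident (addr/mask/prot/port): (10.1.2.0/255.255.255.0/0/0)
remote ident (addr/mask/prot/port): (192.168.50.0/255.255.255.0/0/0)
#pkts encaps: 388, #pkts encrypt: 388, #pkts digest: 388
#pkts decaps: 390, #pkts decrypt: 390, #pkts verify: 390
"""

# Assumed line shapes: "local ident (...): (proxy)" and "#pkts name: N".
IDENT_RE = re.compile(r"(local|remote) ident .*?:\s*\(([^)]+)\)")
COUNTER_RE = re.compile(r"#pkts (\w+):\s*(\d+)")


def parse_sas(text):
    """Group the counters under the local/remote ident pair they follow."""
    sas = []
    current = None
    for line in text.splitlines():
        match = IDENT_RE.search(line)
        if match:
            side, ident = match.groups()
            if side == "local":
                # A "local ident" line starts a new SA entry.
                current = {"local": ident, "remote": None, "counters": {}}
                sas.append(current)
            elif current is not None:
                current["remote"] = ident
            continue
        if current is not None:
            for name, value in COUNTER_RE.findall(line):
                current["counters"][name] = int(value)
    return sas


def one_way_sas(sas):
    """Return SAs that decrypt inbound traffic but never encrypt outbound."""
    return [
        sa for sa in sas
        if sa["counters"].get("decaps", 0) > 0
        and sa["counters"].get("encaps", 0) == 0
    ]


if __name__ == "__main__":
    for sa in one_way_sas(parse_sas(SAMPLE)):
        print("one-way SA:", sa["local"], "->", sa["remote"])
```

Run against a saved copy of the command output, anything it prints is an SA worth a closer look.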

Immediately I thought, “routing issue”: the traffic must be going somewhere other than the outside interface, so it never reaches the tunnel.  I checked the routing table on the PIX and nothing in there indicated that traffic to the user’s subnet would be going anywhere but outside.  Just to reaffirm this, I turned on reverse route injection on the user’s router and re-established the tunnel.  Now I saw a specific entry in the routing table for the user’s subnet, but traffic still wasn’t passing.
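For anyone who wants to sanity-check a routing table the same way, the check boils down to a longest-prefix match: of all the routes that contain the destination, the most specific one decides the egress interface.  Below is a small Python sketch of that logic using the standard ipaddress module.  The routes, addresses, and the egress_interface helper are all hypothetical examples, not my actual config, and a real firewall has other things (policy, NAT, crypto maps) that can also influence where a packet ends up.

```python
import ipaddress

# Hypothetical routing table as (prefix, egress interface) pairs.
ROUTES = [
    ("0.0.0.0/0", "outside"),        # default route
    ("10.1.0.0/16", "inside"),       # central office LAN
    ("192.168.50.0/24", "outside"),  # what an injected route for the remote subnet might look like
]


def egress_interface(dest, routes):
    """Return the interface of the most specific route containing dest."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, iface in routes:
        net = ipaddress.ip_network(prefix)
        # Keep the matching route with the longest prefix seen so far.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


if __name__ == "__main__":
    # Remote user's phone (made-up address): should leave via "outside".
    print(egress_interface("192.168.50.20", ROUTES))
```

If the answer for the remote subnet is anything other than the interface the crypto map is applied to, the traffic will never be encrypted, which is why this was my first suspect.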

By this time the day was over and the user had been using a cell phone for taking calls.  Not ideal, they said, but they were happy to work that way until I had the issue figured out.  I explained the issue to a few of my networking peers and most of them said the same thing: “Something must be wrong in your configs.”  Only one of them said, “Have you tried rebooting the PIX?”  Reboot a PIX that’s been up for almost a year with no issues?  Surely that couldn’t be it.  Rebooting is typically a knee-jerk reaction to a problem, and while it may work, it also wipes away traces of the actual problem, only for you to see it again later.  Late that night I confirmed everyone was logged out and rebooted the firewall.  A minute later it came back up.  I logged in to a server on the main network (one the user hadn’t been able to reach), pinged the IP address of their phone, and it WORKED!  So hopefully the firewall just needed a reboot.  If the problem comes up again I’ll be able to quickly determine whether it’s the same issue and get TAC on the line.  Hopefully this helps anyone out there who’s having the same issue.  I spent the better part of a day troubleshooting this and was seriously beginning to doubt some of my professional skills.

~ by jverburg on March 20, 2009.
