Yesterday morning I was notified by the monitoring platform running at a customer site (Nagios, actually) that two distinct DNS services (two BIND instances running on two different Linux hosts, with different distributions and different BIND versions) went down at almost the same time:
[08-05-2015 02:09:33] SERVICE ALERT: srv-*****;dns;CRITICAL;HARD;3;CRITICAL - Plugin timed out while executing system call
[08-05-2015 02:08:33] SERVICE ALERT: srv-*****;dns;CRITICAL;SOFT;2;CRITICAL - Plugin timed out while executing system call
[08-05-2015 02:07:33] SERVICE ALERT: srv-*****;dns;CRITICAL;SOFT;1;CRITICAL - Plugin timed out while executing system call
[...]
[08-05-2015 02:07:43] SERVICE ALERT: fw-*****;bind;CRITICAL;HARD;3;PROCS CRITICAL: 0 processes with command name 'named'
[08-05-2015 02:06:43] SERVICE ALERT: fw-*****;bind;CRITICAL;SOFT;2;PROCS CRITICAL: 0 processes with command name 'named'
[08-05-2015 02:05:44] SERVICE ALERT: fw-*****;bind;CRITICAL;SOFT;1;PROCS CRITICAL: 0 processes with command name 'named'
Even though srv-***** was a fairly old server, running a BIND version that was not really up to date, fw-***** had been recently reinstalled and was largely current with updates. So the fact that both BIND instances went down at almost the same time was a clear sign that further investigation was needed.
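For the record, the two alerts above are typical Nagios fare: a DNS query probe that timed out, and a process check that found no running named. Just to illustrate the idea, here is a minimal Python sketch of a Nagios-style DNS check (the server address, query name and timeout are illustrative placeholders, not the actual plugin or configuration used at the site):

#!/usr/bin/env python3
# Minimal sketch of a Nagios-style DNS check, mirroring the "dns" service
# alert above: send one UDP query to the target server and return CRITICAL
# on timeout. Server, query name and timeout are placeholders.
import socket
import struct
import sys

def build_query(qname):
    # Bare-bones DNS query packet: fixed ID, RD flag set, one question,
    # A record, IN class.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    labels = b"".join(bytes([len(p)]) + p.encode() for p in qname.split("."))
    return header + labels + b"\x00" + struct.pack(">HH", 1, 1)

def check_dns(server, qname="example.com", timeout=5.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(qname), (server, 53))
        sock.recvfrom(512)
        return 0, f"DNS OK: {server} answered for {qname}"
    except socket.timeout:
        return 2, f"DNS CRITICAL: no answer from {server} within {timeout}s"
    finally:
        sock.close()

if __name__ == "__main__":
    code, message = check_dns(sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1")
    print(message)
    sys.exit(code)  # Nagios convention: 0=OK, 2=CRITICAL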
As for the logging infrastructure, every BIND instance, as well as every host running at the site, sends log messages to a central server. In BIND's case, DNS queries are logged too.
A quick look at the query log of both servers turned up this:
Aug 5 02:03:42 RELAY=NONE TYPE=SRV HOST=srv-***** LEVEL=local6.info TAG=named[24257]: MESSAGE= 05-Aug-2015 02:03:42.542 client 162.253.67.219#38969: query: version.bind CH TXT +
Aug 5 02:03:42 RELAY=NONE TYPE=SRV HOST=srv-***** LEVEL=local6.info TAG=named[24257]: MESSAGE= 05-Aug-2015 02:03:42.717 client 162.253.67.219#38969: query: foo.bar ANY TKEY +
[...]
Aug 5 02:03:23 RELAY=LGW1 TYPE=FW HOST=fw-***** LEVEL=local6.info TAG=named[18714]: MESSAGE= 05-Aug-2015 02:03:23.030 client 162.253.67.219#44649: query: version.bind CH TXT +
Aug 5 02:03:23 RELAY=LGW1 TYPE=FW HOST=fw-***** LEVEL=local6.info TAG=named[18714]: MESSAGE= 05-Aug-2015 02:03:23.205 client 162.253.67.219#44649: view external: query: foo.bar ANY TKEY +
So in both cases, the last query received by BIND came from the client 162.253.67.219. And while the purpose of the first query was clear to me (version.bind is a common way to fingerprint a DNS server), the second one sounded really weird (I have no problem admitting I had never heard of the TKEY record type before!). For the record, TKEY (RFC 2930) is a meta record type used to negotiate shared secrets for TSIG transaction security.
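Since every query is logged centrally, spotting this pattern across all servers is straightforward. As an illustration, here is a minimal Python sketch that scans a log file for TKEY queries and reports the client addresses (the file path is hypothetical, and the regex is based on the log format shown above; adjust both to your own layout):

#!/usr/bin/env python3
# Minimal sketch: scan a centralized syslog file for BIND TKEY queries.
# The path and the "query: <name> <class/type>" format are assumptions
# based on the excerpt above.
import re
import sys

LOGFILE = "/var/log/central/bind-queries.log"  # hypothetical path

# Matches lines like: "client 162.253.67.219#38969: query: foo.bar ANY TKEY +"
QUERY_RE = re.compile(
    r"client (?P<ip>[\d.]+)#\d+: (?:view \S+: )?query: (?P<qname>\S+) (?P<rest>.*)"
)

def suspicious_queries(path):
    with open(path, errors="replace") as fh:
        for line in fh:
            m = QUERY_RE.search(line)
            if m and "TKEY" in m.group("rest"):
                yield m.group("ip"), m.group("qname"), line.rstrip()

if __name__ == "__main__":
    for ip, qname, line in suspicious_queries(sys.argv[1] if len(sys.argv) > 1 else LOGFILE):
        print(f"TKEY query from {ip} for {qname}:")
        print(f"  {line}")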
Further investigation of the system log messages turned up this:
Aug 5 02:03:42 RELAY=NONE TYPE=SRV HOST=srv-***** LEVEL=daemon.crit TAG=named[24257]: MESSAGE= message.c:2230: REQUIRE(*name == ((void *)0)) failed
Aug 5 02:03:42 RELAY=NONE TYPE=SRV HOST=srv-***** LEVEL=daemon.crit TAG=named[24257]: MESSAGE= exiting (due to assertion failure)
[...]
Aug 5 02:03:23 RELAY=LGW1 TYPE=FW HOST=fw-***** LEVEL=daemon.crit TAG=named[18714]: MESSAGE= message.c:2311: REQUIRE(*name == ((void *)0)) failed
Aug 5 02:03:23 RELAY=LGW1 TYPE=FW HOST=fw-***** LEVEL=daemon.crit TAG=named[18714]: MESSAGE= exiting (due to assertion failure)
So now it was clear that both BIND instances had received a DNS query for a TKEY record, and that those queries were strange/wrong/bad enough to trigger an internal assertion failure, making BIND shut itself down! (REQUIRE is one of BIND's internal sanity-check macros: when such an assertion fails, named deliberately exits rather than keep running in a potentially corrupted state.) Wow! Really interesting, BTW.
The logged messages were clear and distinctive enough to allow some serious web searching. A 30-second Google search with the proper terms turned up this article on the CloudFlare blog: "A deep look at CVE-2015-5477 and how CloudFlare Virtual DNS customers are protected". It had been published on August 4th and explicitly mentioned CVE-2015-5477.
So the whole scenario was clear: we had been hit by someone successfully exploiting a very recent vulnerability in BIND!
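As a side note, a first rough way to tell where a given host stands is to look at the running BIND version: according to ISC's advisory, CVE-2015-5477 is fixed in 9.9.7-P2 and 9.10.2-P3. Here is a minimal Python sketch of such a check (it assumes named is on the PATH; keep in mind that distribution packages often backport security fixes without changing the upstream version string, so treat its output as a hint, not a verdict):

#!/usr/bin/env python3
# Minimal sketch: compare the locally installed BIND version against the
# releases that fix CVE-2015-5477 (9.9.7-P2 and 9.10.2-P3, per ISC's
# advisory). Distribution packages may backport the fix without bumping
# the version string, so this is only a hint.
import re
import subprocess

FIXED = {(9, 9): "9.9.7-P2", (9, 10): "9.10.2-P3"}

def named_version():
    # "named -v" prints something like "BIND 9.9.5-3ubuntu0.4-Ubuntu"
    out = subprocess.run(["named", "-v"], capture_output=True, text=True).stdout
    m = re.search(r"(\d+)\.(\d+)\.(\d+)(?:-P(\d+))?", out)
    if not m:
        raise RuntimeError(f"could not parse version from: {out!r}")
    major, minor, patch, plevel = m.groups()
    return (int(major), int(minor), int(patch), int(plevel or 0)), out.strip()

if __name__ == "__main__":
    version, raw = named_version()
    fixed = FIXED.get(version[:2])
    print(f"installed: {raw}")
    if fixed:
        print(f"first fixed release on this branch: {fixed}")
    else:
        print("unknown branch: check the ISC advisory for CVE-2015-5477")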
Aside from the technical issues presented above, this very experience raises (one more time…) lots of other technical and organizational issues. For a start, in a near-perfect world:
- first of all, every IT staff should have a security policy in place. The staff should include someone 100% focused on IT security, ensuring that every discovered vulnerability is properly handled on running systems;
- log messages reaching a central log server with severity "crit" should not go unnoticed by the IT staff. This is relatively easy to achieve in really small environments, but not so easy once you start dealing with tens of servers and hundreds of services (a minimal sketch of such a watcher follows this list);
- servers/services need to be kept updated. This might sound really obvious (and I'm the first one still dealing, in 2015, with several Ubuntu 8.04, Ubuntu 10.04 and Debian Lenny boxes), but it's really needed (I'm writing this for myself!).
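As for the second point, here is a minimal Python sketch of the kind of watcher I have in mind: it follows a centralized log file and flags every message carrying a "crit" (or worse) severity. The file path and the LEVEL=facility.severity field are assumptions taken from the log excerpts above, and the alert action is just a placeholder for a real notifier:

#!/usr/bin/env python3
# Minimal sketch: follow a centralized log file and flag every message
# carrying a "crit" (or worse) severity. The LEVEL=facility.severity field
# is an assumption taken from the log excerpts above; the path and the
# alert action (here just stderr) are placeholders for a real setup.
import re
import sys
import time

LOGFILE = "/var/log/central/all.log"          # hypothetical path
SEVERITIES = ("crit", "alert", "emerg")       # syslog severities worth waking up for
LEVEL_RE = re.compile(r"LEVEL=\w+\.(\w+)")

def follow(path):
    """Yield new lines appended to the file, tail -f style."""
    with open(path, errors="replace") as fh:
        fh.seek(0, 2)  # start at the end of the file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip()

if __name__ == "__main__":
    for line in follow(sys.argv[1] if len(sys.argv) > 1 else LOGFILE):
        m = LEVEL_RE.search(line)
        if m and m.group(1) in SEVERITIES:
            print(f"ALERT: {line}", file=sys.stderr)  # hook a real notifier here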
Anyway, since I don't live in a perfect world (I surely don't, and I bet you don't either), I also need to say that:
- having human resources dedicated to IT security, especially if you're not running an IT-security focused business, is expensive. Really expensive. And at least here, where I live and work, it's really hard (nearly impossible) to find companies making that kind of investment. This is a pity, but it's the reality, at least here;
- I'm deeply involved in log-management infrastructures, and I'm confident that problem 2) can be solved quite easily (once you know how!). So stay tuned, as some other (hopefully interesting) blog posts will appear here later on;
- even though it's easy to say that systems need to be kept updated, it's not that simple. Definitely not. The older your systems are, the harder it gets. And it gets even harder if you're one of those "elderly" sysadmins who, like me, have still not succeeded in adopting system-automation tools across the board (I've embraced the DevOps state of mind quite recently, BTW). So if you're starting from scratch, please keep this point in mind; but if you've already been managing an infrastructure for the last 5 or 10 years… I won't blame you if some of your systems are outdated!
That’s all.
Stay tuned as… more content is on its way!