
"The key feature is a built-in custom TCP/IP stack capable of handling millions of DNS queries-per-second per CPU core."

DNS uses UDP primarily. I suspect the author meant "UDP/TCP/IP" by "TCP/IP".

Years ago, I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex. So this claim raises a number of concerns, and the stack will need to be audited before being used in production.



TCP/IP is the common name for the Internet Protocol Suite, of which UDP is a part. Also, since many mobile clients tend to issue requests over TCP (it being more reliable on mobile), and since many responses can now be larger than 512 bytes while EDNS is not really a standard, DNS over TCP will probably overtake UDP quite quickly.

The text below is from the RFC, and no, it does not relate to zone transfers but to normal DNS queries :)

   All general-purpose DNS implementations MUST support both UDP and TCP
   transport.

   o  Authoritative server implementations MUST support TCP so that they
      do not limit the size of responses to what fits in a single UDP
      packet.

   o  Recursive server (or forwarder) implementations MUST support TCP
      so that they do not prevent large responses from a TCP-capable
      server from reaching its TCP-capable clients.

   o  Stub resolver implementations (e.g., an operating system's DNS
      resolution library) MUST support TCP since to do otherwise would
      limit their interoperability with their own clients and with
      upstream servers.

   Stub resolver implementations MAY omit support for TCP when
   specifically designed for deployment in restricted environments where
   truncation can never occur or where truncated DNS responses are
   acceptable.
And for the most important part :P

   Regarding the choice of when to use UDP or TCP, Section 6.1.3.2 of
   RFC 1123 also says:

      ... a DNS resolver or server that is sending a non-zone-transfer
      query MUST send a UDP query first.

   That requirement is hereby relaxed.  A resolver SHOULD send a UDP
   query first, but MAY elect to send a TCP query instead if it has good
   reason to expect the response would be truncated if it were sent over
   UDP (with or without EDNS0) or for other operational reasons, in
   particular, if it already has an open TCP connection to the server.
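The fallback logic described in the RFC text above can be sketched in a few lines: the TC (TrunCation) bit is bit 0x0200 of the 16-bit flags word in the DNS header, and a resolver that sees it set on a UDP response should retry over TCP. This is an illustrative sketch of that check, not any particular resolver's code:

```python
import struct

# DNS header layout (RFC 1035): 16-bit ID, then a 16-bit flags word
# in which 0x8000 is QR (response) and 0x0200 is TC (TrunCation).
TC_BIT = 0x0200

def should_retry_over_tcp(response: bytes) -> bool:
    """Return True if a UDP response is truncated and should be
    re-queried over TCP (RFC 1035 / RFC 5966 behaviour)."""
    if len(response) < 4:
        return False  # malformed; a real resolver would treat this as an error
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & TC_BIT)

# Crafted headers: ID=0x1234, flags QR|TC (0x8200) vs. just QR (0x8000).
truncated = struct.pack("!HHHHHH", 0x1234, 0x8200, 1, 0, 0, 0)
complete  = struct.pack("!HHHHHH", 0x1234, 0x8000, 1, 1, 0, 0)

print(should_retry_over_tcp(truncated))  # True
print(should_retry_over_tcp(complete))   # False
```

The "MAY elect to send a TCP query instead" relaxation just means a resolver can skip the UDP attempt entirely when it already expects this function to return True.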


Why would TCP DNS overtake UDP "quite quickly" when it hasn't done so in the past decade? What's meaningfully changed about DNS recently?


Good callout. Maybe parent is just loosely referring to larger TXT records (or maybe DNSSEC or IPv6?), etc. My guess is that even these are probably a relatively small percentage of overall DNS traffic, and that's not likely to change anytime soon.

So I can't imagine why records would all of a sudden exceed 512 bytes on average either.


UDP isn't as reliable on mobile connections, so many mobile clients issue a TCP DNS request at the same time as the UDP request rather than waiting out the casual five-second timeout. DNS records also seem to keep growing, and EDNS has not been adopted very well. Other things like Chrome's async DNS prefetch also seem to use TCP as much as UDP for some reason, especially to Google's DNS servers. The updated RFC mandates TCP support for regular DNS, and although I don't have a single reason (other than IPv6 records, TXT record use and DNSSEC), I have a strong feeling that mobile and browser optimizations are a good part of why.

When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media and third-party content you can easily get to 20+ distinct DNS lookups per page) rather than issue individual async DNS queries over UDP. This is both faster and, more importantly, more reliable for the next step, which is the TCP preconnect once the browser has resolved all the DNS records from the DOM, even before loading it fully.


When your browser does DNS prefetch from the DOM, it becomes much more efficient to open a single TCP connection and issue all of the DNS requests over it (and with CDNs, ads, captchas, social media and third-party content you can easily get to 20+ distinct DNS lookups per page) rather than issue individual async DNS queries over UDP.

How are there any fewer RTTs with TCP DNS than there would be with UDP? I'm not seeing the efficiency here.


There is a per-packet cost to processing requests. TCP can bundle several requests into one segment; UDP cannot. Is that cost large enough to matter? I don't know. But it includes evaluating the context of the requesting entity, which might not be meaningful for DNS queries. Maybe if the queries are related, some working-set caching would occur.

RTT might not improve at all, but lag might. Scripts often make the mistake of asking for information when they need it, instead of before they need it so it will be ready when needed. The suggested approach would pre-load the DNS info and might reduce lag.
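The mechanics behind the bundling claim: DNS over TCP frames each message with a 2-byte length prefix (RFC 1035 section 4.2.2), so several queries can be concatenated into one write and split apart on the other end. A small sketch with made-up payloads in place of real DNS messages:

```python
import struct

def frame(messages):
    """Concatenate DNS messages using the 2-byte length prefix
    defined for DNS over TCP (RFC 1035, section 4.2.2)."""
    return b"".join(struct.pack("!H", len(m)) + m for m in messages)

def unframe(stream: bytes):
    """Split a received TCP byte stream back into individual messages."""
    out, i = [], 0
    while i + 2 <= len(stream):
        (n,) = struct.unpack("!H", stream[i:i + 2])
        out.append(stream[i + 2:i + 2 + n])
        i += 2 + n
    return out

# Two (fake) queries batched into what could be a single TCP write:
queries = [b"query-for-example.com", b"query-for-cdn.example"]
wire = frame(queries)
assert unframe(wire) == queries
print(len(wire))  # 46: 4 bytes of framing plus the two 21-byte payloads
```

Whether both queries actually land in one segment depends on the sender's buffering (e.g. Nagle, discussed below), not on the framing itself.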


You're saying that TCP DNS routinely sends multiple requests in a single TCP segment?


TCP sends successive data elements together, at least as part of the Nagle algorithm. I have no idea if the way DNS uses TCP can trigger Nagle.

In fact my own UDP protocol does something similar to Nagle as well. There's no good reason UDP protocols can't pick and choose what features they include. But most don't.


I (a) don't think TCP DNS routinely stuffs two requests in a single TCP segment and (b) don't believe TCP DNS is ever more performant than UDP DNS, including on mobile --- UDP gets a head start from not having a 3WH, and doesn't have rate limiting, which TCP does. TCP headers are also much larger than UDP headers.

Even the reliability argument doesn't make sense. Yes, TCP is "reliable". But so is UDP DNS, and in exactly the same way: if a request or response is dropped, it's retransmitted.

Nagle, for what it's worth, is an HN contributor. You could just ask him. :)


Agreed on all counts. It's a stretch to imagine TCP is better at performance.

{edit} Though performance isn't really about wire time or packet size; it's about CPU time on either end plus buffering, including router time, since that's a CPU in the path.


This is RFC 5966, for those who wondered: http://tools.ietf.org/html/rfc5966


I wrote a basic TCP stack for a honeypot research project. It is hard and incredibly complex.

Yes and no. Most of the complication comes from extra functionality (segmentation offload, checksum offload, SACK) or from functionality which is required by the standard but not relevant for a DNS resolver (congestion control, window management, TCP timers).

If all you're doing is accepting a TCP connection, reading a small request, and writing a small response back, you can remove about 90% of the code from your TCP stack.
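The "remove about 90% of the code" point can be made concrete with a toy state machine: a request/response-only TCP endpoint needs just a handful of states and transitions. This is an illustrative reduction with sequence numbers, timers and congestion control all omitted; it is not any real project's code:

```python
# Deliberately minimal TCP state machine for a request/response server:
# no congestion control, no window management, no timers, no seq numbers.
LISTEN, SYN_RCVD, ESTABLISHED, CLOSED = "LISTEN", "SYN_RCVD", "ESTABLISHED", "CLOSED"

def step(state, segment):
    """Return (new_state, reply) for an incoming segment, modelled
    as a dict of flag names. Anything unexpected gets a reset."""
    if state == LISTEN and segment.get("SYN"):
        return SYN_RCVD, "SYN-ACK"
    if state == SYN_RCVD and segment.get("ACK"):
        return ESTABLISHED, None
    if state == ESTABLISHED and segment.get("data"):
        return ESTABLISHED, "ACK + DNS response"
    if state == ESTABLISHED and segment.get("FIN"):
        return CLOSED, "FIN-ACK"
    return state, "RST"

s, r = step(LISTEN, {"SYN": True});        print(s, r)  # SYN_RCVD SYN-ACK
s, r = step(s, {"ACK": True});             print(s, r)  # ESTABLISHED None
s, r = step(s, {"data": b"...query..."});  print(s, r)  # ESTABLISHED ACK + DNS response
s, r = step(s, {"FIN": True});             print(s, r)  # CLOSED FIN-ACK
```

The interoperability pain discussed elsewhere in the thread lives precisely in everything this sketch leaves out.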


Could you be a little more specific about what's incredibly complex about writing an interoperable TCP stack?

That aside: if I had to guess, this would be Robert Graham's 10th IP stack. He's been doing this (specifically) since the late 1990s.


I don't write TCP stacks, but Juho Snellman writes one for a living, and I found the following anecdote about writing an interoperable TCP stack interesting.

http://snellman.net/blog/archive/2014-11-11-tcp-is-harder-th...

TLDR: There are TCP implementations that can't handle SYN retransmission, with which you have to interoperate if your TCP stack is the product.


I'm not the OP, but I think it's fair to call it complex, and I'd pick three requirements out in particular.

1. Path reachability, MTU discovery and MSS interaction

When sending outbound packets, you have to correlate incoming ICMP error messages in case they signal a problem. If the problem is that the packet is too big, you have to figure out what the MTU really is (which can take repeated attempts), so that you know what MSS to use (for TCP, or fragmentation boundary for UDP). If the path is unreachable, you have to remember that too. In both cases, you need some kind of global book-keeping so that you can do the right thing across connections. Some protocols (like active FTP) implicitly rely on MTU discovery on one connection signaling the MSS for another connection, so everything has to be path based, rather than connection based. Messy.

2. State management for error correlation

O.k., so you've figured out how to fragment an outgoing datagram and know what boundary to use, but how do you handle incoming error messages related to the fragments? Even for UDP, or other "stateless" protocols you actually do have to keep state so that you can correlate those error messages to the packets you sent. When the error message comes back, it will have the IP ID of the fragment, but nothing else is guaranteed.

This goes for (1.) too, but ICMP error messages can also be recursive and nested, and for a correct implementation you need to consider how to handle ICMP error messages that were themselves triggered by ICMP error messages. Several userspace stacks get this wrong, and can't correctly handle MTU discovery for UDP, or double-error correlation.

3. Heuristic and inconsistent caps on state

Many TCP implementations support selective acknowledgements and duplicate-ACK signalling, but what are their tolerances? Just how much data can be retransmitted or handled out of order before you have to resend the whole window? There's no way to know, and if you get it wrong you can end up stalling a TCP connection for a significant delay. Unfortunately there are no simple limits, and in some cases the volumes are related to bandwidth-delay products, necessitating some kind of integral control loop.

The problem with all of these is that they only show up "sometimes" and with particular networks or TCP stacks. I've limited these to interoperability issues, but there are other tricky complexities. For example, when building a TCP stack, do you optimize for throughput and so batch reads/writes of many packets, or do you optimize for a correct RTT estimate and do things more synchronously? It's not possible to have both (at least with today's NIC interfaces); sometimes RTT is critical (e.g. an NTP implementation, a real-time control system, or just any system that needs to rapidly recover from packet loss), sometimes throughput is more important. Definitely complex.
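Point 1 above in code form: an ICMP "fragmentation needed" message (Destination Unreachable, type 3, code 4) carries the next-hop MTU in bytes 6-7 of the ICMP header (RFC 1191), from which the stack derives a new MSS. The field offsets below follow the RFCs; the 40-byte overhead (20 IPv4 + 20 TCP, no options) and the overall structure are an illustrative simplification of the book-keeping described above:

```python
import struct

def next_hop_mtu(icmp: bytes):
    """Parse an ICMP message; if it is 'fragmentation needed'
    (type 3, code 4), return the next-hop MTU from bytes 6-7
    (RFC 1191), else None."""
    if len(icmp) < 8:
        return None
    itype, icode = icmp[0], icmp[1]
    if itype == 3 and icode == 4:
        return struct.unpack("!H", icmp[6:8])[0]
    return None

def mss_for(mtu: int) -> int:
    # 20 bytes IPv4 header + 20 bytes TCP header, no options.
    return mtu - 40

# Crafted frag-needed message advertising a 1400-byte next-hop MTU
# (checksum left zero for the sketch):
msg = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
mtu = next_hop_mtu(msg)
print(mtu, mss_for(mtu))  # 1400 1360
```

A real stack would then record this per path, not per connection, which is where the global book-keeping mentioned above comes in.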


Getting a performant TCP is certainly hard. So, for that matter, is getting congestion control right --- TCP congestion control is devilishly hard. But you don't have to do either of those things to get an interoperable TCP!


Seems like this userland TCP stack could be the basis for these types of C10M servers. Are there existing userland TCP stacks available? I'm not familiar with any.


http://shader.kaist.edu/mtcp/

http://www.openonload.org/

There's also some work showing that you can achieve very high performance with kernel TCP: https://www.usenix.org/conference/osdi14/technical-sessions/...


I build stacks that are highly customized for the target solution and are impractical for general purpose use.

A good general-purpose stack is the 6WINDGate stack. I know nothing about it personally, but I know that a lot of people do use it successfully.


Also take a look at NetBSD's rump kernels and PicoTCP. Nobody mentioned them.



lwIP would be a well-known example, right?


Absolutely, but I would say that this has the potential to be a lot more secure than a traditional (full) TCP/IP stack. Most queries are UDP (one packet), and we would expect that TCP connections should only last a few packets. TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.

Of course, then you find some weird version of Windows XP that this breaks :-(


> TCP connections that don't match a handful of patterns would be suspicious (IMHO) and should probably just be dropped.

I used to believe that. It is unfortunately not true. TCP packets will come in all shapes and forms, and all must be treated equally, which is what makes TCP stacks so incredibly complex to implement.

The Linux TCP stack is quite safe and fast, especially with tight integration with NIC hardware (checksum offloading and the like). I'm a bit unsure what a custom stack in userland can provide that the standard kernel stacks don't have.


The Linux TCP stack is NOT fast. My DNS server can respond to DNS queries faster than the simplest of in-kernel echo servers (like ICMP ping or UDP port 7 echo). That's with the entire DNS overhead of parsing the DNS protocol, looking up the name in a very large database (like the .com zone), doing compression, and generating a response.

The upshot is this: going through the Linux stack, a DNS server is limited to around 2 million queries/second. Using a custom user-mode stack, it can achieve 10 million queries per second.


Do you have any insight about how other stacks fare on that same workload? (*bsd, qnx, ms-windows, minix,osx...)


Can you be more specific about the different-but-equal forms TCP packets take? You can dive into the details; I've implemented TCP stacks too.


Performance. This isn't for security, or it wouldn't be yet another critical piece of software written in C.


With the advent of DNSSEC, IPv6 and EDNS0 you're more likely to see DNS responses >512 bytes, and therefore fallback to TCP (with the truncate bit set). It is therefore strongly recommended that you do not drop/block TCP 53 on your middleboxes, firewalls, etc.


Another interesting use-case for TCP in DNS is anti-DDoS. If a botnet is abusing your DNS server to flood traffic to a target, you can flip the 'TC' bit, which forces the request to come back over TCP, exposing the spoof.

My long-winded write-up here: http://opine.me/cert-advisory-on-dns-amplification-offers-li...
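The server side of that trick can be sketched as follows: echo the query's ID, set QR and TC in the flags, and send no answers. A legitimate client retries over TCP; a spoofed source address never completes the handshake. This is an illustrative sketch, not the linked write-up's code:

```python
import struct

QR, TC = 0x8000, 0x0200  # flag bits in the DNS header's second word

def truncated_reply(query: bytes) -> bytes:
    """Build a minimal 12-byte response that only sets QR and TC,
    forcing the client to retry over TCP. Echoes the query ID;
    all four RR counts are zero (a real server would usually also
    echo the question section)."""
    qid = struct.unpack("!H", query[0:2])[0]
    return struct.pack("!HHHHHH", qid, QR | TC, 0, 0, 0, 0)

query = struct.pack("!HHHHHH", 0xBEEF, 0x0100, 1, 0, 0, 0)
reply = truncated_reply(query)
print(hex(struct.unpack("!H", reply[2:4])[0]))  # 0x8200
```

Since the reply is a fixed 12 bytes, it is also useless as amplification payload, which is the point of the technique.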


Very interesting, thx for the link!


EDNS0 includes a mechanism for clients/resolvers to signal that they can handle a large/fragmented UDP response. At this point about 85% of requesters can handle UDP responses of at least 4K. For the moment, DNSSEC and EDNS0 are making falling back to TCP far less common than it used to be.

That may change, as some providers are starting to put smaller limits on their response sizes (to mitigate certain kinds of DDOS and response spoofing attacks). Of course permitting TCP 53 is required for DNS to work; as is permitting UDP fragments (which poorly configured firewalls often block too).
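The signalling mechanism mentioned above is the EDNS0 OPT pseudo-RR (RFC 6891), which reuses the RR CLASS field to advertise the requester's maximum UDP payload size. A sketch of building that record, appended to a query's additional section; the 4096-byte default here is just the common value from the comment above:

```python
import struct

def edns0_opt(udp_payload_size: int = 4096) -> bytes:
    """Build an EDNS0 OPT pseudo-RR (RFC 6891).
    Layout: root name (0x00), TYPE=41 (OPT), CLASS=advertised UDP
    payload size, 4-byte extended RCODE/flags (zero here: no DO bit,
    no extended RCODE), RDLENGTH=0."""
    return b"\x00" + struct.pack("!HHIH", 41, udp_payload_size, 0, 0)

opt = edns0_opt(4096)
print(len(opt))  # 11 bytes total
```

A requester that omits this record is assumed to handle only the classic 512-byte limit, which is exactly when the TC-bit fallback discussed above kicks in.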


Sure; I'm not suggesting dropping TCP entirely, just that, e.g., a 1 MB request/response is not going to be legitimate, so you can simply not implement a lot of TCP's complexity (e.g. window scaling).



