Zero-Copy TCP and UDP Output in NetBSD
-is writes "Jason R. Thorpe has recently added experimental code to NetBSD-current, that enables zero-copy for TCP and UDP on the transmit-side. These changes
could mean significant performance improvements for FTP, WWW, and Samba servers. See Jason's announcement to the current-users mailing list for details." From the text: "On tests on an embedded system with limited memory bandwidth, TCP
transmit performance on 100baseTX-FDX went from ~6500KB/s to ~11100KB/s,
a significant improvement." Excellent!
Re:Packet loss (Score:5, Informative)
As for how noticeable they will be...according to the email, it increased from 52 megabits/second to almost 90 megabits/second. But transmitting over a series of hops (like on the Internet), your speed is gated by the slowest link or router, and it is doubtful the whole end-to-end chain will be able to handle 90 Mbps (or 52 Mbps for that matter). So even with a monster send window you would still need to stop transmitting well before you got an ack and could send more.
Of course on a local 100 Mbps Ethernet playing networked games this could be quite nice.
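The numbers above are easy to cross-check. Here is a rough sketch in C that converts the KB/s figures from the announcement into kbit/s (assuming 1 KB = 1000 bytes) and estimates how much unacknowledged data must be in flight to fill a link, i.e. the bandwidth-delay product. The function names and the 50 ms round trip are made up for illustration; they are not from the announcement.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a throughput in KB/s (1 KB = 1000 bytes assumed) to kbit/s. */
static uint64_t kbytes_to_kbits_per_sec(uint64_t kb_per_sec)
{
    return kb_per_sec * 8;
}

/*
 * Bandwidth-delay product: bytes that must be in flight to keep a link
 * of `mbit_per_sec` busy for one round trip of `rtt_ms` milliseconds.
 */
static uint64_t window_bytes_needed(uint64_t mbit_per_sec, uint64_t rtt_ms)
{
    assert(mbit_per_sec <= 1000000);    /* keep the product in range */
    /* (Mbit/s * 1e6 / 8) bytes per second, times rtt_ms / 1000 seconds */
    return mbit_per_sec * 1000000 / 8 * rtt_ms / 1000;
}
```

Under those assumptions 6500 KB/s is 52000 kbit/s and 11100 KB/s is 88800 kbit/s, and filling a 90 Mbps path with a 50 ms round trip needs 562500 bytes in flight, well beyond the 65535 bytes TCP allows without window scaling.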
- adam
Re:Packet loss (Score:1)
More likely you will have a hundred processes each sending to a different endpoint on the Internet. The routers in between are just that: routers, designed from the ground up to handle fast packet switching. I'd say this will be very usable in the real world to increase performance.
As for games, I have not yet seen a LAN game that can't be run on a 10 Mbps ethernet network. 100 is better, but you need a massive (64 players?) game to even be able to notice the improvement.
good point (Score:2)
I guess in general avoiding copies is a good thing, it frees the CPU to do something else. You just may not always have something else for the CPU to do, but hey why not give it the opportunity.
- adam
Re:Packet loss (Score:1)
Port it to Linux!! (Score:1, Troll)
Re:Port it to Linux!! (Score:1)
Re:Port it to Linux!! (Score:1)
Why?
Re:Port it to Linux!! (Score:2)
Re:Port it to Linux!! (Score:1)
what about zero copy on receive? (Score:3, Interesting)
But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:
1) packet sent by the client to a known port on the server
2) a few packets to set things up and assign a dedicated server port
3) lots of data blasting from the client to the dedicated server port
4) some cleanup packets at the end
Step 3 is what you care about. So you would need to tell the network card, when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also) or drops or out of order, then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom your data is in memory with no copies and off to disk (or wherever) it goes.
You need some other stuff like TCP has to be able to hint this to the network card driver, and figure out if more than one app is using a port (so it can turn all this optimization off) and so on. But hey when it worked it would be cool.
The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.
As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.
- adam
Re:what about zero copy on receive? (Score:2)
Re:what about zero copy on receive? (Score:3, Insightful)
Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one can not easily implement zero-copy TCP receive without having well behaved applications (i.e., without modification of the application). Zero-copy TCP send is easier since the location of the outgoing packet is known to the kernel once the send operation starts.
Re:what about zero copy on receive? (Score:3, Insightful)
Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.
- adam
Re:what about zero copy on receive? (Score:1)
Re:what about zero copy on receive? (Score:3, Interesting)
Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet and that it has split header/data and the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but then at the last step where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.
The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.
Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.
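To make the fast path and the fallback concrete, here is a small user-space model in C. It is only a sketch of the idea described above, not real driver or firmware code: `struct port_rx`, `rx_place` and the field names are hypothetical, and the `memcpy` stands in for the card's DMA write into the user buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-port receive state; all names are illustrative. */
struct port_rx {
    uint8_t *user_buf;   /* buffer posted by the application */
    size_t   user_len;   /* its size in bytes */
    uint32_t base_seq;   /* TCP sequence number that maps to user_buf[0] */
    uint32_t next_seq;   /* next in-order sequence number expected */
};

/*
 * Returns 1 if the payload went straight into the user buffer (the fast
 * path), 0 if it has to go to a general-purpose buffer instead because
 * it is out of order, a duplicate, or there is no room left.
 */
static int rx_place(struct port_rx *rx, uint32_t seq,
                    const uint8_t *payload, size_t len)
{
    assert(rx != NULL && payload != NULL);
    size_t off = (size_t)(seq - rx->base_seq);   /* offset into user_buf */

    if (seq != rx->next_seq)                     /* drop, dup, or reorder */
        return 0;
    if (off > rx->user_len || len > rx->user_len - off)
        return 0;                                /* user buffer is full */

    /* Stands in for the card's DMA write into the user buffer. */
    memcpy(rx->user_buf + off, payload, len);
    rx->next_seq = seq + (uint32_t)len;
    return 1;
}
```

Anything that returns 0 here is the case where TCP has to sort the packet out of the general buffers itself, with a copy.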
You could have the card be smarter and actually know where to put each packet; it could even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card need to interpret any part of the packet that is encrypted during an IPSEC session (I don't know exactly where that begins). Some cards do IPSEC in hardware but that is another issue.
And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.
- adam
Re:what about zero copy on receive? (Score:2, Insightful)
Basically, devices DMA into host memory. These buffers must be preallocated, since you never know when a packet might arrive, and when it does, it needs to go into memory immediately, since the temporary storage in the Ethernet MAC itself is quite small.
When the data arrives, we still don't know which application it is for. We have to parse headers, etc. to determine that. And once we do, we have a buffer that is:
1. Not page-aligned.
2. Not page-rounded.
This makes it very difficult to "page flip" the buffer into userspace.
The Trapeze/IP project at Duke implemented a zero-copy receive for FreeBSD, but it required special modifications to the firmware on the Alteon ACEnic Gig-E interfaces they were using. Those interfaces aren't even manufactured anymore, and there are essentially no Gig-E interfaces on the market today which allow you to hack the firmware in such a way. So, their solution pretty much can't be used unless you have full control over the hardware that's going into your device (i.e. it's pretty much of use only to people building embedded systems from scratch).
Therefore, in the absence of another solution, you are forced to perform at least one copy on the receive side: from the interface's receive buffer into a page-aligned/page-sized buffer in the socket. Once you have that, you *can* page-flip into user space, however, and since the copy across the protection boundary is usually more expensive than a copy within the same address space, there's still some benefit that can be realized.
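The two conditions listed above (page-aligned and page-rounded) are simple to state in code. A minimal sketch in C follows; the function names are hypothetical, and a real kernel would check more than this before remapping a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Returns 1 if [addr, addr + len) starts on a page boundary and covers a
 * whole number of pages, so that it could in principle be remapped
 * ("page-flipped") instead of copied. Returns 0 otherwise.
 */
static int can_page_flip(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (len == 0)
        return 0;
    return (addr % page_size == 0) && (len % page_size == 0);
}

/* Convenience wrapper that uses the running system's page size. */
static int can_page_flip_here(const void *p, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    return can_page_flip((uintptr_t)p, len, (size_t)ps);
}
```

A 1460-byte TCP payload sitting somewhere inside a receive buffer fails both tests, which is why the extra copy into a page-sized socket buffer is needed first.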
Re:what about zero copy on receive? (Score:2)
I mean in certain cases (but cases that generally correspond to the ones you would care about, large transfers to a server), you have the user's buffer ahead of time, and furthermore there is just one client of the port, and TCP knows this because it has *assigned* the port to that client. For example when the ftp daemon is receiving a put command, it moves the client computer over to a random port, which it asks TCP to assign to it. Let's say TCP assigns port 0x9876. So TCP knows that the ftp daemon is the only one using port 0x9876. And the ftp daemon has posted a big receive to get the data. So TCP can take that receive's buffer and hand it to the driver and say "all data [but not the headers] received on port 0x9876 goes in this buffer." Then the card packs the data in and in the normal case where every packet is received only once and in order, it works.
The firmware mod you would have to do is instead of the card having one big ring of receive buffers, it needs a bunch of "per-port" buffers (plus one big ring as before). Now that takes some space to set up the control, but the buffers themselves are all user buffers, so there is not a *lot* more storage needed. The firmware needs to pick out the port number "on the fly" while it is receiving the packet, but I think it could do that.
There may be alignment issues...maybe the card has to say "gee, the next spot for packets to port 0x9876 has an address that is congruent to 3 modulo 4"...so it has to put one extra byte somewhere (along with the header maybe) and then start its transfer at the 4-byte boundary.
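The "congruent to 3 modulo 4" case works out to one byte of padding. A tiny sketch in C of that arithmetic (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes to skip so that addr + pad is a multiple of `align`, where
 * `align` is a power of two. Returns 0 if addr is already aligned.
 */
static unsigned pad_to_boundary(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}
```

For an address ending in ...3 and a 4-byte boundary this gives 1, so one stray byte has to be parked somewhere before the aligned transfer can start.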
If you want to slap down credentials, I was one of the main designers of both NDIS (the transport <-> network card interface) and TDI (the transport <-> level above) interfaces in NT. And I wrote several transports and netcard drivers. So I know a bit whereof I speak.
- adam
Re:what about zero copy on receive? (Score:2)
There are too many legacy systems and apps out there, for me to have to worry if the replacement nic is going to try and outsmart me. That's something that happens all too often in windows, and I'll be damned if it happens in hardware without me bitching up a storm.
I regularly see NetBEUI, DECnet, and IPX on the systems I work with, and even something stranger from time to time. That was undoubtedly one of ethernet's core strengths, the ability to be protocol agnostic. On every other physical/logical protocol, you always had to jump through hoops to use a different protocol, but ethernet just doesn't care.
I do care about more than just TCP/IP. If you weren't a fool, you would too.
Suggestion to moderator: Flamebait, yes. Troll, no. If you are too stupid to see the difference, then I'm sure the meta-moderator won't be.
Linux has better zerocopy TCP, and here's why (Score:4, Informative)
True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.
In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.
A write-from-mmap implementation has to make certain allowances for user space behaviour. Although it's advised not to touch the pages from user space, allowing for this basically requires the OS to "pin" pages, either by modifying page tables which implies TLB and page walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.
Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.
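For reference, the mmap/write/munmap sequence being compared here looks roughly like this in C. It is a minimal sketch using only POSIX calls; `send_file_via_mmap` is a hypothetical name, and whether the kernel actually avoids the copy depends entirely on the kernel, since the calls are the same either way.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send the whole file at `path` to `sock` by mapping it and calling
 * write() on the mapping. Returns bytes sent, or -1 on error.
 */
static ssize_t send_file_via_mmap(int sock, const char *path)
{
    assert(sock >= 0 && path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(sock, p + sent, len - sent);
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the write */
            munmap(p, len);
            return -1;
        }
        sent += (size_t)n;
    }
    munmap(p, len);
    return (ssize_t)sent;
}
```

That is four system calls plus a mapping setup and teardown per file, which is the overhead being weighed against a single sendfile() for small transfers.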
Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.
But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.
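A sketch of that trick in C, for Linux only, might look like the following. The names `send_generated_header` and `POOL_SIZE` are hypothetical, the caller is assumed to have created the pool file (on tmpfs or anywhere else) and sized it to `POOL_SIZE` beforehand, and error handling is kept to a minimum.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/sendfile.h>   /* Linux-specific header */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define POOL_SIZE 4096      /* hypothetical size of the shared pool file */

/*
 * Format an HTTP header into a shared mapping of `pool_fd`, then hand
 * the same bytes to sendfile(). Returns bytes sent, or -1 on error.
 */
static ssize_t send_generated_header(int sock, int pool_fd, long body_len)
{
    assert(sock >= 0 && pool_fd >= 0);
    char *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, pool_fd, 0);
    if (pool == MAP_FAILED)
        return -1;

    /* Generate the data from user space, directly in the file's pages. */
    int n = snprintf(pool, POOL_SIZE,
                     "HTTP/1.0 200 OK\r\nContent-Length: %ld\r\n\r\n",
                     body_len);
    munmap(pool, POOL_SIZE);    /* a shared mapping's writes stay in the file */
    if (n < 0 || n >= POOL_SIZE)
        return -1;

    off_t off = 0;
    ssize_t total = 0;
    while (total < n) {
        ssize_t r = sendfile(sock, pool_fd, &off, (size_t)(n - total));
        if (r <= 0)
            return -1;
        total += r;
    }
    return total;
}
```

A real server would keep the mapping around and carve it into slots rather than map and unmap per response; this version only shows the order of operations.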
You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility to not modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility, to carefully manage the data pool, with this method of using sendfile() for program-generated data than with BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well I expect so anyway, I haven't done performance comparisons).
By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.
Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.
Enjoy,
-- Jamie Lokier
FreeBSD has zerocopy sendfile... (Score:5, Informative)
SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)
NAME
sendfile - send a file to a socket
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
int
sendfile(int fd, int s, off_t offset, size_t nbytes,
struct sf_hdtr *hdtr, off_t *sbytes, int flags);
DESCRIPTION
Sendfile() sends a regular file specified by descriptor fd out a stream
socket specified by descriptor s.
[...]
IMPLEMENTATION NOTES
The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
has been optimized so that copying of the file data is avoided.
Re:FreeBSD has zerocopy sendfile... (Score:2, Interesting)
HISTORY
Ok, you're right. (Score:1)
You're right, (hangs head in shame...). ;-)
I had a nagging feeling when I wrote the article
Ah well, at least I managed to explain some of the differences between mmap-write and sendfile.
Thanks,
-- Jamie
Re:Ok, you're right. (Score:2, Insightful)
Regards, Tommy
Re:Linux has better zerocopy TCP, and here's why (Score:1)
VERSIONS
sendfile is a new feature in Linux 2.2. The include file
<sys/sendfile.h> is present since glibc2.1.
Other Unixes often implement sendfile with different
semantics and prototypes. It should not be used in
portable programs.
Re:Linux has better zerocopy TCP, and here's why (Score:1)
In fact, apache 2.0 now does exactly this -- determines the version of sendfile() on your system and implements the appropriate code to use it.
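The general shape of such a wrapper, sketched in C, is a compile-time switch on the platform. This is an illustration only and not Apache's code: `send_file_range` is a hypothetical name, only the Linux and FreeBSD prototypes are covered, and everything else falls back to a single plain read and write.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__linux__)
#include <sys/sendfile.h>
#elif defined(__FreeBSD__)
#include <sys/uio.h>
#endif

/*
 * Send `count` bytes of the file at `path`, starting at `offset`, to
 * `sock`. Returns bytes sent (possibly fewer than `count`) or -1.
 */
static ssize_t send_file_range(int sock, const char *path,
                               off_t offset, size_t count)
{
    assert(sock >= 0 && path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    ssize_t result;
#if defined(__linux__)
    /* Linux: sendfile(out_fd, in_fd, &offset, count) returns bytes sent. */
    result = sendfile(sock, fd, &offset, count);
#elif defined(__FreeBSD__)
    /* FreeBSD: sendfile(fd, s, offset, nbytes, hdtr, &sbytes, flags). */
    off_t sbytes = 0;
    int rc = sendfile(fd, sock, offset, count, NULL, &sbytes, 0);
    result = (rc == -1 && sbytes == 0) ? -1 : (ssize_t)sbytes;
#else
    /* Anywhere else: one ordinary read-then-write through a buffer. */
    char buf[8192];
    size_t want = count < sizeof buf ? count : sizeof buf;
    result = pread(fd, buf, want, offset);
    if (result > 0)
        result = write(sock, buf, (size_t)result);
#endif
    close(fd);
    return result;
}
```

The caller loops until everything is sent, the same as with write(); only the body of the wrapper differs between systems.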
Slashdot: Bored at work? Come explain elementary coding practices to 15 year old hax0rs!
hoo baby (Score:3, Insightful)
- adam
NT's method is the same as the new BSD code (Score:2, Insightful)
I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.
You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!
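To put a number on "physically discontiguous": a user buffer that is contiguous in virtual memory can touch several pages, and each page may sit in a different physical frame. A small sketch in C that counts them (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of pages touched by the virtual range [addr, addr + len).
 * Since each page may map to a different physical frame, this is also
 * the worst-case number of DMA segments needed to send the buffer.
 */
static size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (len == 0)
        return 0;
    uintptr_t first = addr / page_size;
    uintptr_t last  = (addr + len - 1) / page_size;
    return (size_t)(last - first + 1);
}
```

Even a single 1460-byte segment can straddle two pages if it starts near the end of one, so a card without scatter/gather cannot send it in place.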
have a nice day,
-- Jamie
Re:NT's method is the same as the new BSD code (Score:2)
It's true some cards can't do a busmaster send from discontiguous buffers. So they do have to copy to contiguous memory, or to card memory, or do DMA, or whatever...but that's something the card driver has to just deal with. Nobody above the network card driver has to care.
So it sounds like the network card flag you mentioned is just an indicator of the ability to send discontiguous buffers directly? That is, a status indication, not something an app has to worry about....so that's fine, I withdraw my ragging on that.
- adam
Re:Linux has better zerocopy TCP, and here's why (Score:1)
1. It only works for sending complete files, and I seem to recall that it closes the file at the end of the transaction. This doesn't work for e.g. Samba servers, which need to send chunks of files.
2. sendfile(2) doesn't work for data sourced from something other than a file. Consider a database server which maintains a memory-resident cache. Or consider piping output from a command, say, dump(8), over the network. Or consider the case of an iSCSI target device, which might have mmap'd a chunk of disk/file blocks to serve on-demand.
My change works for those 2 (important!) cases above.
That is not to say that NetBSD won't get a sendfile(2)-like mechanism in the future (it will probably get a splice(2) system call, which allows you to hook together 2 arbitrary file descriptors, one source, one sink -- essentially a generalization of sendfile(2)).
It's also worth noting that the NetBSD zero-copy TCP/UDP transmit path doesn't require anything special from a device driver; the driver simply needs to be able to DMA from arbitrary memory addresses. And even if a device can't do this, you have still reduced memory bandwidth consumption by eliminating the copy from user space to kernel space.
Re:Linux has better zerocopy TCP, and here's why (Score:1)
Regarding Tx DMA alignment for cheap PCI cards... I have written a fair number of Ethernet drivers over the past several years, and I can't think of any chip that required any more than 4-byte alignment of the DMA buffer. The ones that spring to mind are the RealTek 81x9, Xircom X3201, and the VIA "Rhine" series... oh, and e.g. the Alchemy Au1000 built-in Ethernet MAC also requires 4 byte alignment.
But by far the common case is that the chip can DMA from an arbitrary byte location in memory. I don't really consider this a "special" feature. Rather, I consider chips that can't do this to be "crippled".
(Note, it IS fairly common for Ethernet chips to have stupid limitations on the *receive* side, specifically 4-byte alignment of the Rx buffer, which means the IP header ends up misaligned after the 14-byte Ethernet header. This is truly annoying, since software has to copy data to fix it up.)
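The arithmetic behind that complaint, as a short C sketch (names are hypothetical): with a 14-byte Ethernet header in front, an Rx buffer that starts on a 4-byte boundary leaves the IP header 2 bytes off.

```c
#include <assert.h>
#include <stdint.h>

#define ETHER_HDR_LEN_BYTES 14   /* dst MAC + src MAC + EtherType */

/*
 * Given the address at which the frame starts in the Rx buffer, return
 * how far past a 4-byte boundary the IP header lands (0 means aligned).
 */
static unsigned ip_header_misalignment(uintptr_t frame_start)
{
    return (unsigned)((frame_start + ETHER_HDR_LEN_BYTES) & 3u);
}

/* Start offset within a 4-byte-aligned buffer that aligns the IP header. */
static unsigned rx_start_offset(void)
{
    unsigned off = (4 - ETHER_HDR_LEN_BYTES % 4) % 4;
    assert(ip_header_misalignment(off) == 0);
    return off;
}
```

A chip that lets the driver start the frame 2 bytes into the buffer avoids the problem; one that insists on a 4-byte-aligned start is the case where software ends up copying.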
Regarding the 3 things you have to do to do zero-copy. NetBSD's virtual memory system has a generic framework for handling "loaning" of pages from one VM object to another. The uvm_loan facility is currently used by pipe(2) and socket writes (new with my changes). That said:
1. Yes, we support full correctness of the written data. If another thread (or the same thread) touches the page before the kernel is finished with it, the loan is considered broken, and a copy-on-write fault is taken to resolve the situation.
2. This is really the same question as (1). The pages aren't marked as "write in progress", per se, but are marked as "loaned" (loans can be used for things other than just outbound I/O, although that is the most obvious use of the facility).
3. Yes, there is TLB manipulation traffic. This is why you pick a threshold for using the loaning facility. Doing it for small writes would be stupid, since the copy would be less overhead than the TLB traffic. That said, even in an MP system, it's not too bad, since NetBSD uses explicit barrier operations for low-level VM operations (so that the machine-dependent "pmap" module can coalesce TLB operations if it would be beneficial to do so). In any case, the expense of TLB shootdown traffic is largely an implementation issue.
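The threshold idea in point 3 can be illustrated with a toy cost model in C. Everything here is an assumption made for illustration: the struct, the function names, and any numbers fed in are hypothetical and are not taken from the NetBSD code, and the model ignores per-page loan costs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model; the inputs are assumptions, not measurements. */
struct xmit_costs {
    unsigned long copy_ns_per_kb;   /* cost of copying 1024 bytes */
    unsigned long loan_fixed_ns;    /* fixed page-table and TLB cost of a loan */
};

/* Smallest write size, in bytes, at which loaning costs no more than copying. */
static size_t loan_threshold(const struct xmit_costs *c)
{
    assert(c != NULL && c->copy_ns_per_kb > 0);
    /* Solve len / 1024 * copy_ns_per_kb >= loan_fixed_ns, rounding up. */
    return (size_t)((c->loan_fixed_ns * 1024 + c->copy_ns_per_kb - 1)
                    / c->copy_ns_per_kb);
}

/* Returns 1 if a write of `len` bytes should use page loaning. */
static int should_loan(const struct xmit_costs *c, size_t len)
{
    return len >= loan_threshold(c);
}
```

With a made-up 500 ns per KB of copying and 2000 ns of fixed loan overhead, the break-even point is 4096 bytes: smaller writes are copied, larger ones are loaned.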
*BSD needs our help! not our pity :-) (Score:1)
An important question we all need to ask ourselves: do we want to be plain sideline observers, or do we choose to help grow *BSD? Again, I am sure all of you *BSD geeks will agree that it would be in our best interests to promote *BSD!!!
This is not an infomercial on our part, but a request for all you "expert" *BSD geeks to give back to the *BSD cause by PARTICIPATING, visiting your favorite *BSD sites (or whatever else !!!), and by helping answer newbie questions.
Newbies sometimes feel intimidated/overwhelmed by mailing lists, complex howtos, etc. and don't exactly know where to start. Some prefer asking simple questions in forums, or following simple howtos, etc. It would be in our best interest to encourage these folks and turn them towards FreeBSD, OpenBSD and NetBSD!! Encouraging them will promote *BSD commitment (committers) with ports/applications, OS and security enhancements. *BSD needs our help!!
Re:*BSD needs our help! not our pity :-) (Score:1)
And it can. And it does.
OK, so maybe this is a troll. But maybe it's insightful, and maybe it's just funny...
OpenBSD and NetBSD? (Score:2)
I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?
Re:OpenBSD and NetBSD? (Score:2, Insightful)
Nope. The sharing of code and ideas among the various *BSD implementations is a Good Thing(TM). Developers from more than one group who join their efforts and write code that is easy to port, clean, and useful to more than one of the BSDs are out there. And they have been doing a great job for quite some time. There's no ripping off, but a cooperative spirit that we should be glad for.
To use your example, what if the "ripping off" of code that OpenBSD is accused of helps in revealing some bug that is caused by assumptions that are true only for NetBSD? Is the "ripping off" then justified and part of a "development process", or is it still a bad thing? What if the OpenBSD folks write back to the original authors and help in fixing possible problems with the NetBSD codebase? Aren't they then assisting in making NetBSD a better OS too?
- Giorgos
Re:OpenBSD and NetBSD? (Score:1)
Of course, I've probably just been trolled . . .
Re:OpenBSD and NetBSD? (Score:1)
Re:OpenBSD and NetBSD? (Score:1)
Since OpenBSD is based in Canada they can ship their system with strong encryption since they are not subject to the USA's fascist crypto laws.
The USA crypto laws are no longer fascist. The Bureau of Industry and Security (BXA) [doc.gov] has lifted most restrictions on exporting free [gnu.org] cryptographic software.
Overlapped IO / Win32 (Score:2)
Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.
The other nice thing about OIO is that you don't have to populate your FDSET every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.
Is there a Linux/BSD equivalent to this?
Re:Overlapped IO / Win32 (Score:1)