BSD Operating Systems News

Zero-Copy TCP and UDP Output in NetBSD

-is writes "Jason R. Thorpe has recently added experimental code to NetBSD-current that enables zero-copy for TCP and UDP on the transmit side. These changes could mean significant performance improvements for FTP, WWW, and Samba servers. See Jason's announcement to the current-users mailing list for details." From the text: "On tests on an embedded system with limited memory bandwidth, TCP transmit performance on 100baseTX-FDX went from ~6500KB/s to ~11100KB/s, a significant improvement." Excellent!
  • When will this be ported to Linux? :-P
  • by AdamBa ( 64128 ) on Monday May 06, 2002 @10:32PM (#3475064) Homepage
    I'm surprised nobody has come up with hardware to do this. The problem is the network card needs to know about the user's buffer ahead of time. In the old days (i.e. 5 years ago) you had a mix of NetBEUI and IPX and TCP on a network, and it didn't make much sense to make a card intelligent enough to figure out where to put packets.

    But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:

    1) packet sent by the client to a known port on the server
    2) a few packets to set things up and assign a dedicated server port
    3) lots of data blasting from the client to the dedicated server port
    4) some cleanup packets at the end

    Step 3 is what you care about. So you would need to tell the network card: when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also), drops, or out-of-order arrival, in which case you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom, your data is in memory with no copies and off to disk (or wherever) it goes.

    You need some other stuff too: TCP has to be able to hint this to the network card driver, figure out whether more than one app is using a port (so it can turn all this optimization off), and so on. But hey, when it worked it would be cool.

    The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.
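
    To make the first scheme above (per-port data placement) concrete, here is a rough sketch of what the descriptor handed to such a card might look like. This is purely hypothetical; the struct and field names are invented for illustration and correspond to no real NIC interface:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-port receive steering descriptor (illustration
     * only; no real card exposes exactly this).  The driver would hand
     * one of these to the card for each connection it wants to
     * accelerate. */
    struct rx_steer_desc {
            uint16_t local_port;   /* TCP port to match on */
            void    *data_buf;     /* user's posted buffer, wired in memory */
            size_t   data_len;     /* bytes of room left in data_buf */
            size_t   data_off;     /* next fill offset (in-order fast path) */
            void    *hdr_bufs;     /* ring of small header-sized buffers */
            int      hdr_count;    /* number of header buffers in the ring */
            uint32_t flags;        /* e.g. "card also verifies checksums" */
    };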

    As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.

    - adam

    • This paper [utah.edu] describes a distributed shared memory system in which just such a hardware mechanism, called direct deposit, was used.
    • The problem is not necessarily that of early demultiplexing of incoming packets to the appropriate ports. The problem is that in order to achieve zero-copy you'll have to store the packet at the appropriate location in memory. I.e., you must store the incoming packet in the buffer location that the user level application uses for receive. Now, obviously you can not store the packet there unless the user has requested to receive more data (the buffer may be used for other purposes). A solution to this problem is to program all the applications so that their receive buffers are aligned on page-boundaries, and the page containing the receive buffer is only used for containing the receive buffer. This allows the kernel to receive incoming data onto empty pages and map these pages into the application when the application eventually issues a receive operation.

      Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one cannot easily implement zero-copy TCP receive without well behaved applications (i.e., without modifying the application). Zero-copy TCP send is easier, since the location of the outgoing packet is known to the kernel once the send operation starts.
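
      As a minimal user-space illustration of the "well behaved application" idea above, the receive buffer can be allocated page-aligned and page-rounded so that it shares its pages with nothing else; whether the kernel actually page-flips data into it is, of course, entirely a question of kernel support. The function name and buffer size are made up for the example:

      #include <stdlib.h>
      #include <unistd.h>

      #define RECVBUF_PAGES 16        /* arbitrary size for the example */

      /* Allocate a receive buffer that starts on a page boundary and
       * occupies whole pages, so a zero-copy kernel could in principle
       * remap freshly filled pages into it. */
      static char *alloc_recv_buffer(size_t *lenp)
      {
              size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
              size_t len = RECVBUF_PAGES * pagesz;
              void *buf;

              if (posix_memalign(&buf, pagesz, len) != 0)
                      return NULL;
              *lenp = len;
              return buf;
      }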

      • Sure, you would have to only do it when certain things were true -- the user had already posted a buffer, nobody else was using that port, etc. But a standard situation like somebody ftp'ing up a huge file would match those conditions.

        Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.

        - adam

        • True. One can indeed optimize for the common case (i.e., the application is waiting to receive data), and handle other cases in another manner. Just to hand the ball over to you again, though: when receiving data into a user buffer you don't want to receive packet headers, etc., into the same buffer. As such, you must have some way to split up the receive so that the headers and trailer go into one location and the packet data goes into another. This can, depending on the networking hardware, occur completely at the network card (i.e., programmable cards) or at some higher level in the OS. I don't think that one general solution to the problem exists (although I must admit that I by no means qualify as knowledgeable in the field).
          • Yes, you need separate buffers for the headers (I tried to explain this in my first post but wasn't that clear).

            Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give it a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet, that the header and data were split, and that the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but at the last step, where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.

            The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.

            Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.

            You could have the card be smarter and actually know where to put each packet; it could even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card interpret any part of the packet that is encrypted during an IPSEC session (and I don't know exactly where that begins). Some cards do IPSEC in hardware, but that is another issue.

            And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.

            - adam

    • A zero-copy receive path is a significantly harder problem to solve.

      Basically, devices DMA into host memory. These buffers must be preallocated, since you never know when a packet might arrive, and when it does, it needs to go into memory immediately, since the temporary storage in the Ethernet MAC itself is quite small.

      When the data arrives, we still don't know which application it is for. We have to parse headers, etc. to determine that. And once we do, we have a buffer that is:

      1. Not page-aligned.
      2. Not page-rounded.

      This makes it very difficult to "page flip" the buffer into userspace.

      The Trapeze/IP project at Duke implemented a zero-copy receive for FreeBSD, but it required special modifications to the firmware on the Alteon ACEnic Gig-E interfaces they were using. Those interfaces aren't even manufactured anymore, and there are essentially no Gig-E interfaces on the market today which allow you to hack the firmware in such a way. So, their solution pretty much can't be used unless you have full control over the hardware that's going into your device (i.e. it's pretty much of use only to people building embedded systems from scratch).

      Therefore, in the absence of another solution, you are forced to perform at least one copy on the receive side: from the interface's receive buffer into a page-aligned/page-sized buffer in the socket. Once you have that, you *can* page-flip into user space, however, and since the copy across the protection boundary is usually more expensive than a copy within the same address space, there's still some benefit that can be realized.

      • I understand the issues you bring up, but I still think it can be done. You are thinking of it as receiving into one of the standard buffers you have preallocated into the netcard's receive ring, and then somehow making this map into the user's buffer. That's not what I mean.

        I mean in certain cases (but cases that generally correspond to the ones you would care about, large transfers to a server), you have the user's buffer ahead of time, and furthermore there is just one client of the port, and TCP knows this because it has *assigned* the port to that client. For example when the ftp daemon is receiving a put command, it moves the client computer over to a random port, which it asks TCP to assign to it. Let's say TCP assigns port 0x9876. So TCP knows that the ftp daemon is the only one using port 0x9876. And the ftp daemon has posted a big receive to get the data. So TCP can take that receive's buffer and hand it to the driver and say "all data [but not the headers] received on port 0x9876 goes in this buffer." Then the card packs the data in, and in the normal case where every packet is received only once and in order, it works.

        The firmware mod you would have to do is instead of the card having one big ring of receive buffers, it needs a bunch of "per-port" buffers (plus one big ring as before). Now that takes some space to set up the control, but the buffers themselves are all user buffers, so there is not a *lot* more storage needed. The firmware needs to pick out the port number "on the fly" while it is receiving the packet, but I think it could do that.

        There may be alignment issues...maybe the card has to say "gee, the next spot for packets to port 0x9876 has an address that is congruent to 3 modulo 4, so I have to put one extra byte somewhere (along with the header maybe) and then start the transfer at the 4-byte boundary."

        If you want to slap down credentials, I was one of the main designers of both NDIS (the transport <-> network card interface) and TDI (the transport <-> level above) interfaces in NT. And I wrote several transports and netcard drivers. So I know a bit whereof I speak.

        - adam

    • What a retarded thing to say. For one thing, there will soon be TCP over IPv6; is that the TCP you're referring to, or IPv4's version?

      There are too many legacy systems and apps out there, for me to have to worry if the replacement nic is going to try and outsmart me. That's something that happens all too often in windows, and I'll be damned if it happens in hardware without me bitching up a storm.

      I regularly see NetBEUI, DECnet, and IPX on the systems I work with, and even something stranger from time to time. Protocol agnosticism was undoubtedly one of Ethernet's core strengths. With every other physical/logical protocol you always had to jump through hoops to use a different protocol, but Ethernet just doesn't care.

      I do care about more than just TCP/IP. If you weren't a fool, you would too.

      Suggestion to moderator: Flamebait, yes. Troll, no. If you are too stupid to see the difference, then I'm sure the meta-moderator won't be.
  • by Jamie Lokier ( 104820 ) on Monday May 06, 2002 @10:56PM (#3475187) Homepage

    True zerocopy has certain hardware and driver requirements. These are the network drivers in Linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: not all cards supported by those drivers necessarily support full zero copy.) That's from grepping for the NETIF_F_SG flag plus at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.

    In Linux, zerocopy is performed using the sendfile(2) system call, rather than by writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is somewhat less efficient than the sendfile() method, and certainly more complicated to implement.
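
    For reference, a bare-bones sketch of the two transmit paths being compared; error handling and short-write loops are omitted, and the function names are just for the example:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>   /* Linux sendfile(2) */
    #include <sys/stat.h>
    #include <unistd.h>

    /* Linux path: one system call, no user-space mapping at all. */
    static int send_with_sendfile(int sock, const char *path)
    {
            struct stat st;
            int fd = open(path, O_RDONLY);
            off_t off = 0;

            if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;
            while (off < st.st_size &&
                   sendfile(sock, fd, &off, st.st_size - off) > 0)
                    ;                   /* sendfile() advances off itself */
            close(fd);
            return off == st.st_size ? 0 : -1;
    }

    /* BSD-style path: map the file and write() from the mapping; this
     * is the case the new NetBSD code turns into a zero-copy transmit. */
    static int send_with_mmap_write(int sock, const char *path)
    {
            struct stat st;
            int fd = open(path, O_RDONLY);
            char *p;

            if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;
            p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return -1;
            write(sock, p, st.st_size); /* a real server would loop on short writes */
            munmap(p, st.st_size);
            close(fd);
            return 0;
    }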

    A write-from-mmap implementation has to make certain allowances for user-space behaviour. Although the application is advised not to touch the pages from user space, allowing for the possibility basically requires the OS to "pin" pages, either by modifying page tables, which implies TLB and page-walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.

    Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.

    Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.

    But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.
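
    A rough sketch of that trick, with an invented file name and fixed sizes purely for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Generate headers in a shared mapping of a tmpfs-backed file,
     * then let sendfile() transmit them without another copy. */
    static int send_generated_headers(int sock)
    {
            const size_t len = 4096;
            int fd = open("/dev/shm/hdrbuf", O_RDWR | O_CREAT | O_TRUNC, 0600);
            off_t off = 0;
            char *p;
            int n;

            if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
                    return -1;
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return -1;

            /* user-space-generated data goes straight into the mapping... */
            n = snprintf(p, len, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n");

            /* ...and sendfile() sends it; don't scribble over the mapping
             * again until the peer has acknowledged the data. */
            if (sendfile(sock, fd, &off, (size_t)n) < 0)
                    n = -1;

            munmap(p, len);
            close(fd);
            return n;
    }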

    You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility not to modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility with this method of using sendfile() for program-generated data than with BSD-style write(), since it has to carefully manage the data pool. On the other hand, that's exactly why the BSD kernel must do the extra work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well, I expect so anyway; I haven't done performance comparisons.)

    By the way, write() from memory-mapped files has been discussed among Linux kernel developers several times in the past, and each time the idea lost out due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway, so writing from mmap becomes a bit moot.

    Finally, zerocopy UDP is not implemented in Linux at present as far as I know, but some gory details were discussed recently on the kernel list, so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than a year ago), and I believe it has been thoroughly tested since then.

    Enjoy,
    -- Jamie Lokier

    • by lamontg ( 121211 ) on Tuesday May 07, 2002 @02:59AM (#3475802)
      A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:

      SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)

      NAME
      sendfile - send a file to a socket

      LIBRARY
      Standard C Library (libc, -lc)

      SYNOPSIS
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>

      int
      sendfile(int fd, int s, off_t offset, size_t nbytes,
      struct sf_hdtr *hdtr, off_t *sbytes, int flags);

      DESCRIPTION
      Sendfile() sends a regular file specified by descriptor fd out a stream
      socket specified by descriptor s.

      [...]

      IMPLEMENTATION NOTES
      The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
      has been optimized so that copying of the file data is avoided.
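
      Based on that synopsis, here is a short sketch of how the FreeBSD call can also prepend user-generated headers through the sf_hdtr argument; s is assumed to be a connected stream socket and fd an open regular file, and error handling is omitted:

      #include <sys/types.h>
      #include <sys/socket.h>
      #include <sys/uio.h>

      /* Send a small user-memory header followed by the whole file. */
      static int send_file_with_header(int fd, int s)
      {
              char hdr[] = "HTTP/1.0 200 OK\r\n\r\n";
              struct iovec iov = { hdr, sizeof(hdr) - 1 };
              struct sf_hdtr hdtr = { &iov, 1, NULL, 0 };
              off_t sent = 0;

              /* an nbytes of 0 means "send until end of file is reached" */
              return sendfile(fd, s, 0, 0, &hdtr, &sent, 0);
      }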
      • And you also forgot to mention:

        HISTORY

        sendfile() first appeared in FreeBSD 3.0. This manual page first appeared in FreeBSD 3.1.
      • You're right, (hangs head in shame...).
        I had a nagging feeling when I wrote the article ;-)

        Ah well, at least I managed to explain some of the differences between mmap-write and sendfile.

        Thanks,
        -- Jamie

        • Right, but you completely failed to explain why the sendfile-only/no-mmap zero-copy in Linux is better. Instead you mention that one should create files in tmpfs and use sendfile on that. That can hardly be zero-copy anymore, can it?

          Regards, Tommy
    • hoo baby

      by AdamBa ( 64128 )
      Don't know anything about Linux, but compared to NT this seems incredibly complicated. In NT you can take any user buffer, probe it, map it, and lock it. You don't have to worry about "user space behaviour" and whatnot. If the user is doing synchronous write()s (or whatever), the call won't even return until the kernel code says it is done with the buffer. And for asynchronous, you would hope the user won't be walking all over the buffers it has handed over until the call is done. And in all cases the kernel components handle the buffers in the same way; you don't need your network card to indicate with a flag whether it can do zero-copy sends (?!?!?!?!?!?!?).

      - adam

      • I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.

        You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!

        have a nice day,
        -- Jamie

        • I meant asynchronous I/O where the app is expecting it to be asynchronous. With sockets there is some way to wait for a write to complete (which I don't remember offhand, I'm much more familiar with the kernel parts), and using the NT API you can do a wait...I think even with Win32 file APIs you can wait. But I think the default is to have I/O be synchronous, so a dumb app will work. And I disagree that you can't expect an app doing asynchronous I/O not to touch a buffer until the write is done. That's just a broken app. What if it is multithreaded and changes the buffer during the write? What if it zeros it out BEFORE the write?!? Etc etc.

          It's true some cards can't do a busmaster send from discontiguous buffers. So they do have to copy to contiguous memory, or to card memory, or do DMA, or whatever...but that's something the card driver has to just deal with. Nobody above the network card driver has to care.

          So it sounds like the network card flag you mentioned is just an indicator of the ability to send discontiguous buffers directly? That is, a status indication, not something an app has to worry about....so that's fine, I withdraw my ragging on that.

          - adam

    • Yes, Linux does have sendfile(2) while NetBSD does not. No argument there. However, sendfile(2) has some issues:

      1. It only works for sending complete files, and I seem to recall that it closes the file at the end of the transaction. This doesn't work for e.g. Samba servers, which need to send chunks of files.

      2. sendfile(2) doesn't work for data sourced from something other than a file. Consider a database server which maintains a memory-resident cache. Or consider piping output from a command, say, dump(8), over the network. Or consider the case of an iSCSI target device, which might have mmap'd a chunk of disk/file blocks to serve on demand.

      My change works for those 2 (important!) cases above.

      That is not to say that NetBSD won't get a sendfile(2)-like mechanism in the future (it will probably get a splice(2) system call, which allows you to hook together 2 arbitrary file descriptors, one source, one sink -- essentially a generalization of sendfile(2)).

      It's also worth noting that the NetBSD zero-copy TCP/UDP transmit path doesn't require anything special from a device driver; the driver simply needs to be able to DMA from arbitrary memory addresses. And even if a device can't do this, you have still reduced memory bandwidth consumption by eliminating the copy from user space to kernel space.
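
      As a small illustration of the kind of caller this helps, here is a sketch of sending an arbitrary chunk of a file (the Samba-style case) by mapping it and writing from the mapping; the same write() works equally well from a memory-resident cache or any other non-file buffer. The function name and parameters are invented for the example:

      #include <sys/mman.h>
      #include <unistd.h>

      /* Send len bytes of fd starting at off, straight from a mapping.
       * With the new transmit path this write() avoids the user-to-kernel
       * copy whenever the driver can DMA from arbitrary addresses. */
      static ssize_t send_file_chunk(int sock, int fd, off_t off, size_t len)
      {
              size_t  pagesz  = (size_t)sysconf(_SC_PAGESIZE);
              off_t   map_off = off - (off % (off_t)pagesz); /* mmap offset must be page aligned */
              size_t  slack   = (size_t)(off - map_off);
              char   *p;
              ssize_t n;

              p = mmap(NULL, len + slack, PROT_READ, MAP_SHARED, fd, map_off);
              if (p == MAP_FAILED)
                      return -1;
              n = write(sock, p + slack, len);  /* loop on short writes in real code */
              munmap(p, len + slack);
              return n;
      }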
  • We see a lot of posts indicating that *BSD is dying. OTOH, we see a lot (literally a LOT!) of enthusiastic and excited FreeBSD (*BSD) newbies looking for help with basic *BSD questions.

    An important question we all need to ask ourselves: do we want to be plain sideline observers, or do we choose to help grow *BSD? Again, I am sure all of you *BSD geeks will agree that it would be in our best interests to promote *BSD!!!

    This is not an infomercial on our part, but a request for all you "expert" *BSD geeks to give back to the *BSD cause by PARTICIPATING, visiting your favorite *BSD sites (or whatever else !!!), and by helping answer newbie questions.

    Newbies sometimes feel intimidated/overwhelmed by mailing lists, complex howtos, etc. and don't exactly know where to start. Some prefer asking simple questions in forums, or following simple howtos, etc. It would be in our best interest to encourage these folks and turn them towards FreeBSD, OpenBSD and NetBSD!! Encouraging them will promote *BSD commitment (committers) with ports/applications, OS and security enhancements. *BSD needs our help!!
    • Funny post - BSD is now the highest volume *nix in the computer biz. Thanks to Apple shipping on Darwin. What's more, users *REAL USERS* don't give a hoot about "basic *BSD questions". They just want it to work.

      And it can. And it does.

      OK, so maybe this is a troll. But maybe it's insightful, and maybe it's just funny...

  • I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?
    • > Does OpenBSD simply "ripoff" NetBSD development?

      Nope. The sharing of code and ideas among the various *BSD implementations is a Good Thing(TM). There are developers from more than one group who join their efforts and write code that is easy to port, clean, and useful to more than one of the BSDs, and they have been doing a great job for quite some time. There's no ripping off, but a cooperative spirit that we should be glad for.

      To use your example, what if the "ripping off" of code that OpenBSD is accused of helps in revealing some bug that is caused by assumptions that are true only for NetBSD? Is the "ripping off" then justified and part of a "development process", or is it still a bad thing? What if the OpenBSD folks write back to the original authors and help in fixing possible problems with the NetBSD codebase? Aren't they then assisting in making NetBSD a better OS too?

      - Giorgos

    • The neat thing about information is that copying it does not put any burden on the original author. Theo forked NetBSD -- you could too, if you wanted. It hasn't exactly killed NetBSD. Changes flow between all three projects.

      Of course, I've probably just been trolled . . . .
    • As others said, OpenBSD was an offshoot of NetBSD. NetBSD itself started life as a system that combined the Jolitzes' 386BSD with collected patches, so you could say that NetBSD "ripped off" 386BSD. Since all the software is free, nobody is ripping anyone else off, though. And because OpenBSD is based in Canada, they can ship their system with strong encryption, since they are not subject to the USA's fascist crypto laws.

  • Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.

    The other nice thing about OIO is that you don't have to populate your FDSET every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.

    Is there a Linux/BSD equivalent to this?
