Zero-Copy TCP and UDP Output in NetBSD

Catch up on stories from the past week (and beyond) at the Slashdot story archive

Zero-Copy TCP and UDP Output in NetBSD 74

Posted by chrisd on Monday May 06, 2002 @10:34PM from the who-is-this-zero-guy-anyhow dept.

-is writes "Jason R. Thorpe has recently added experimental code to NetBSD-current, that enables zero-copy for TCP and UDP on the transmit-side. These changes could mean significant performance improvements for FTP, WWW, and Samba servers. See Jason's announcement to the current-users mailing list for details." From the text: " On tests on an embedded system with limited memory bandwith, TCP transmit performance on 100baseTX-FDX went from ~6500KB/s to ~11100KB/s, a significant improvement." Excellent!

This discussion has been archived. No new comments can be posted.

Zero-Copy TCP and UDP Output in NetBSD

Load All Comments

Search 74 Comments Log In/Create an Account

Comments Filter:

- Re:Packet loss (Score:5, Informative)
  
  by AdamBa ( 64128 ) writes: on Monday May 06, 2002 @11:23PM (#3475029) Homepage
  
  You know when the packet was received when you get an ack for it (for TCP, with UDP you can just dump it to the network card and forget about it since it's unreliable delivery). Obviously in this situation you need to hold on to the user's buffer until you have gotten an ack, otherwise you don't have the data until you need to retransmit. But you want to optimize for the general case which is you do get an ack real quick and then you can just return the buffer...I suppose a clever TCP implementation using this would have some spare buffers, and after some time without an ack (maybe the first retransmission, maybe sooner), it would copy the data and give the user's buffer back. But again that is a rare case so the extra copy there doesn't matter.
  As fow how noticeable they will be...according to the email, it increased from 52 megabits/second to almost 90 megabits/second. But transmitting over a series of hops (like on the Internet), your speed is gated by the slowest link or router, and it is doubtful the whole end-to-end chain will be able to handle 90 Mbps (or 52 Mbps for that matter). So even with a monster send window you would still need to stop transmitting well before you got an ack and could send more.
  Of course on a local 100 Mbps Ethernet playing networked games this could be quite nice.
  - adam
  
  Parent Share
  twitter facebook
  - Re:Packet loss (Score:1)
    
    by iansmith ( 444117 ) writes:
    
    You are assuming tha the whole 90 mb/s stream is going to one site.
    
    More likely you will have a hundred processes each sending to a diffrent endpoint on the internet. The routers in between are just that, routers and designed from the ground up to handle fast packet switching. I'd say this will be very useable in the real world to increase performance.
    
    As for games, I have not yet seen a LAN game that can't be run on a 10 Mbps ethernet network. 100 is better, but you need a massive (64 players?) game to even be able to notice the improvement.
    - good point (Score:2)
      
      by AdamBa ( 64128 ) writes:
      
      You are right <smacks self in forehead>.
      I guess in general avoiding copies is a good thing, it frees the CPU to do something else. You just may not always have something else for the CPU to do, but hey why not give it the opportunity.
      - adam
  - Re:Packet loss (Score:1)
    
    by neptuneb1 ( 261497 ) writes:
    
    While it is true your server would be unable to send 52Mbps to a single client, chances are that your traffic goes over multiple different links once it leaves your ISP (possibly even within your ISP). If you pay for a good service, their internal structure should be able to handle your 90Mbps. It's the same reason that people invest in gigabit ethernet. The internet can't handle anywhere NEAR that bandwith end-to-end, but split among multiple clients, it's not so unrealistic.
Port it to Linux!! (Score:1, Troll)

by redcliffe ( 466773 ) writes:

When will this be ported to Linux? :-P
- - Re:Port it to Linux!! (Score:1)
    
    by ozzmosis ( 99513 ) writes:
    
    Windows didn't tcp code until 3.11 (remember windows for workgroups?)
- Re:Port it to Linux!! (Score:1)
  
  by jo42 ( 227475 ) writes:
  
  > When will this be ported to Linux?
  Why?
  - Re:Port it to Linux!! (Score:2)
    
    by redcliffe ( 466773 ) writes:
    
    Well I could say because BSD is dying, but I won't. It's a cool thing and worth porting.
- Re:Port it to Linux!! (Score:1)
  
  by elbles ( 516589 ) writes:
  
  IIRC, the Linux 2.4.x kernels *already* support zero-copy TCP operations; not sure about UDP though . . .
what about zero copy on receive? (Score:3, Interesting)

by AdamBa ( 64128 ) writes: on Monday May 06, 2002 @11:32PM (#3475064) Homepage

I'm surprised nobody has come up with hardware to do this. The problem is the network card needs to know about the user's buffer ahead of time. In the old days (i.e. 5 years ago) you had a mix of Netbeui and IPX and TCP on a network and it didn't make much sense to make a card intelligent enough to figure out where to put packets.
But now all anyone cares about is TCP. Furthermore, a typical copy of data to a server goes something like:
1) packet sent by the client to a known port on the server
2) a few packets to set things up and assign a dedicated server port
3) lots of data blasting from the client to the dedicated server port
4) some cleanup packets at the end

Step 3 is what you care about. So you would need to tell the network card, when you get packets for this port, put the data in this buffer in the order received, and put the headers here (in some small header-sized buffers TCP would also provide). Now you might get bad checksums (although the hardware could check that also) or drops or out of order, then you would need to rearrange...but in the 99%+ normal case you get all the packets in order with valid checksums. So the card stuffs the data in the right place, TCP checks the header buffers to make sure everything is kosher, and boom your data is in memory with no copies and off to disk (or wherever) it goes.
You need some other stuff like TCP has to be able to hint this to the network card driver, and figure out if more than one app is using a port (so it can turn all this optimization off) and so on. But hey when it worked it would be cool.
The other way this would work is if the network card was set up with a big chain of receive buffers and it would actually hand a buffer up to TCP (so it got taken out of the chain) and then eventually it would get it back...but this requires a lot of trust of the levels above TCP that ultimately decide when the receive data isn't needed anymore.
As Dilbert said this weekend...if you can understand the preceding, you have my sympathy.
- adam

Share
twitter facebook
- Re:what about zero copy on receive? (Score:2)
  
  by EvlG ( 24576 ) writes:
  
  This paper [utah.edu] describes a distributed shared memory system in which just such a hardware mechanism was used, called direct deposit. The paper has some information on it.
- Re:what about zero copy on receive? (Score:3, Insightful)
  
  by Espen Skoglund ( 204722 ) writes:
  
  The problem is not necessarily that of early demultiplexing of incoming packets to the appropriate ports. The problem is that in order to achieve zero-copy you'll have to store the packet at the appropriate location in memory. I.e., you must store the incoming packet in the buffer location that the user level application uses for receive. Now, obviously you can not store the packet there unless the user has requested to receive more data (the buffer may be used for other purposes). A solution to this problem is to program all the applications so that their receive buffers are aligned on page-boundaries, and the page containing the receive buffer is only used for containing the receive buffer. This allows the kernel to receive incoming data onto empty pages and map these pages into the application when the application eventually issues a receive operation.
  Of course, there are more quirks to the problem than what I've discussed here. However, the point is that one can not easily implement zero-copy TCP receive without having well behaved applications (i.e., without modification of the application). Zero-copy TCP send is easier since the location of the outgoing packet is known to the kernel once the send operation starts.
  - Re:what about zero copy on receive? (Score:3, Insightful)
    
    by AdamBa ( 64128 ) writes:
    
    Sure you would have to only do it with certain things were true -- the user had already posted a buffer, nobody else was using that port, etc. But a standard situation like somebody ftp'ing up a huge file would match those conditions.
    Now I confess I don't know anything about the internals of any Unix version. What I worked on was NT. And in NT this would be very easy (memory-management-wise I mean) as long as you had the user's buffer ahead of time...no need to have receive buffers aligned on a page boundary or anything. Since I don't know *nix, I don't know why that restriction would be needed. The card could receive anywhere in memory (doesn't need to be aligned)...and in NT you can map any user buffer into kernel space so any device driver can access it, then lock it to a physical address so the card can access it.
    - adam
    - Re:what about zero copy on receive? (Score:1)
      
      by Espen Skoglund ( 204722 ) writes:
      
      True. One can indeed optimize for the common case (i.e., application is waiting to receive data), and handle other cases in another manner. Just to hand the ball over to you again, though; when receiving data into a user buffer you don't want to receive packet headers, etc., into the same buffer. As such, you must have some way to split up the receive so that the headers and tailer go into one location and the packed data goes into another. This can, depending on the networking hardware, occur completely at the network card (i.e., programmable cards) or at some higher level in the OS. I don't think that one general solution to the problem exist (although I must admit that I by no means qualifies as being knowledgeable in the field).
      - Re:what about zero copy on receive? (Score:3, Interesting)
        
        by AdamBa ( 64128 ) writes:
        
        Yes, you need separate buffers for the headers (I tried to explain this in my first post but wasn't that clear).
        Let's say you get a 64K buffer from the user. So you hand it to the card and say "all data for port 0x1234 goes in here." Then you also give a chain of receive headers, 64 bytes or whatever. The processing after that should be pretty straightforward...when the card interrupts you with a packet received, it sets a flag saying that the data was put in a user buffer. Then the network card driver tells TCP it has a packet and that it has split header/data and the data is at location XXX. At that point the processing should be basically the same for TCP, verifying checksums and headers etc, but then at the last step where it would copy the data to the user's buffer, it just doesn't have to -- as long as the data was supposed to wind up at XXX.
        The tricky case is handling drops and dups and out of order. For example if the fifth packet in a transfer is received third, then TCP can't just move it to the right spot because the card may be using that spot to receive another packet. In general trying to tell the card "oh you should back up and start receiving new packets here instead of here" is tricky timing because a packet may be coming in while you are trying to tell the card that.
        Of course in situations like this TCP doesn't have to be perfectly optimized since you will likely need to retransmit anyway, but it shouldn't be terrible. And the card will also need to be given a set of general buffers for packets that are not to an expected port, or where the user buffer runs out of room, etc. Then TCP has to be clever about putting those packets in the right place.
        You could have the card be smarter and actually known about where to put each packet, it couldn't even do acks and retransmit requests...but you don't want to make it too complicated. Plus I think you want to avoid having the card need to interpret any part of the packet that is encrypted during an IPSEC session (which I don't know exactly where that begins). Some cards do IPSEC in hardware but that is another issue.
        And of course this only helps if the server is CPU-bound, as opposed to disk or network etc.
        - adam
- Re:what about zero copy on receive? (Score:2, Insightful)
  
  by thorpej ( 316096 ) writes:
  
  A zero-copy receive path is a significantly harder problem to solve.
  
  Basically, devices DMA into host memory. These buffers must be preallocated, since you never know when a packet might arrive, and when it does, it needs to go into memory immediately, since the temporary storage in the Ethernet MAC itself is quite small.
  
  When the data arrives, we still don't know which application it is for. We have to parse headers, etc. to determine that. And once we do, we have a buffer that is:
  
  1. Not page-aligned.
  2. Not page-rounded.
  
  This makes it very difficult to "page flip" the buffer into userspace.
  
  The Trapeeze/IP project at Duke implemented a zero-copy receive for FreeBSD, but it required special modificaitons to the firmware on the Alteon ACEnic Gig-E interfaces they were using. Those interfaces aren't even manufactured anymore, and there are essentially no Gig-E interfaces on the market today which allow you to hack the firmware in such a way. So, their solution pretty much can't be used unless you have full control over the hardware that's going into your device (i.e. it's pretty much of use only to people building embedded systems from scratch).
  
  Therefore, in the absense of another solution, you are forced to perform at least one copy on the receive side: from the interface's receive buffer into a page-aligned/page-sized buffer in the socket. Once you have that, you *can* page-flip into user space, however, and since the copy across the protection boundary is usually more expensive than a copy within the same address space, so there's still some benefit that can be realized.
  - Re:what about zero copy on receive? (Score:2)
    
    by AdamBa ( 64128 ) writes:
    
    I understand the issues you bring up, but I still think it can be done. You are thinking of it as receiving into one of the standard buffers you have preallocated into the netcard's receive ring, and then somehow making this map into the user's buffer. That's not what I mean.
    I mean in certain cases (but cases that generally correspond to the ones you would care about, large transfers to a server), you have the user's buffer ahead of time, and furthermore there is just one client of the port, and TCP knows this because it has *assigned* the port to that client. For example when the ftp daemon is receiving a put command, it moves the client computer over to a random port, which it asks TCP to assign to it. Let's say TCP assigns port 0x9876. So TCP knows that only the ftp daemon is the only one using port 0x9876. And the ftp daemon has posted a big receive to get the data. So TCP can take that receive's buffer and hand it to the driver and say "all data [but not the headers] received on port 0x9876 goes in this buffer." Then the card packs the data in and in the normal case where every packet is received only once and in order, it works.
    The firmware mod you would have to do is instead of the card having one big ring of receive buffers, it needs a bunch of "per-port" buffers (plus one big ring as before). Now that takes some space to set up the control, but the buffers themselves are all user buffers, so there is not a *lot* more storage needed. The firmware needs to pick out the port number "on the fly" while it is receiving the packet, but I think it could to that.
    There may be alignment issues....maybe the card has to say "gee the next spot for packets to port 0x9876 has an address that is congruent to 3 modulo 4...so it has to put one extra byte somewhere (along with the header maybe) and then start its transfer at the 4-byte boundary.
    If you want to slap down credentials, I was one of the main designers of both NDIS (the transport <-> network card interface) and TDI (the transport <-> level above) interfaces in NT. And I wrote several transports and netcard drivers. So I know a bit whereof I speak.
    - adam
- Re:what about zero copy on receive? (Score:2)
  
  by NoMoreNicksLeft ( 516230 ) writes:
  
  What a retarded thing to say. For one thing, there will soon be ipv6 TCP, is this the tcp you're referring to, or ipv4's version?
  
  There are too many legacy systems and apps out there, for me to have to worry if the replacement nic is going to try and outsmart me. That's something that happens all too often in windows, and I'll be damned if it happens in hardware without me bitching up a storm.
  
  I regularly see NetBEUI, DECnet, and IPX on the systems I work with, and even something stranger from time to time. That was undoubtedly one of ethernet's core strengths, the ability to be protocol agnostic. On every other physical/logical protocol, you always had to jump through hoops to use a different protocol, but ethernet just doesn't care.
  
  I do care about more than just TCP/IP. If you weren't a fool, you would too.
  
  Suggestion to moderator: Flamebait, yes. Troll, no. If you are too stupid to see the difference, then I'm sure the meta-moderator won't be.
Linux has better zerocopy TCP, and here's why (Score:4, Informative)

by Jamie Lokier ( 104820 ) writes: on Monday May 06, 2002 @11:56PM (#3475187) Homepage

True zerocopy has certain hardware and driver requirements. These are the network drivers in linux 2.5.9 which support zerocopy TCP: 3c59x, acenic, sunhme, 8139cp, e100, ns8320, starfire, via-rhine, sungem, e1000, 8139too, tg. (Disclaimer: Not all cards supported by those drivers necessarily support full zero copy). That's from grepping for the NETIF_F_SG and at least one of the NETIF_F_(IP|NO|HW)_CSUM flags.

In Linux, zerocopy is performed using the sendfile(2) system call, rather than writing to a socket from a memory-mapped file, as you are meant to do with the new BSD code. Although the mmap method is a neat way to make a few existing programs faster, it is less efficient than the sendfile() method, to some degree, and certainly more complicated to implement.

A write-from-mmap implementation has to provide a certain allowances for user space behaviour. Although it's advised not to touch the pages from user space, allowance for this basically require the OS to "pin" pages, either by modifying page tables which implies TLB and page walking cost (if the pages are actually mapped, which they probably are not in a Samba/www/ftp server), or by at least pinning the underlying page cache pages in case someone does a write() to the mapped file. sendfile() does not require the pages to be pinned, because it provides different guarantees about which data is transmitted if the data is being modified concurrently.

Another nice thing about sendfile() is that it's quite fast even for small files. The overhead of calling mmap() and then munmap() may outweigh the copying time for a small transmission. Basically, why bother with mmap/write/munmap when you can just do a sendfile, which doesn't require the kernel to jump through hoops to decode what you meant.

Well, I don't know if it makes much difference in performance if you only mmap() a file without referencing the pages from user space, and write it to a socket. We'll have to wait for the numbers.

But there is another great thing about sendfile! You can use it to transmit user-space generated data, such as HTTP headers, too. This is done by memory-mapping a shared file (such as a pure virtual memory "tmpfs" file, but you can use a real disk file too). Then you can write to that mapped memory from user space, and call sendfile() to transmit what you have just generated.

You can do what I just described, with a mapped, shared file, using the new BSD zerocopy patches too. If using sendfile(), the weaker concurrency guarantees of sendfile() vs. write() mean it is your responsibility to not modify the data until you are sure it's been received at the far end. In some ways user space has more responsibility, to be carefully manage the data pool with this method of using sendfile() for program-generated data, than using BSD-style write(). On the other hand, that's exactly why the BSD kernel must do more work of pinning pages, and in this mode of usage there is definitely TLB flushing cost and cross-CPU synchronisation cost, so if you are really crazy for performance, sendfile() may just have the edge. (Well I expect so anyway, I haven't done performance comparisons).

By the way, write() from memory-mapped files has been discussed among linux kernel developers several times in the past, and each time the idea lost due to the feeling that page table manipulation is not that cheap (especially not on SMP), and now that we have sendfile... Well, if you were writing a really high performance user space server, you'd use sendfile anyway so writing from mmap becomes a bit moot.

Finally, zerocopy UDP is not implemented in linux at present as far as I know, but some gory details were discussed recently on the kernel list so it is sure to arrive quite soon. The difficult infrastructure (drivers, page-referencing skbuffs) which is used by the zerocopy TCP implementation has been part of the 2.4 kernel since 2.4.4 (more than 1 year ago) and I believe it has been thoroughly tested since then.

Enjoy,
-- Jamie Lokier

Share
twitter facebook
- FreeBSD has zerocopy sendfile... (Score:5, Informative)
  
  by lamontg ( 121211 ) writes: on Tuesday May 07, 2002 @03:59AM (#3475802)
  
  A basic bit of research by running 'man sendfile' on a FreeBSD system would have told you this:
  
  SENDFILE(2) FreeBSD System Calls Manual SENDFILE(2)
  
  NAME
  sendfile - send a file to a socket
  
  LIBRARY
  Standard C Library (libc, -lc)
  
  SYNOPSIS
  #include sys/types.h
  #include sys/socket.h
  #include sys/uio.h
  
  int
  sendfile(int fd, int s, off_t offset, size_t nbytes,
  struct sf_hdtr *hdtr, off_t *sbytes, int flags);
  
  DESCRIPTION
  Sendfile() sends a regular file specified by descriptor fd out a stream
  socket specified by descriptor s.
  
  [...]
  
  IMPLEMENTATION NOTES
  The FreeBSD implementation of sendfile() is "zero-copy", meaning that it
  has been optimized so that copying of the file data is avoided.
  
  Parent Share
  twitter facebook
  - Re:FreeBSD has zerocopy sendfile... (Score:2, Interesting)
    
    by Espen Skoglund ( 204722 ) writes:
    
    And you also forgot to mention:
    HISTORY
    sendfile() first appeared in FreeBSD3 .0. This manual page first appeared in FreeBSD 3.1.
  - Ok, you're right. (Score:1)
    
    by Jamie Lokier ( 104820 ) writes:
    
    You're right, (hangs head in shame...).
    I had a nagging feeling when I wrote the article ;-)
    
    Ah well, at least I managed to explain some of the differences between mmap-write and sendfile.
    
    Thanks,
    -- Jamie
    - Re:Ok, you're right. (Score:2, Insightful)
      
      by thallgren ( 122316 ) writes:
      
      Right, but you completly failed to explain why the sendfile-only/no-mmap zero-copy in Linux is better. Instead you mention that one should create files in tmpfs and use sendfile on that. That can hardly be zero-copu anymore, can it?
      
      Regards, Tommy
- - Re:Linux has better zerocopy TCP, and here's why (Score:1)
    
    by qeL3-i ( 577868 ) writes:
    
    The GNU manpage for sendfile says it's not portable and shouldn't be used. That's good enough for me, don't use it.
    
    VERSIONS
    sendfile is a new feature in Linux 2.2. The include file
    is present since glibc2.1.
    
    Other Unixes often implement sendfile with different
    semantics and prototypes. It should not be used in
    portable programs.
    - Re:Linux has better zerocopy TCP, and here's why (Score:1)
      
      by lamontg ( 121211 ) writes:
      
      No, the GNU manpage says that it should not be used in programs that you expect to be easily portable. That is a warning to the programmer that if they use sendfile() that they will likely need to code for multiple different APIs, and will need to use tools like GNU autoconf to work around differences. That does not mean that sendfile() "should not be used."
      In fact, apache 2.0 now does exactly this -- determines the version of sendfile() on your system and implements the appropriate code to use it.
      Slashdot: Bored at work? Come explain elementary coding practices to 15 year old hax0rs!
- hoo baby (Score:3, Insightful)
  
  by AdamBa ( 64128 ) writes:
  
  Don't know anything about Linux, but compared to NT this seems incredibly complicated. In NT you can take any user buffer, probe it, map it, and lock it. Don't have to worry about "user space behaviour" and whatnot. If the user is doing synchronous write()s (or whatever) the call won't even return until the kernel code tells says it is done with it. And for asynchronous, you would hope the user won't be walking all over the buffers it has handed over until the call is done. And in all cases the kernel components handle the buffers in the same way, you don't need your network card to indicate with a flag whether it can do zero-copy sends (?!?!?!?!?!?!?).
  - adam
  - NT's method is the same as the new BSD code (Score:2, Insightful)
    
    by Jamie Lokier ( 104820 ) writes:
    
    I know that NT has been able to pin user buffers forever, and that can be used for synchronous I/O. It's not suitable for asynchronous I/O, though: it's no good "hoping" the user doesn't overwrite buffers until the TCP transmitted data has been acknowledged. Applications are often written to assume they can call the equivalent of the unix write() call, and then overwrite their application buffer to prepare for another write.
    
    You said that all kernel components handle buffers in the same way, and thus all network cards handle zero-copy sends. In fact this is not true for all supported network cards: some cards simply don't have the hardware to transmit a packet that is in physically discontiguous memory. And that's what you have when doing zero-copy sends from user buffers. So, either NT's kernel or the vendor-supplied device driver must copy the data into contiguous memory, or onto the card itself (typical for ISA cards). When that happens it isn't zero-copy any more, although it is transparent to the application -- just like the Linux and BSD mechanisms!
    
    have a nice day,
    -- Jamie
    - Re:NT's method is the same as the new BSD code (Score:2)
      
      by AdamBa ( 64128 ) writes:
      
      I meant asynchronous I/O where the app is expecting it to be asynchronous. With sockets there is some way to wait for a write to complete (which I don't remember offhand, I'm much more familiar with the kernel parts), and using the NT API you can do a wait...I think even with Win32 file APIs you can wait. But I think the default is to have I/O be synchronous, so a dumb app will work. And I disagree that you can't expect an app doing asynchronous I/O not to touch a buffer until the write is done. That's just a broken app. What if it is multithreaded and changes the buffer during the write? What if it zeros it out BEFORE the write?!? Etc etc.
      It's true some cards can't do a busmaster send from discontiguous buffers. So they do have to copy to contiguous memory, or to card memory, or do DMA, or whatever...but that's something the card driver has to just deal with. Nobody above the network card driver has to care.
      So it sounds like the network card flag you mentioned is just an indicator of the ability to send discontiguous buffers directly? That is, a status indication, not something an app has to worry about....so that's fine, I withdraw my ragging on that.
      - adam
- Re:Linux has better zerocopy TCP, and here's why (Score:1)
  
  by thorpej ( 316096 ) writes:
  
  Yes, Linux does have sendfile(2) while NetBSD does not. No argument there. However, sendfile(2) has some issues:
  
  1. It only works for sending complete files, and I seem to recall that it closes the file at the end of the transaction. This doesn't work for e.g. Samba servers, which need to send chunks of files.
  
  2. sendfile(2) doesn't work for data sourced from somthing other than a file. Consider a database server which maintains a memory-resident cache. Or consider piping output from a command, say, dump(8), over the network. Or consider the case of an iSCSI target device, which might have mmap'd a chunk of disk/file blocks to serve on-demand.
  
  My change works for those 2 (important!) cases above.
  
  That is not to say that NetBSD won't get a sendfile(2)-like mechanism in the future (it will probably get a splice(2) system call, which allows you to hook together 2 arbitrary file descriptors, one source, one sink -- essentially a generalization of sendfile(2)).
  
  It's also worth noting that the NetBSD zero-copy TCP/UDP transmit path doesn't require anything special from a device driver; the driver simply needs to be able to DMA from arbitrary memory addresses. And even if a device can't do this, you have still reduced memory bandwidth consumption by eliminating the copy from user space to kernel space.
  - - Re:Linux has better zerocopy TCP, and here's why (Score:1)
      
      by thorpej ( 316096 ) writes:
      
      Regarding sendfile(2)'s semantics ... thanks for correcting me on that. (I must admit it has been a while since I looked at specific sendfile(2) implementations.)
      
      Regarding Tx DMA alignment for cheap PCI cards... I have written a fair number of Ethernet drivers over the past several years, and I can't think of any chip that required any more than 4-byte alignment of the DMA buffer. The ones that spring to mind are the RealTek 81x9, Xircom X3201, and the VIA "Rhine" series... oh, and e.g. the Alchemy Au1000 built-in Ethernet MAC also requires 4 byte alignment.
      
      But by far the common case is that the chip can DMA from an arbitrary byte location in memory. I don't really consider this a "special" feature. Rather, I consider chips that can't do this to be "crippled".
      
      (Note, it IS fairly common for Ethernet chips to have stupid limitations on the *receive* side, specifically 4-byte alignment of the Rx buffer, which means the IP header ends up misaligned after the 14-byte Ethernet header. This is truly annoying, since software has to copy data to fix it up.)
      
      Regarding the 3 things you have to do to do zero-copy. NetBSD's virtual memory system has a generic framework for handling "loaning" of pages from one VM object to another. The uvm_loan facility is currently used by pipe(2) and socket writes (new with my changes). That said:
      
      1. Yes, we support full correctness of the written data. If another thread (or the same thread) touches the page before the kernel is finished with it, the loan is considered broken, and a copy-on-write fault is taken to resolve the situation.
      
      2. This is really the same question as (1). The pages aren't marked as "write in progress", per se, but are marked as "loaned" (loans can be used for things other than just outbound I/O, although that is the most obvious use of the facility).
      
      3. Yes, there is TLB maniuplation traffic. This is why you pick a threshold for using the loaning facility. Doing it for small writes would be stupid, since the copy would be less overhead than the TLB traffic. That said, even in an MP system, it's not too bad, since NetBSD uses explicit barrier operations for low-level VM operations (so that the machine-dependent "pmap" module can coalesce TLB operations if it would be beneficial to do so). In any case, the expense of TLB shootdown traffic is largely an implementation issue.
*BSD needs our help! not our pity :-) (Score:1)

by freebsddude ( 556159 ) writes:

We see a lot of posts indicating that *BSD is dying. OTOH, we see a lot (literally a LOT!) of enthusiastic and excited FreeBSD (*BSD) newbies looking for help with basic *BSD questions.

An important question we all need to ask ourselves: do we want to be plain sideline observers, or do we choose to help grow *BSD? Again, I am sure all of you *BSD geeks will agree that it would be in our best interests to promote *BSD!!!

This is not an infomercial on our part, but a request for all you "expert" *BSD geeks to give back to the *BSD cause by PARTICIPATING, visiting your favorite *BSD sites (or whatever else !!!), and by helping answer newbie questions.

Newbies sometimes feel intimidated/overwhelmed by mailing lists, complex howtos, etc. and dont exactly know where to start. Some prefer asking simple questions in forums, or following simple howtos, etc. It would be in our best interest to encourage these folks and turn them towards FreeBSD, OpenBSD and NetBSD!! Encouraging them will promote *BSD committment (committers) with ports/applications, OS and security enhancements. *BSD needs our help!!
- Re:*BSD needs our help! not our pity :-) (Score:1)
  
  by kwerle ( 39371 ) writes:
  
  Funny post - BSD is now the highest volume *nix in the computer biz. Thanks to Apple shipping on Darwin. What's more, users *REAL USERS* don't give a hoot about "basic *BSD questions". They just want it to work.
  
  And it can. And it does.
  
  OK, so maybe this is a troll. But maybe it's insightful, and maybe it's just funny...
OpenBSD and NetBSD? (Score:2)

by cpeterso ( 19082 ) writes:

I wonder if OpenBSD will copy this new NetBSD code. Does anyone know which BSD has more users, OpenBSD or NetBSD? I hear few people ever talking about NetBSD, but many of the OpenBSD changelogs have comments like "copied this from NetBSD". Does OpenBSD simply "ripoff" NetBSD development?
- Re:OpenBSD and NetBSD? (Score:2, Insightful)
  
  by keramida ( 41339 ) writes:
  
  > Does OpenBSD simply "ripoff" NetBSD development?
  Nope. The sharing of code and ideas among the various *BSD implementations is a Good Thing(TM). Developers from more than one groups that join their efforts and write code that is easy to port, clean, and useful to more than one of the BSDs are out there. And they have been doing a great job for quite some time. There's no ripping off, but a cooperative spirit that we should be glad for.
  To use your example, what if the "ripping off" of code that OpenBSD is accused for, helps in revealing some bug that is caused by assumptions that are true only for NetBSD? Is then the "ripping off" justified and part of a "development process" or is it still a bad thing? What if the OpenBSD folks write back to the original authors and help in fixing possible problems with the NetBSD codebase? Aren't they then assisting in making NetBSD a better OS too?
  - Giorgos
- Re:OpenBSD and NetBSD? (Score:1)
  
  by alan_d_post ( 120619 ) writes:
  
  The neat thing about information is that copying it does not put any burden on the original author. Theo forked NetBSD -- you could too, if you wanted. It hasn't exactly killed NetBSD. Changes flow between all three projects.
  
  Of course, I've probably just been trolled . . . .
- Re:OpenBSD and NetBSD? (Score:1)
  
  by Art Deco ( 529557 ) writes:
  
  As others said OpenBSD was an offshoot of NetBSD. NetBSD itself started life as a system that combined the Jolitz's 386BSD with collected patches so you could say that NetBSD "ripped off" 386BSD. Since all the software is free nobody is ripping anyone else off though. Since OpenBSD is based in Canada they can ship their system with strong encryption since they are not subject to the USA's fascist crypto laws.
  - Re:OpenBSD and NetBSD? (Score:1)
    
    by yerricde ( 125198 ) writes:
    
    Since OpenBSD is based in Canada they can ship their system with strong encryption since they are not subject to the USA's fascist crypto laws.
    
    The USA crypto laws are no longer fascist. The Bureau of Industry and Security (BXA) [doc.gov] has lifted most restrictions on exporting free [gnu.org] cryptographic software.
Overlapped IO / Win32 (Score:2)

by strags ( 209606 ) writes:

Can someone tell me if Overlapped IO under Win32 is also zero-copy? It seems like it could be - basically you pass buffers to the OS, and the OS lets you know when it's finished with them. This works for input and output.

The other nice thing about OIO is that you don't have to populate your FDSET every time you do a select - which means if you're writing a server app with thousands of simultaneous connections, it's a whole lot faster.

Is there a Linux/BSD equivalent to this?
- Re:Overlapped IO / Win32 (Score:1)
  
  by thorpej ( 316096 ) writes:
  
  POSIX defines an asynchronous I/O interface, "aio", and it is implemented in several Unix variants. It pretty much has the semantics you describe.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Re:Packet loss (Score:5, Informative)

Re:Packet loss (Score:1)

good point (Score:2)

Re:Packet loss (Score:1)

Port it to Linux!! (Score:1, Troll)

Re:Port it to Linux!! (Score:1)

Re:Port it to Linux!! (Score:1)

Re:Port it to Linux!! (Score:2)

Re:Port it to Linux!! (Score:1)

what about zero copy on receive? (Score:3, Interesting)

Re:what about zero copy on receive? (Score:2)

Re:what about zero copy on receive? (Score:3, Insightful)

Re:what about zero copy on receive? (Score:3, Insightful)

Re:what about zero copy on receive? (Score:1)

Re:what about zero copy on receive? (Score:3, Interesting)

Re:what about zero copy on receive? (Score:2, Insightful)

Re:what about zero copy on receive? (Score:2)

Re:what about zero copy on receive? (Score:2)

Linux has better zerocopy TCP, and here's why (Score:4, Informative)

FreeBSD has zerocopy sendfile... (Score:5, Informative)

Re:FreeBSD has zerocopy sendfile... (Score:2, Interesting)

Ok, you're right. (Score:1)

Re:Ok, you're right. (Score:2, Insightful)

Re:Linux has better zerocopy TCP, and here's why (Score:1)

Re:Linux has better zerocopy TCP, and here's why (Score:1)

hoo baby (Score:3, Insightful)

NT's method is the same as the new BSD code (Score:2, Insightful)

Re:NT's method is the same as the new BSD code (Score:2)

Re:Linux has better zerocopy TCP, and here's why (Score:1)

Re:Linux has better zerocopy TCP, and here's why (Score:1)

*BSD needs our help! not our pity :-) (Score:1)

Re:*BSD needs our help! not our pity :-) (Score:1)

OpenBSD and NetBSD? (Score:2)

Re:OpenBSD and NetBSD? (Score:2, Insightful)

Re:OpenBSD and NetBSD? (Score:1)

Re:OpenBSD and NetBSD? (Score:1)

Re:OpenBSD and NetBSD? (Score:1)

Overlapped IO / Win32 (Score:2)

Re:Overlapped IO / Win32 (Score:1)

Related Links Top of the: day, week, month.

Slashdot Top Deals