BSD Operating Systems

Preparing for the Worst in FreeBSD 286

LiquidPC writes "In Part I of this series, Michael Lucas, from ONLamp.com, goes over preparing your FreeBSD computer for the worst in case of a system panic."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    I am a Computer Information Systems Professional [devry.edu] at a major Fortune 500 corporation. Very recently the head of our IT department decided that we were going to switch every one of our networks over to Windows XP Professional. We had previously been running OpenBSD on all our quad processor Xeons. Some of them had had uptimes approaching a year! My personal favourite, Gerbil, had been running without a reboot for three years.

    One day one of those Microsoft shills that you often read about on the Register [theregister.co.uk] came by for a visit. I grew very suspicious about what was going on when my boss and the Microsoft representative walked by my desk, and entered the server room. I could hear muffled voices through the closed door. The Microsoft representative was asking what we were running on our servers! My worst fears had come true. I sat at my desk for the rest of the day, silently awaiting the bad news. The news did not come until the next day. It was worse than I had feared. We were to be a Microsoft only shop from that day on! I could not believe it. The Microsoft representative had told my boss that the operating and support costs would actually go down. And my boss had fully bought into it, hook, line, and sinker.

    Tough times hit our company in the last month, and we were forced to lay off a few of the less experienced IS/IT workers. One of them took this rather hard. As a last minute attempt at corporate sabotage, he decided to change all of the Computer Administrator passwords on a few of the XP Professional boxes sitting around in the server room. This caused absolute havoc, as Dell had failed to send along administrator passwords for the new boxes. Our company could not make use of these computers for three days. It took Dell that long to get us the administrator passwords. It is strictly because of Microsoft's poor implementation of a multi-user computing environment that our company lost three days of productivity.

    Needless to say, I had our quad Xeons back running OpenBSD by the end of the week. Gerbil is back on its way to another glorious 3 years of uptime.
    • by buffy ( 8100 ) <buffy@p a r a p e t .net> on Sunday March 31, 2002 @01:24AM (#3258486) Homepage
      First some nit-picking...

      Very recently the head of our IT department decided that we were going to switch every one of our networks over to Windows XP Professional.

      Windows is an Operating System, not a network. Your network probably "runs" TCP/IP, Netbios, and a handful of other protocols. Windows runs on desktops, laptops, and servers.

      he decided to change all of the Computer Administrator passwords on a few of the XP Professional boxes sitting around in the server room. This caused absolute havoc, as Dell had failed to send along administrator passwords for the new boxes. Our company could not make use of these computers for three days. It took Dell that long to get us the administrator passwords.

      This last paragraph is a touch more concerning...first off, every Windows box I've purchased from Dell, or others, has had no administrator password, or one set to "admin". Why would Dell have set specific passwords for your systems? I'm just a little bit confused.

      On a related point, even for those systems that come pre-installed with an OS, it's [my] standard practice to bare-iron re-install from scratch. I'm not a huge fan of MS (quite the opposite), however, in the hands of someone who has a solid understanding of operating systems, it IS possible to build a stable Windows box. I have an NT 4 server, running a database, and a mail exchange, that has an uptime of 94 days. It was rebooted for a disk addition. It was up 86 days prior to that (its installation date.)

      That said, I prefer and use Linux and Solaris much more frequently, and, unlike the windows example above, am not surprised by the continued uptime of my hosts! ;)

      Now, I've gotta ask...why did you just sit at your desk waiting for the bad news?? I (and my VP) have received visits from MS cronies in the past. The thing is, those people are sales/marketing weenies. Get in on the meeting, and use your own skills to ask very pointed questions. It's not very difficult to run circles around these droids. Keep it calm, polite, and just bury them in the technical truths which they simply cannot refute. If they try to call you a "Linux zealot" you know you're on the right track, and they're in the process of losing their cool. As long as you keep it together, and don't let them change the topic, I've found that it's pretty easy to expose others in my company to MS's shortcomings...right in front of MS folks themselves.

      If you just sit back and let non-techs make tech decisions without, at least, making them aware of the ramifications of such things, then you really can't blame them. It's kind of what they say about voting, right? If you don't vote, you don't have the right to complain?

      Now, if you work in a super huge corporation where such things are a fact of life, I'm sorry, and you probably don't have a choice. Well...other than to extract yourself from between Mr. Rock, and Mr. Hardplace.

      • Talk about hook, line and sinker! The mere mention of 'OpenBSD running on quad processor systems' should have set alarm bells off in your little head.

        As an OpenBSD user, I am well aware that it does not support more than one processor. [openbsd.org] Ooh you have been so trolled. Priceless.

        • FreeBSD is what the article is referring to, not OpenBSD. FreeBSD does in fact support SMP [freebsd.org]
      • by Anonymous Coward
        Do NOT mod up people who have been blatantly trolled. This is simple common sense. The person who was trolled is a jackass for losing, but the person[s] who modded his post up is a COMPLETE FUCKING IDIOT.
    • WHO CARES! This is soooo blatantly off-topic, you have to be a huge greasy TROLL, just out of the hole!
  • What? (Score:5, Funny)

    by tcd004 ( 134130 ) on Sunday March 31, 2002 @12:53AM (#3258247) Homepage
    Where are the color-coded states of emergency? This is no respectable anti-panic plan.

    Witness the rebirth of ENRON! [lostbrain.com]
    tcd004
    • Re:What? (Score:1, Funny)

      by 56ker ( 566853 )
      The different color coded states of emergency are reflected in your face - going from normal sysadmin pasty white to even whiter when you realise it's crashed - then purple when you realise you've got no backtrace - and therefore no hope of fixing the problem.
    • Re:What? (Score:4, Funny)

      by Loligo ( 12021 ) on Sunday March 31, 2002 @04:18AM (#3259098) Homepage
      >Where are the color-coded states of emergency?

      Courtesy of IMDB and Red Dwarf...

      Rimmer: We can't afford to take any chances. Jump up to red alert.

      Kryten: Are you sure, sir? It does mean changing the bulb.

      -l
  • "To prepare for a kernel panic, you need the system source code installed. You need one (or more) swap partition that is at least one MB larger than your physical memory and preferably twice as large as your RAM. If you have 512MB of RAM, for example, you need a swap partition that is 513MB or larger, with 1024MB being preferable." And people bash Windows for its lack of stability. I'm sorry, but an OS that can crash for seemingly no apparent reason, can barely be fixed, and requires a bunch of preparation just to prepare is too complicated for me. If I were a server admin with a few years' experience with this OS and going the long way around to ensure a smooth ride, I might be more enthusiastic about the whole thing. At least with Windows-based OSes all you need is a bit of veteran intuition and skill to find out what is wrong. Even if the problem isn't obvious, the solution usually is, or it's easy to figure out.
    • "requires a bunch of preparation just to prepare". Yeah that's what sucks about preparation all right.
    • by Brett Glass ( 98525 ) on Sunday March 31, 2002 @01:21AM (#3258463) Homepage
      You write:
      I'm sorry, but an OS that can crash for seemingly no apparent reason, can barely be fixed, and requires a bunch of preparation just to prepare is too complicated for me.

      And you run Windows?

      --Brett Glass

    • Ever seen a Windows NT "STOP" error? It pops up during the boot process in a nice, handy little blue screen, such as the following:

      STOP 0x0000000A(0x04053292, 0x00000013, 0x0000001, 0x04A754F0)

      IRQL_NOT_LESS_OR_EQUAL

      Okay... using your "veteran intuition and skill", tell me what's wrong using only this information. You see, even if FreeBSD (and assorted other Unix-like OSes) need extra preparation to find out what's wrong, at least you *can* find out what's wrong.

      (Yes, as a remotely competent MS sysadmin, I know about core dumps and so forth, but the FreeBSD solution [a symbolic backtrace] is far better. Also, by the way, this is essentially an access violation that was done in kernel mode... which means you're still no closer to finding the answer.)

      • Comment removed based on user account deletion
        • Re:Too Complicated (Score:2, Insightful)

          by Thatman311 ( 316281 )
          Oh you're so so so wrong. A stop 0xA is purely a driver bug. It typically occurs when a driver tries to touch pageable memory at high IRQL (like in a DPC) and that memory is actually swapped out to the swap file. That particular case is due to poor programming practice on the driver writer's part. They should have allocated that memory as non-paged. Also, before they shipped that driver they should have run "verifier" with special IRQL checking enabled. (For those who don't know what verifier is, it is a built-in tool that is used to test device drivers [old and new]. If you are running a Win2k or WinXP box just open up the run line and type in verifier. You will get this program. Unless you know what you are doing and have a kernel debugger enabled and attached I wouldn't fuck with its settings, or you may be looking at a blue screen due to a bug verifier found and you may not know how to recover from it [without reinstalling].) If you want a definition of all of the bugchecks and what each parameter means, download the latest debugger from http://www.microsoft.com/DDK/Debugging/default.asp and look in the help file.
          • That may be true, but I ran into this on a friend's computer...but it wasn't a software problem. Either he had a remarked CPU or (very unlikely) heat problems, or something else on his system (not RAM, tried that right off the bat) was marginal, because the problem went away when the CPU was underclocked.

            Just pointing out that knowing the low-level cause (Ah, yes, that's when the network stack's detected an inconsistent internal state) may not be very useful in finding the high level cause.
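For anyone who wants to pick apart the STOP line quoted earlier in this subthread, here is a rough Python sketch. The function names and the field labels are made up for illustration, and the meanings given for the four 0xA parameters are my reading of the debugger help file mentioned above, so check them against that documentation before relying on them.

```python
import re

# Labels for the four parameters of bug check 0xA (IRQL_NOT_LESS_OR_EQUAL).
# Assumption: these follow the debugger help file; the wording is our own.
STOP_0A_FIELDS = (
    "memory address referenced",
    "IRQL at time of reference",
    "access type (0 = read, 1 = write)",
    "address of the code that made the reference",
)

# Matches e.g. "STOP 0x0000000A(0x..., 0x..., 0x..., 0x...)"
STOP_RE = re.compile(r"STOP\s+(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (bugcheck_code, [parameters]) as integers."""
    match = STOP_RE.search(line)
    if match is None:
        raise ValueError("not a STOP line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(line):
    """Return a human-readable, multi-line description of a STOP line."""
    code, params = parse_stop_line(line)
    out = ["bug check 0x%X" % code]
    if code == 0xA:
        for label, value in zip(STOP_0A_FIELDS, params):
            out.append("  %s: 0x%08X" % (label, value))
    else:
        # Other bug checks use different parameter meanings; just number them.
        for index, value in enumerate(params, 1):
            out.append("  parameter %d: 0x%08X" % (index, value))
    return "\n".join(out)


if __name__ == "__main__":
    print(describe(
        "STOP 0x0000000A(0x04053292, 0x00000013, 0x0000001, 0x04A754F0)"
    ))
```

As the reply above points out, this only tells you the low-level cause. It says nothing about whether the real culprit is a driver or marginal hardware.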
    • Re:Too Complicated (Score:1, Insightful)

      by Anonymous Coward
      In 8 years of running bsd, I've NEVER had a kernel panic. The article is just about the kind of thinking that prevents kernel panics in the first place: careful precautions.

      Can you say the same about linux? Never a kernel panic? Never a corrupted file system? Never a bad kernel release? Hardly.
      • Actually, I've had kernel panics in every OS I installed on this machine for the first week after I built it. Windows crashed about 20 minutes after I started it for the first time. FreeBSD bombed shortly too. I decided to try Slackware 8.0. It crashed too. I started to debug, turned out to be a bad 256MB DIMM from Techtronics. Got a replacement for free.

        I've also seen MacOS X kernel panic (different machine, my iBook). Not in everyday use, though. I've only seen it once: I started the computer up with the TV cable in it. The kernel panicked even before the GUI started. It was neat.
      • Actually, I had one about 30 seconds ago, on this box, and oddly enough it's still ticking (quite strange). Of course I'm working on a device driver at the moment, so I actually know whose fault it is :)
      • I've gotten a bunch of panics on my iBook (that's bsd, right?)--certainly far more often than I've gotten panics in linux.
        • Actually, MacOS X is an updated OPENSTEP (Mach microkernel), with the userland stolen from BSD and a Macish GUI.
          • Got an iBook, do 10 to 15 large driver builds/day with it -- no problems *at all*. Most reliable machine I have ever owned (though these two new VAIOs are starting to compete -- nice to see MS finally get the reliability on).

            David
            • Oh, sorry, I guess I was just smoking crack when hideous white on black console text overwrote the nice pretty quartz display with lots of hexadecimal numbers and lovely words like "PANIC"

              Maybe when OS XI comes out I'll think "Apple" and "stability" without getting the giggles. I love my iBook, but it is definitely the least stable machine I currently own.

              Because a bug still exists even if you never see it on your machine, no matter how many asterisks you add to your post...

      • I've seen FBSD crash... It was while it was shutting down; didn't cause any problems and I haven't seen it since.

        Other than that, I haven't seen it crash in more than 2 years.
  • Big Scary Daemons (Score:5, Informative)

    by Alien54 ( 180860 ) on Sunday March 31, 2002 @01:08AM (#3258359) Journal
    It is a bit easier to read without the ads, using the printer friendly page:

    Big_Scary_Daemons.html [onlamp.com]

    Yep, that is the name of the page.

    Michael Lucas lives in a haunted house in Detroit, Michigan

    Maybe we could move the ghost to Seattle?

  • Hardware prob (Score:2, Informative)

    by mcice ( 212918 )
    Panic 12 as described in the article is most likely a hardware fault somewhere on the mainboard. It is by far the most common cause of a panic on FreeBSD. Swap the mainboard, CPU and memory for known-working components and you are back up and running without the panics.
    • by Anonymous Coward
      Yes. I've yet to encounter a FreeBSD kernel panic that wasn't a hardware issue; the two recent ones that I've had have been memory related. Now, my standard mode of operation is to put in a memtest86 floppy and reboot before I do anything else.
  • by vrmlguy ( 120854 ) <samwyse&gmail,com> on Sunday March 31, 2002 @01:10AM (#3258383) Homepage Journal
    I'm a Sun admin by day, and Sun has always (since at least SunOS 4.1, when I started) made provisions to do this. I'll admit that I'm rarely cutting-edge with my Linux systems, so I haven't had any panics that I wanted to track down, so I don't know if Linux does this sort of stuff for you. I'm shocked that OpenBSD doesn't.
    • by tftp ( 111690 ) on Sunday March 31, 2002 @01:16AM (#3258436) Homepage
      I don't know if Linux does this sort of stuff for you

      On Linux, the kernel prints the backtrace on the console, and into the syslog if it can. Later you can run ksymoops on this backtrace to match it to the symbolic names. This requires no preparation, but since I never saw FreeBSD backtraces I can't say if it is of a similar detail level.

      • I think sorting out backtraces on FreeBSD is one of those things that with hindsight is a good idea - but at the time seemed like too much bother. For instance, how many people regularly back up their hard drives? It's just good practice - but most users can't be bothered to do it.
      • One thing that helps with Linux is to enable console output to a serial line (if you have one spare). Then you can capture the OOPS message on a terminal emulator - time to fire up that HyperTerm on that Windows laptop that none of us has.
      • > since I never saw FreeBSD backtraces I can't say if it is of a similar detail level.

        It's effectively a big-ass core file you can run gdb on. Probably a tad more detailed than anything that will fit in syslog :)
    • I'm shocked that OpenBSD doesn't.

      OpenBSD was not discussed in the article.

      FreeBSD was.

    • I don't know why I typed "OpenBSD", I knew it was "FreeBSD", I guess that my fingers were typing ahead of my brain. Sorry if I offended anyone.
    • Is this a Sun hardware feature though? I mean the other day (after months and months of uptime) I had a kernel panic on the machine (11 year old SS10 running debian linux) that is eventually going to route this submit.

      Long story short I couldn't log in, but if I went to the console I could see the kernel messages (logged) and if I hit enter it popped back to the login prompt (didn't work though). Funny thing is it was still routing traffic and looking up dns names - despite the fact I couldn't log in or access the console. I eventually hit stop-a (full break for those of us without a keyboard/monitor) and reset the machine.
    • I think OpenBSD does:
      http://www.openbsd.org/faq/faq5.html#Options
    • Linux has a bit more painful way of doing it today, but you might check out the "Linux kernel crash dumps" page http://lkcd.sourceforge.net which was started by SGI to mimic how Irix does its crash dump analysis; now it's got both IBM & SGI backing along with the rest of the OSS world on sourceforge.
  • by Wheaty18 ( 465429 ) on Sunday March 31, 2002 @01:14AM (#3258422)
    'Any' key? Where's the 'any' key? I see 'ke-tarl', 'esk', and 'pig-uh', but there doesn't seem to be any 'any' key!

    Phew, all this computer hacking is making me thirsty.
  • by Anonymous Coward on Sunday March 31, 2002 @01:24AM (#3258493)
    This is most likely a hardware failure, possibly memory. Try memtest86 [memtest86.com] before you go on a kernel debugging hunt... basically, if your server has worked great for 12 months and then craps like this it probably ain't software.
  • by d_force ( 249909 ) on Sunday March 31, 2002 @02:19AM (#3258797)
    Usually, upgrades in the 4.x-RELEASE branch are made when selected improvements have been regression tested in the 5.x-CURRENT branch. Thus, if you're running a 4.x version, chances are you don't need to configure your system to do a full dump; usually there are people who've run into similar problems and you can search for the fixes via mailing lists/usenet/etc...

    For more info, check out the FreeBSD Release Engineering Page [freebsd.org]

    Disclaimer:
    Yes, there's a slight chance you might come across some new bug in the 4.x tree; however, it's unlikely.

    • Usually, upgrades in the 4.x-RELEASE branch are made when selected improvements have been regression tested in the 5.x-CURRENT branch.

      Uhh... do you know what "regression testing" is? It's definitely not the same thing as verifying that backported features work.

      A good definition is here [wikipedia.com].

  • Who cares? (Score:5, Informative)

    by seanadams.com ( 463190 ) on Sunday March 31, 2002 @02:28AM (#3258829) Homepage
    I've been running two FreeBSD systems for over seven years each. I've had to do a grand total of *ONE* reboot that I can remember, aside from powering down to swap hardware, update the kernel, or to move the equipment.

    It's a damn stable OS. One of these machines is a dual PII/400, serving 700-1000kbps day in day out, with hundreds of active TCP connections at any given time, starting 15-20 new processes per second. The other machine is for a single, fairly busy web site doing 700kbps traffic.

    FreeBSD is rock solid. I have absolutely no need to plan for a kernel panic.

    • by harlows_monkeys ( 106428 ) on Sunday March 31, 2002 @03:07AM (#3258950) Homepage
      FreeBSD is rock solid. I have absolutely no need to plan for a kernel panic

      That's the downside of extreme stability...stupid people can get admin jobs, and since the OS doesn't crash, there's no chance for the admin to demonstrate their idiocy and get fired.

      • I've administrated numerous systems running OpenVMS. For the record, OpenVMS is extremely stable, with reported uptimes commonly counted in years. There was a system somewhere with an uptime of like 17 years or somesuch.

        Anyway... We've found that when there are multiple admins, one of the dangers is that someone will edit a system startup file to start up something new that they've started manually. Often, the change they'll make has a mistake. This will cause confusion and problems surrounding the next reboot (typically for OS upgrade, HW change, HW failure, moving machines around).

        We've actually taken to reboots every 6 months or so when people who might change startup files are around so that we can catch these kinds of problems.

        Of course, the high availability systems are all clustered such that the customers don't really see one machine with problems anyway...

        I've often thought that a monitor that reports startup file changes would be a good idea. Never got around to writing it though.
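A monitor like that can be very small. Here is a minimal Python sketch, assuming a Unix-style box where the startup files are ordinary files; the file names in the usage comment (/etc/rc.conf, /etc/rc.local) and the baseline file are hypothetical, and on OpenVMS you would point it at whatever your startup procedures are. It keeps a SHA-256 digest per file and reports any file whose digest differs from the last run.

```python
import hashlib
import json
import os
import sys


def snapshot(paths):
    """Map each path to the SHA-256 hex digest of its contents (None if unreadable)."""
    result = {}
    for path in paths:
        try:
            with open(path, "rb") as handle:
                result[path] = hashlib.sha256(handle.read()).hexdigest()
        except OSError:
            result[path] = None
    return result


def changed_files(old, new):
    """Return the sorted paths whose digest differs between two snapshots."""
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))


def main(baseline_file, paths):
    """Report changes since the stored baseline, then store a new baseline."""
    current = snapshot(paths)
    if os.path.exists(baseline_file):
        with open(baseline_file) as handle:
            baseline = json.load(handle)
        for path in changed_files(baseline, current):
            print("startup file changed: %s" % path)
    with open(baseline_file, "w") as handle:
        json.dump(current, handle, indent=2)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: monitor.py baseline.json /etc/rc.conf /etc/rc.local
    main(sys.argv[1], sys.argv[2:])
```

Run it from cron and mail yourself the output. It won't tell you whether the edit was a mistake, only that someone made one, which is enough to prompt a review before the next reboot rather than after it.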

      • like MCSEs (not all, but more than enough) who just hit the reboot button? Let's face it, there are more than enough idiots who are employed AND demonstrate their idiocy on a regular basis but still manage to stay employed. Well idiots need jobs too, and you can't fit them ALL in the military...
    • by Anonymous Coward
      Seven years ago?

      One of these machines is a dual PII/400

      Whatever.

      Was BSD Dying in 1995 as bad as it is now?

      • Seven years ago?

        Yep - since back when Linux was still a play thing. For the first 6 months or so, the servers ran Solaris - big mistake cost-wise.


        One of these machines is a dual PII/400
        Whatever.


        I said "is", not "has been since 1995", you dumbfuck. BTW, you may be astonished to learn that the latest 2GHz machines are total overkill for most web sites serving <5Mbps, which is why I haven't had to upgrade since the PII days. I forgot to mention... BSD is *fast* too.

        Was BSD Dying in 1995 as bad as it is now?

        Clearly, no. Look at the numbers!

        I run linux on my desktop, where I need bleeding edge hardware support and the widest software compatibility. For the servers, FreeBSD has never let me down. You should give it a try.
    • If you don't run devel kernels, I imagine that Linux, BSD, Solaris, and just about any decent UNIX knockoff are pretty stable.
      • Re:Who cares? (Score:3, Insightful)

        by tftp ( 111690 )
        Today faulty or poorly supported hardware is a much more likely reason for a crash. I have quite a few K6-2 and K6-3 boxes around, and they die like flies, after 1 or 2 years of continuous use; most often the motherboard fails. I had a Linux box that crashed once in 2 weeks; I moved the HDDs into another computer, moved most of the cards, and it now averages 150 days of uptime, interrupted only by power outages (no UPS there). Another K6-3 box sometimes fails in BIOS, during memory test in POST routine! I gave up on this one; it is not worth my time. Needless to say, this box had all sorts of weird crashes in all OSes that I ran on it; NetBSD didn't even boot from the boot floppy, mumbling something about "garbage IDE DMA" :-)
    • I have absolutely no need to plan for a kernel panic

      Such famous last words.
    • Same here... I've been running 4 FreeBSD OS systems for the past 6 years. Works like a charm, however, I did have a system crash.

      I use Lone-Tar for FreeBSD as my backup solution. I simply did a quick re-install (took about 20 minutes with all defaults), then re-installed Lone-Tar and then restored my latest master. I was up and running again in 2 hours flat.

      I'm now working with Cactus [cactus.com] to create a disaster recovery (much like AirBag for SCO and Rescue Ranger for Linux) for FreeBSD.

    • I've been running two FreeBSD systems for over seven years each. I've had to do a grand total of *ONE* reboot that I can remember, aside from powering down to swap hardware, update the kernel, or to move the equipment. It's a damn stable OS. ...FreeBSD is rock solid. I have absolutely no need to plan for a kernel panic.

      who the hell gave this a 5 karma?
      ok, running a couple machines for seven years or even 100 years with one reboot says NOTHING about freebsd's kernel capability or strength.
      If you really want to give us some bragging material for freebsd and how kernel panic issues are unimportant, mention how much time your system(s) spend in kernel mode. Mention what hardware your kernel has to deal with. Are you using modules? What kind of filesystems are you using? Out of your "15-20 processes per second", do any of those process require a lot of paging? How much? How long do these processes last? Can your machine handle 65535 processes?

      you people make me sick the way you say your [insert OS here] is soooooooooooo stable, yet give no facts to back it up, just ONE or TWO cases out of the thousands of systems out there running the same OS (and kernel) in thousands of different environments. How do we know your OSs aren't in a clean lab? We don't. How do we know the system will hold up in another totally different environment? We don't. But we'll take your word for it since you have a couple fast computers on a fast DSL connection, running maybe a couple daemons, and because you have a 5 karma. Man, you don't realize how good you have it. Your situation is 100x easier to handle than the environments in all the fortune 500 corps. Anyway, I'm ranting, but trust me, EVERY *NIX is susceptible to kernel panics.

      Obviously you don't subscribe to freebsd-bugs@freebsd.org or you would have seen several kernel panic bugs mentioned during the past seven years.

  • "To prepare for a kernel panic, you need the system source code installed. You need one (or more) swap partition that is at least one MB larger than your physical memory and preferably twice as large as your RAM. If you have 512MB of RAM, for example, you need a swap partition that is 513MB or larger, with 1024MB being preferable."


    I've never been able to get a straight answer as to why the swap rule of thumb is double the ram. I guess that explains it, although since Linux puts the backtrace to the console and syslog maybe there is another reason as well...

    • I'm guessing here, but I think it's so that if all your ram is idle, it gets swapped out. Chances are it won't be, but just in case, you know?
    • A very long time ago (Think: SunOS 4.x), you had to have more swap than RAM because the amount of virtual memory you had was EQUAL to the amount of swapspace you had. That is, every page of RAM had to be backed by a page of swap, or else it wouldn't end up being useful (I'm oversimplifying a bit).

      Now, in order for FreeBSD to be willing to save a core image, you have to have a swap partition with more space than you have in RAM, otherwise savecore will refuse to set things up. But for FreeBSD, the amount of virtual memory you have is equal to the amount of RAM you have PLUS the amount of swap space you've got set up (again, there is some RAM that gets used to hold the kernel image, so this is a bit of a simplification). Given that, it is perfectly ok to run a machine without any swap at all, provided you have a sufficient amount of memory to do everything you want to do. But having swap is good because it gives you some cushion, plus if you want to save cores from panics, you must, as I said, configure a swap partition with at least as much space as you have RAM.
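To put numbers on the sizing rule quoted from the article (at least 1MB more swap than RAM, preferably double), here is a small Python sketch. The function names are made up for illustration, and the virtual memory figure uses the same simplification as the explanation above (RAM plus swap, ignoring what the kernel itself occupies).

```python
def dump_swap_requirement(ram_mb):
    """Return (minimum_mb, preferred_mb) of swap needed to hold a crash dump.

    Follows the article's rule of thumb: at least 1MB more than physical
    memory, preferably twice the amount of RAM.
    """
    return ram_mb + 1, ram_mb * 2


def can_hold_dump(ram_mb, swap_partition_mb):
    """True if a single swap partition meets the article's minimum size."""
    minimum_mb, _ = dump_swap_requirement(ram_mb)
    return swap_partition_mb >= minimum_mb


def virtual_memory_mb(ram_mb, swap_mb):
    """Rough total virtual memory on FreeBSD: RAM plus swap (a simplification)."""
    return ram_mb + swap_mb


if __name__ == "__main__":
    ram = 512
    minimum, preferred = dump_swap_requirement(ram)
    print("RAM %dMB: swap >= %dMB, preferably %dMB" % (ram, minimum, preferred))
    print("512MB swap partition big enough? %s" % can_hold_dump(ram, 512))
    print("virtual memory with preferred swap: %dMB"
          % virtual_memory_mb(ram, preferred))
```

So a machine with no swap at all runs fine as long as RAM covers the workload, but it has nowhere to put a dump when it panics.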
  • The article is informative and clearly written, but crashdumps are more useful for determining kernel software problems than hardware ones.

    If the system is a stable release, and has been running without crashes for about a year, I'd start by running diagnostics on the hardware - specifically, memory and disk - before trying to debug the kernel.

  • Ummm.... (Score:2, Interesting)

    by Anonymous Coward
    FreeBSD-3.5 hemsut 7:07AM up 822 days, 06:32, 2 users, load averages: 1.17, 1.15, 1.10

    What are you people complaining about?
  • ...it doesn't panic. It calmly states that there is some trouble, then it fixes it and makes your system run 10% faster to apologize.
  • Because if it is then by all means spend time doing this. Else, just spend some money on better hardware - probably memory. The cost of that hardware is probably far less than the cost of your labor.

    If you need to build an insto-recovery system for a network of identical machines, that is something different. By all means create an ability to rapidly rebuild a blown system and recover the last incremental backup. But otherwise don't try to make a hardware problem into a software solution.
