The users of an imaging system reported a new problem a couple of weeks ago.
The computer attached to the imaging system was not starting up reliably, so I was called in.
The problem
The problem seemingly started all of a sudden.
When they turned the computer on in the morning, it would run the BIOS power-on self-test (POST), then sit there with a blank
screen and not even show the Windows XP startup screen (the computer is a PC running Windows XP SP2).
They discovered through persistence that if they turned the power back off, then on again, and repeated this
about 5 times on average, the system would suddenly come to life and boot up correctly.
Once the system was started, it kept running properly until it was turned off, at which point the
intermittent boot problem would reappear at the next power-up.
Wrong Diagnosis
I observed that when power was applied, the computer displayed the BIOS self-test screen, apparently correctly,
then beeped once and the screen went blank, and stayed blank without any other activity.
At those times when it did start correctly, after the beep there was a short pause and then the XP boot logo
appeared, and the system started normally.
The single beep marks the end of POST, after which the BIOS hands over to the boot device, so I assumed there was
some problem reading or finding the disk.
The PC is attached to an expensive imaging system, and is part of the turnkey system with the rest of
the equipment.
The users needed the imaging system for a series of important daily experiments over the next few days,
and given the number of add-on cards and attached cables, it did not seem safe to open and dismantle the
computer case.
Moreover, the PC kept running once it was started, so I advised them to just leave the power on
overnight.
The diagnosis seemed simple, because I had seen similar symptoms a few times in the past.
It must be that the hard drive was not spinning up properly, due to failing bearings or motor, and
just by chance it would spin up correctly about every fifth attempt or so.
Checking the disk properties (in Device Manager) showed the hard drive (there was only one) to be
a 120 GB Seagate Barracuda (model ST3120023A), about 5 years old. It was attached to the IDE/ATA controller
as the only (master) drive.
This worried me a little bit, because all previous cases of this type I had seen had been with Western Digital
brand drives (mostly the Caviar series), but it was possible that any drive could develop this problem.
Again, it was not easy to open the case and hold the drive in my hand to confirm the diagnosis, so I
told the users that I was about 90% sure the hard drive was failing and needed to be replaced.
Since the imaging system uses very specialized software which is difficult or expensive to reinstall, I
suggested that we purchase a new identically sized drive, and make a mirror image of the existing drive
onto the new drive, then replace the failing drive with the new one.
No data or programs would need to be reinstalled, and the failing motor problem would be fixed.
The users agreed, and a new disk, Seagate Barracuda 120 GB model ST3120814A, was ordered for about $60.
Attempted Solutions
A couple of days ago the users were done with the experiments, so I turned off the imaging computer,
removed the hard drive, and made a mirror copy of it using Norton Ghost 2003 on my work computer.
Making disk mirrors with Norton Ghost 2003 is not without risk, but I will save that for
a future post, and only note that the source "failing" disk spun up on the very first try, and a mirror
was successfully made to the new "good" disk in a little over two hours.
The next day I went back and installed the new disk in place of the old, being careful to set jumpers
correctly and making sure no cables were knocked loose.
When I turned it on, however, the system would not boot, and showed the same blank screen as before.
I turned the power off and on a few times, and on about the third attempt the
system suddenly sprang to life and booted up normally.
Turning it off and on a few more times confirmed my suspicion: there was no real change, and the
startup was just as intermittent as before the disk was replaced!
This was disappointing, and I looked for a possible reason in the BIOS setup menus.
Changing the boot order (i.e. moving the CD-ROM before the disk, or making the hard drive the first boot
device) made no difference.
I toggled the "Slow Boot" setting between Enabled and Disabled, but the only difference this seemed to make
was to add about 30 seconds to the BIOS self-test while a memory test was done.
The hard drive and CD-ROM were correctly identified in the BIOS each time, even when the boot would fail.
I did notice that if I made a small change in the BIOS and then used the F10 key to "Save the changes", the
system booted correctly each of those times, while if I quit the BIOS setup with the Esc key, the boot
process was likely to fail.
At the time this could have been coincidence, but later attempts confirmed it was always the case.
Suspecting a power supply problem, I consulted Dan Y. and Dave M. from the Medical Electronics Lab
(http://www.mel.wisc.edu/).
Between them they can solve just about any problem,
plus they own a portable scope and a multimeter, and are always ready
to help.
We opened the case again, and they connected the scope and meter to the power supply, but turning the
computer off/on simply confirmed that there was nothing wrong with the power supply.
All voltages were normal, and there were no dropouts on the oscilloscope.
We tried enabling the "Hard drive pre-delay" in the BIOS, but setting it to 9 seconds made no difference.
During these attempts we noted the motherboard model number, Intel D850EMV2, and the BIOS revision level,
MV85010A.86A.0057.P20.0210251634.
We also confirmed that the system was sure to start correctly any time we made even a small change to
the BIOS and then used F10 to save the changes.
This made us suspect a problem with the BIOS, possibly a weak battery, but the clock was always accurate, and
none of the other BIOS parameters seemed wrong, so that did not seem very likely.
The solution
While we were debating what to do next, Dan (or was it Dave?) noticed a small box sitting behind the PC and
observed that it was an external USB drive that was spinning.
It turned out to be a LaCie 2.5" USB hard drive, powered by the USB cable alone.
There was a chance that an external drive without its own power supply could overload the motherboard, so we unplugged
the external drive and tried to boot again.
This time the system booted on the very first try, and three successful boots in a row confirmed it: the
problem was fixed. It had been caused by the external USB drive all along!
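As a rough check on why three clean boots in a row is convincing, here is a small sketch. It assumes each power-on attempt had been succeeding independently with a probability of about 1 in 5 (from the "about 5 times on average" the users reported). The independence assumption and the function names are mine, purely for illustration.

```python
def expected_attempts(p_boot: float) -> float:
    """Average number of power cycles until the first successful boot,
    assuming every attempt succeeds independently with probability p_boot."""
    return 1.0 / p_boot


def chance_of_lucky_streak(p_boot: float, streak: int) -> float:
    """Probability of `streak` successful boots in a row happening by luck
    while the fault is still present."""
    return p_boot ** streak


if __name__ == "__main__":
    p = 1 / 5  # assumed success rate while the fault was present
    print(f"expected attempts per boot: {expected_attempts(p):.0f}")
    print(f"chance of 3 lucky boots in a row: {chance_of_lucky_streak(p, 3):.1%}")
```

Under those assumptions, three successes in a row would happen by luck less than 1% of the time, so unplugging the drive is very unlikely to have been a coincidence.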
Follow-up
Thinking that drawing power from the motherboard via the USB cable might be to blame, we connected the
external drive via a powered USB hub, but this simply brought back the original intermittent
boot problem, so that was not it.
The problem appears to be caused by a shortcoming in the BIOS.
If a USB drive is plugged in during power-up, the system hangs trying to access the USB device,
and fails to boot from the hard drive.
Using a web search, Dave found the following link:
http://downloadfinder.intel.com/...
which pertains to our Intel motherboard (model D850EMV2). The release notes for the first BIOS update on that
page are for BIOS update MV85010A.86A.0069.P25.0304170949, and they include the following in the list of improvements:
"Fixed issue where certain USB Mass Storage devices were hanging the system in POST."
This makes it even more likely that updating the BIOS would fix the problem, but since the present
workaround of leaving the USB drive unplugged during power-up works well, we will likely use that for now.
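To see at a glance how far behind the installed BIOS is, the two identifiers can be compared field by field. The sketch below assumes the ID has five dot-separated fields (board, OEM, build number, release tag, and a YYMMDDHHMM build stamp). That layout is my reading of the strings, not something stated in the release notes. The names `BiosId` and `parse_bios_id` are made up for this example, and the installed ID is written with the same board prefix as in the release notes.

```python
from datetime import datetime
from typing import NamedTuple


class BiosId(NamedTuple):
    board: str       # e.g. "MV85010A"
    oem: str         # e.g. "86A"
    build: int       # e.g. 57
    release: str     # e.g. "P20"
    built: datetime  # decoded from the assumed YYMMDDHHMM stamp


def parse_bios_id(text: str) -> BiosId:
    """Split a BIOS ID string into its (assumed) five dot-separated fields."""
    board, oem, build, release, stamp = text.strip().split(".")
    return BiosId(board, oem, int(build), release,
                  datetime.strptime(stamp, "%y%m%d%H%M"))


if __name__ == "__main__":
    installed = parse_bios_id("MV85010A.86A.0057.P20.0210251634")
    update = parse_bios_id("MV85010A.86A.0069.P25.0304170949")
    print("installed:", installed.release, installed.built.date())
    print("update:   ", update.release, update.built.date())
    print("update is newer:", update.build > installed.build)
```

If the stamp really is a build date, the installed BIOS dates from October 2002 and the update from April 2003.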
I can now see two places where I could have done things differently to arrive at the correct
solution faster.
(1) I made a diagnosis of the problem but failed to confirm it. When a
hard drive fails to spin up, this can be confirmed by holding the drive in your hand while power is applied.
I would normally do that, but skipped it here because the machine was needed for daily experiments and
the many cables and devices attached to it made access difficult.
(2) I failed to give more thought to the alternative
explanation, that some other device was hanging the boot process. I considered it briefly, but did
not follow up for two reasons: the users had not changed the hardware or software lately (or so I
thought), and the machine would start every fifth time or so, which is not typical when some other device is
the cause of the hang. I looked briefly at the back, but there were many cables plugged in (for
microscopes, amplifiers, etc.) and the users seemed reluctant to disconnect them for fear of making things
worse. Still, I failed to notice the small USB drive attached to the rear, evidently plugged in at about the same time
the problem began.
That is for next time.