IT Problem Information

Update #1

One of the main storage disks on helios has encountered problems and shifted into read-only mode.

The possibility of a persistent hardware problem and/or filesystem corruption requires a full filesystem integrity check. To perform that check, all services that read from or write to the affected disk must be shut down. This includes most email and some web services.

Recovery of the filesystem journal occurred without incident. A full filesystem check is now in progress. It may take several hours.

My apologies for the inconvenience.

Steven Butterworth, Physics Computing Services


Update #2 (Tue Dec 20, 21:43)

The filesystem check completed at approximately 20:25. The filesystem was recovered with some very minor corruption.

Unfortunately, I have been unable to restore service. The RAID controller cache battery has apparently died. In principle, this should force us to move from the higher-performance write-back mode to lower-performance write-through mode, with the added penalty of a higher probability of filesystem corruption in the event of a power failure. However, it appears to be forcing the filesystem to be exported as read-only at the RAID controller layer, prior to the operating system having the ability to make any decisions on the matter.

If I cannot find a way to convince the RAID system to allow us to write to it, the only option will be to copy all of the data elsewhere and then mount that filesystem back onto helios.

The only bright spot in this is that I hooked up our newly acquired storage arrays this afternoon in order to permit me to work on them from home over the break. It appears we may be forced to bring them into service without the desired amount of testing or deployment planning.

If we are forced to go to that point, I will restore service for the functioning filesystem (representing perhaps 25% of the department membership) while I copy the data over to the new storage server. The people on that server will be stuck in read-only mode until completion of the data migration sometime tomorrow.

Update #3 (Tue Dec 20, 22:10)

Core email services are functioning again. The /home filesystem now appears to be writable and email is being delivered again. Since I am not sure what prompted the original switch to read-only mode, we are definitely operating in low-confidence mode at the moment.

Since the near-term plan is to move all user data over to the new storage nodes anyway, that migration will begin immediately.