

After three years of operation Horseshoe 9 will be powered down July 3, 2017.
All disk systems used with Horseshoe, i.e., wkspaceX and home directories, have also reached their end of life.
Starting September 1, 2017 all /wkspaceX directories will stop accepting writes, and by September 29, all disk systems will be powered down, and the frontends closed down.
Any data left on the Horseshoe system after September 29 will be lost!
For quite some time, the HS9 GPU nodes have occasionally died under heavy load due to a hardware error. In an effort to debug this we have reserved four GPU nodes since the beginning of September. Everything should now be working - let us know if you find any issues.
/wkspace1 and /wkspace4 have reached their end of life.
Starting November 1, 2015 both file servers will stop accepting writes, and December 1, 2015 they will be powered down.
All data left on the servers after December 1, 2015 will be LOST.
Please move all your data to the new file server /wkspace5.
After serving us for exactly 3 years, horseshoe8 will be powered down October 31, 2015.
A new file server has been made available as /wkspace5. The file server is configured with the following quotas: sdufs (15T), sduhjj (15T), sduom (15T), sdusr (5T), and dcscfr (5T).
It is now possible to apply for access to Abacus for the next allocation period starting November 1, 2015. Deadline October 1, 2015. For further information, see https://deic.sdu.dk/call-for-access
9:30: fe9 is currently very unstable and reboots itself every 15-60 minutes. 12:30: Fixed.
Version 15.0.1 of the Intel compiler suite is now available. It will become the default Intel compiler in January 2015, if no issues are found.
module purge ; module add intelcc/2015.1
December 2, 2014, 7:45 Horseshoe6 and Horseshoe7 were turned off for the last time.
December 1, 2014, Horseshoe6 and Horseshoe7 will be powered down for the last time to make our server room ready for the coming national HPC. All running jobs will be closed down.
Tuesday October 28, 2014 some of the critical infrastructure components in the cluster will be moved to a new location to make room for the coming national HPC.

Access to the following filesystems will be unavailable for a (hopefully) short period of time between 9:00 and 12:00 during the relocation:

  • /home/disk2 (home directories)
  • /wkspace4
  • /wkspace1

Horseshoe6, 8 and 9 will be kept alive, but jobs may fail if they depend on one or more of the affected filesystems.

Horseshoe7 will be powered down during the relocation and is unavailable between 11:00-16:00. All active jobs on HS7 will be terminated.

9:45: HS9 is currently down. We expect it to be up again very soon.
13:15: All nodes are online again.
9:45: qserver (hs9) is currently down due to file system issues. We expect it to be up again very soon.
13:00: qserver is up and running again
Horseshoe 9 is now fully operational. Happy number crunching.
All nodes including frontends were down today due to a partial power outage at SDU.
2014-07-04: Parts of the InfiniBand equipment did not restart properly after the power outage. This is fixed now.
Horseshoe 9 is now (almost) ready for production. Most things have been set up, but there are still issues to be solved before declaring hs9 completely up and running.
You can start using the cluster. If you find any issues that we need to solve, please let us know at admin@dcsc.sdu.dk. Also note that we initially reserve the right to restart nodes etc. if needed for testing the system.
The express node on hs7 has been updated to our newest image, including CUDA 6.0. The remaining hs7 nodes will be upgraded soon.
New modules available: amber/14.2014.04, gcc/4.9.0, and intelcc/2013.1.046. See http://dcsc.sdu.dk/docs/modules/ for further information.
Today, after serving us well for 5 years and 25 days (96222 jobs), Horseshoe5 has been powered down.
The hardware will be dismantled and removed, to make room for our new Horseshoe9 cluster.
The express node on fe7 has been upgraded to CUDA 5.5.
9:00-10:00: A failure in our virtual environment caused downtime on most of our infrastructure servers. Cluster nodes and fileservers were not interrupted by this error, but queue management was down. All systems are now back to normal operation.
New versions of several modules are available: gromacs/4.6.5, numpy/1.8.0, etc. See http://dcsc.sdu.dk/docs/modules/ for further information.
15:50-18:30: All nodes including fe5-fe8 are down due to a major power outage.
We are currently experiencing degraded performance on disk4. disk4 is currently rebuilding due to disk failure. The rebuild is now done.
New software available: gcc/4.8.2, openmpi/1.7.3, gaussian/09-D01-2013, and gv/5.0.8.
See http://dcsc.sdu.dk/docs/modules/ for further information.
The newest version of the AMD Core Math Library (ACML) has been added to our Modules standard software - see http://dcsc.sdu.dk/docs/modules/ for further information.
20:00 - 22:30: The fileserver disk2, hosting all users' home folders, was down again.
12:30 - 13:30: The fileserver disk2, hosting all users' home folders, was down.
The frontend fe.dcsc.sdu.dk has been upgraded with new hardware and software.
13:15: We have had a 2-second power outage, which caused 1/3 of the nodes to reboot.
15:45: /wkspace1 is up again.
11:30: /wkspace1 is down. The fileserver has crashed.
The system will be online again as soon as fsck has finished.
The Horseshoe 6 upgrade is now successfully completed, and all nodes are operational again.
Software upgrade on Horseshoe 6.
Same software versions as installed on HS8, HS7 and HS5 will be installed on HS6.
FE6 will be down from 11:45-12:15.
All jobs on HS6 started after 12:00 at 2013-05-29 will be executed on nodes with new software.
The Horseshoe 5 upgrade is now successfully completed.
16 Horseshoe 6 nodes powered down.
Due to cooling constraints we have been forced to power down part of HS6. We expect to have normal cooling again within 3-4 weeks.
Software upgrade on Horseshoe 5.
Same software versions as installed on HS7 and 8 will be installed on HS5. As of today all new jobs on HS5 will be executed on nodes with new software.
Software upgrade on Horseshoe 8.
Same software versions as we just installed on HS7 will be installed on HS8. As of today all new jobs on HS8 will be executed on nodes with new software.
The Horseshoe 7 upgrade is now successfully completed.
This week we will upgrade the software on Horseshoe 7.
CentOS 6.3, CUDA 5.0, OpenMPI 1.6.4, GCC 4.4.6 and OFED 3.5 are some of the software highlights. As of today all new jobs on HS7 will be executed on nodes with new software.
A malfunctioning ethernet switch servicing Horseshoe 7 has been identified and replaced this morning.
One of the Horseshoe5 racks has been shut down to make room (power/cooling) for the new Horseshoe8. The 40 old nodes will be used as spare parts for the remaining 67 HS5 nodes.
Procurement of Horseshoe generation 8 has finished.
A frontend node and 27 compute nodes will be installed August/September 2012. Each compute node will be equipped with two 8-core 2.4 GHz Intel Sandy Bridge CPUs and 2x600GB 15krpm disks; 6 nodes will have 128GB memory and 21 nodes 64GB memory. The vendor of the nodes is Dell.
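The aggregate size of the new cluster follows from the figures above. As a rough sketch (the variable names are ours, and the frontend node is not counted), the totals for the 27 compute nodes can be worked out like this:

```shell
#!/bin/sh
# Totals for the Horseshoe8 compute nodes, using only the figures
# from the announcement. Variable names are illustrative.
nodes=27
cpus_per_node=2
cores_per_cpu=8

# 27 nodes x 2 CPUs x 8 cores
cores=$(( nodes * cpus_per_node * cores_per_cpu ))

# 6 nodes with 128 GB plus 21 nodes with 64 GB
mem_gb=$(( 6 * 128 + 21 * 64 ))

echo "compute cores: $cores"
echo "total memory:  $mem_gb GB"
```

This gives 432 cores and 2112 GB of memory across the compute nodes.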
22:00: /wkspace4 has been down since 15:30 yesterday. The fileserver has been power cycled and the filesystem is online again.
08:30: Part of Campus suffered a power failure. The clusters powered off at around 4:56 due to the missing power.
Most systems were back to normal operation at 8:30
09:15: /wkspace1 is up and running. All systems are back to normal.
19:00: Most of the systems are online again. The rest ("/wkspace1" and some less important systems) will be fixed Monday morning.
18:00: Part of Campus suffered a power failure due to a cable failure near Campus. The cluster powered off at around 18:00 due to the missing power.
Today the entire Horseshoe installation was down from 8:00 to 13:00 due to planned maintenance of a main power distribution panel.
Everything went as planned and all systems are up and running again. While the power was out anyway we prepared our own power distribution panel for the new Horseshoe7 installation.
DCSC/SDU has finished the procurement of hardware for the new Horseshoe7 cluster and we are happy to announce that we have chosen Fujitsu as our supplier of:
  • 11 compute nodes with Nvidia Fermi GPUs and Infiniband interconnect.
  • 1 test and optimisation node with Nvidia Fermi GPUs and Infiniband interconnect.
  • 1 frontend server with Infiniband interconnect.
Details about the equipment:
  • All nodes have two hexcore Intel X5670 2.93 GHz CPUs, 48 GB memory and 4 TB disk.
  • The compute nodes and the test node are equipped with two Nvidia C2070 GPUs each.
  • Hardware installation, racks, switches, cables etc. are included in the agreement.
The Horseshoe7 cluster is scheduled to be operational within Q1 2011.
The frontend server "fe.dcsc.sdu.dk" was down from 13:00 - 14:30 due to problems with the filesystem.
Update: We took advantage of the power outage this Tuesday and finished the reconfiguration of the power panel while the cluster was down anyway.
Due to a reconfiguration of a main electrical distribution panel near our cluster facility, the entire Horseshoe installation will be without power Thursday 21st October from 8:00 to 12:00.
To ensure a safe shutdown of the cluster, all access to the Horseshoe will be stopped at 7:45 Thursday 21st October, and all running jobs will be terminated. We have been promised that power will be restored at 12:00 the same day at the latest. As soon as power has been restored we will resume normal operation.
This morning at 9:32 the entire campus suffered a complete power outage due to a problem in the main power distribution system. Power was restored around 45 minutes later.
We used this unplanned downtime to fix the reconfiguration of the power distribution panel planned for Thursday, so the shutdown planned for Thursday morning has been canceled.
All systems were back in normal operation Wednesday at 16:00.
At 4 o'clock this morning we had a disk failure on one of the main infrastructure servers.
Around 9 o'clock the error made the webserver, DHCP server, NIS server and both the queue servers stop.
Because the DHCP server was down, many nodes lost their ethernet IP addresses.
Normal operation restored around 13:00.
Horseshoe5 is ready for business again. All hardware has been transferred to the new datacenter, and all the nodes have been reinstalled with software similar to Horseshoe6.
The migration of /people/disk2 has finished and the fileserver is available again. Horseshoe6 is ready for business again.
Today we are migrating all data to the new fileserver. As announced by email, we have shut down the entire Horseshoe for the rest of the day to avoid data loss.
Horseshoe5 has to be moved to a new datacenter. As announced by email, Horseshoe 5 will be down in the period 26-30 April. When HS5 starts up in the datacenter the nodes will be installed with software similar to the Horseshoe6 software.
The new Horseshoe6 cluster is now open for business.
Please use fe6.dcsc.sdu.dk as frontend to the new cluster. fe6 is binary compatible with the compute nodes. Please be patient and report any problems you might experience.
DCSC/SDU has finished the procurement, and has chosen IBM as supplier of the Horseshoe expansion:
264 nodes with Infiniband QDR interconnect.
  • 240 nodes have two quadcore Intel Nehalem 2.66 GHz CPUs, 24 GB memory and 500 GB disk.
  • 24 nodes have two quadcore Intel Nehalem 2.66 GHz CPUs, 48 GB memory and 2 TB disk.
As announced by email the Horseshoe4 cluster has reached its end of life.
All running jobs will be allowed to finish, but as of today no new jobs will be started. Queued or blocked jobs will be deleted.
Update: The controller in the raidbox is broken and has corrupted all data on the disk. The data on the /wkspace2 partition is unfortunately lost. We will try to find new hardware to construct a replacement for /wkspace2.
It looks like the filesystem is in a very bad condition. It might be necessary to wipe it and start from scratch.
The /wkspace2 filesystem is behaving badly again.
The filesystem has been taken down for maintenance.
The /wkspace2 filesystem has been down again. The filesystem has been repaired and the fileserver was restarted.
The /wkspace2 filesystem has been down. The filesystem has been repaired and the fileserver was restarted.
We had a power failure on a part of the Horseshoe4 cluster, which caused "Switch left" and 1/3 of the left-nodes to power down.
The /wkspace1 filesystem is down. We are currently trying to copy the data to a healthy disk.
2009-06-29: /wkspace1 is online again.
The new Horseshoe5 cluster is now open for business.
Please use fe2.dcsc.sdu.dk as frontend to the new cluster. fe2 is binary compatible with the compute nodes.
Since the cluster more or less is starting from scratch (we are now using Infiniband as primary interconnect and CentOS 5.2 as the OS) please be patient and report any problems you might experience.
The website is under reconstruction.
The information on specific installations has been moved to resources along with its corresponding statistics. New system statistics graphs will be available soon from the main menu.
12:00: The connection from the frontend to parts of the cluster was temporarily down for a short while.
16:00 - 18:15: We have had some stability problems with the network, but the network is back to normal now.
20:00: The filesystem /people/disk2 had some errors and needed a check, but it is up and running again.
The mirror will be rebuilding for the next 10 hours. In that period of time the performance will be slightly degraded.
18:00: The fileserver Disk2 is down.
This includes the filesystems "/people/disk2" and "/wkspace2".
We will look into the problem ASAP.
Disk6 is up again.
Disk6 is down.
Disk6 is down the rest of the day for maintenance.
We welcome two new usergroups to the Horseshoe: sdufs and sduhjj.
The users in the sdufj group are as of today no longer allowed to execute jobs on the nodes.
Some of the sdufj users have been transferred to the sduhjj group.
The frontend server was restarted. We had to replace a failing memory module.
The new DCSC design was applied to the website.
The server "disk6" will be upgraded to OpenSuSE 10.3 and extra storage capacity will be added.
2008-01-24 16:30 The upgrade has finished. Selected users in the SDUOM group now have access to 3TB new storage on "/wkspace6b".
Today we will replace a lot of the old harddrives. We hope not to disturb any users...
13:00 - "/people/disk1" (KVL users' homedir) is down. One of the old harddrives failed during the cloning of the drive.
13:14 - Only one of the disksets in the mirror failed. The filesystem is online again, but for now without double redundancy (still RAID5).
17:45 - We had to restart the fileservers. Due to this, all homedirectories were unavailable for about 5 minutes around 17:30 today.
18:00 - "/people/disk1" and "/people/disk2" are rebuilding their mirrors. Expect degraded performance until tomorrow morning.
23:59 - Two nodes had stale NFS handles after the reboot. The problem was fixed at 23:45.
To reduce fairshare fluctuations it has been decided to limit the number of jobs in the queue per user to 250.
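To see how close you are to the limit, the jobs listed for your username can be counted from the qstat output. The following is only a sketch: count_user_jobs is a hypothetical helper name, and it assumes the default qstat column layout in which the third column holds the username (long usernames may be truncated in that listing).

```shell
#!/bin/sh
# Count the jobs belonging to one user in qstat-style output read from stdin.
# Assumption: the third whitespace-separated column is the username,
# as in the default "Job id / Name / User / Time Use / S / Queue" listing.
# count_user_jobs is a hypothetical helper, not part of TORQUE.
count_user_jobs() {
    awk -v user="$1" '$3 == user { n++ } END { print n + 0 }'
}

# Typical use on the frontend (not executed here):
#   qstat | count_user_jobs "$USER"
```

The count includes both running and queued jobs, so it is an upper bound on what is actually waiting in the queue.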
16:15 Problem solved. Access to "/people/disk2" is back to normal.
14:30 There seems to be a problem with "/people/disk2". We are looking into the problem.
The NFS-service on Disk1 died. After a server restart normal operation was restored.
Some of the queue tools did not respond on the frontend. A server restart solved the problem.
12:52 The failing disk has been replaced and the filesystem is available again.
12:29 /people/disk2 is down due to a disk error.
12:25 The problem was solved by powercycling the switch stack.
11:43 The switch serving nodes s41p01 to s41p24 lost its connection to the switch stack.
16:50 All nodes except one are up. The last node awaits a visit by a Dell technician. Case closed.
15:50 Most of the nodes are online again, only 8 nodes remain down.
14:40 It seems like we were hit by a very short power outage. 71 nodes powered down as a result of the power failure.
The second check of the filesystem /people/disk1 finished earlier than expected.
The filesystem is now back in service.
The problem with /people/disk1 requires an additional filesystem check.
We expect the check to be completed Thursday 2007-08-23.
We have a problem with /people/disk1, with the result that kvljg affiliated users at present cannot connect to their home directories or run jobs.
It is necessary to run a filesystem check, which probably will take all the weekend to complete because of the large number of files in the filesystem.
12:00 Disk1 is back online. The network interface died and locked up the system. A new network interface has been installed.
10:30 The server Disk1, which hosts the partitions /people/disk1 and /wkspace1, is down.
13:40 The filesystem /wkspace1 is back online. All important files were restored.
The filesystem /wkspace1 is down. We are experiencing multiple disk errors on the RAID tower, on which /wkspace1 is placed. We'll salvage as much data from the partition as possible, and rebuild /wkspace1 with new hard drives.
We expect /wkspace1 to be unavailable for approximately one week.
13:40 The filesystem /wkspace1 is back online. A new SCSI cable has been installed.
11:50 The filesystem /wkspace1 is down again. It looks like a bad SCSI controller or a faulty cable.
The filesystem /wkspace1 is back online after a minor breakdown in the SCSI communication.
13:30 /people/disk2 is online again.
13:15 /people/disk2 is down.
22:00 /people/disk2 is online again.
18:00 /people/disk2 is down.
/people/disk1 is now available again.
The cluster has been restarted, and is 'open for business'. We do have a problem with /people/disk1, with the result that kvljg affiliated users at present cannot run jobs.
At about 6:09 this morning the cluster was hit by a powerfailure. As soon as weather (and road conditions) permit, the cluster will be restarted.
A new FAQ web page is now available with a couple of recommendations for how to use the new multi-core nodes more effectively. Please visit: Jobs on multi-core machines.
Also visit the new Nodeload web page where jobs can be inspected for efficient use of the nodes.
The Horseshoe is now open for business. Please use fe4.dcsc.sdu.dk as the new frontend to compile and manage your jobs. fe4 is binary compatible with the compute nodes. fe1 and fe2 will be turned off.
Since the cluster more or less is starting from scratch (we are now using OpenSuSE 10.1 as the OS, the version of TORQUE/MAUI for queue management is also a bit different) please be patient and report any problems you may find when using the system. The queue commands (qsub, qdel, and qstat) should behave as we are used to.
Please observe that your application will benefit from a recompilation since the CPUs on the compute nodes are Intel Woodcrest, in the family of EM64T Intel CPUs. We'll send out more information about this shortly.
The RAID system /people/disk2 has been repaired - it's now possible to log on to the old frontend machines (fe1 and fe2) - and to the new frontend fe4.
The new frontend (fe4) has the newest versions of the Intel, Portland Group, and GNU compilers to create binaries compatible with the x86_64 architecture of our new compute nodes.
We expect to open the Horseshoe for normal business on Monday 8/21 at 10:00.
We have some outstanding issues with the fileserver called disk2 which we expect to be resolved before Monday. As soon as /people/disk2 is available again we'll post a notice on this web-site.
The Horseshoe will, until Monday morning, be in use for producing the best possible Linpack score for a listing on top500.
Today we finished mounting the new servers in the racks and reconnecting the network. We still have to adjust BIOS settings and install an operating system on the 200 servers. When this has been accomplished we'll attempt to get Horseshoe listed on Top500, before we are back in normal operation.
We hope to take delivery of 200 servers from Dell Monday 8/7.
The racks and electric installations are almost in place. In a couple of days we hope to announce a date for normal resumption of service on the Horseshoe cluster.
Horseshoe remodeling:
giga2 will close on 31/7/06 08:00. A computer broker will come to collect the PCs.
Horseshoe will reopen in the middle of August with 200 quad-core (Woodcrest) servers supplied by Dell.
Until the reopening the fileservers should be on-line most of the time.
15:30: The Horseshoe cluster should be ready for business again.
Yesterday the SDU campus suffered a total loss of electricity; consequently all nodes and servers stopped. Since we still have an unstable power supply situation, the nodes, disk4, disk5, disk6, and disk7 will remain powered down for the time being.
On Thursday 6/1/06 at 08:00 the nodes constituting the workq and giga queues will be turned off. After that time only the SDU and KVL based groups will have access to the Horseshoe computing resources. Users belonging to other groups can still log in and retrieve files.
The filesystem on /people/disk2 has developed a problem requiring a thorough testing of the integrity. This should be completed before 20:00 local Danish time.
20:30. The testing (and repair) of /people/disk2 terminated at about 20:50. After restarting the fileserver it became clear that the nodes had invalid (NFS) mountpoints, thus necessitating a reboot of all nodes.
A leak has been found in the compressor's coolant loop - we don't have an ETA for the repairs - since tomorrow is a holiday in Denmark, we do not expect any news until Monday next week.
The cooling compressor has lost its supply of coolant. We expect the compressor to be back in operation at about 14:00.
At 19:00 we experienced a power failure in the room housing the giga2 machines and the fileservers disk4,..,disk7. Most of the systems are affected. Facilities management will be notified tomorrow morning.
disk2 had NFS problems. Restarted.
For reasons which cannot be determined, disk2 dropped the filesystem for /people/disk2 tonight.
The filesystem has been repaired and is online again.
23:00: The fileserver disk2 has been unstable since about 18:00 this evening, probably due to the dropout of one of the raidtowers supporting /people/disk2. The problem will be attended to ASAP.
The migration to TORQUE has (more or less) been completed. We still have some nodes which for (hardware-) technical reasons have to be "debugged".
TORQUE migration update: As of noon we have completed the migration in queues giga and giga2. We have migrated 338 nodes in workq, and are still waiting for 59 old PBSPro jobs to terminate.
TORQUE migration update: As of noon we have migrated 171 nodes in workq, 136 nodes in giga, and 260 in giga2 to TORQUE management. Thus all functional nodes in giga have been dealt with. We still have some nodes in giga2, which for (hardware-)technical reasons cannot be migrated.
As of noon we have migrated 43 nodes in workq, 64 nodes in giga, and 97 nodes in giga2 to TORQUE management. Nodes will be added as the old PBSPro jobs terminate. Please remember that standard output and standard error (the .o and .e files) will be delivered when the old job terminates, but an e-mail will not be sent.
Hence you should check http://www.dcsc.sdu.dk/docs/load/PBSPRO/expanded_queue_info.php once in a while to remember: a) which jobs you had running and b) which jobs you had queued as of yesterday @ 13:00.
Job scripts (and job control files) for old jobs can be reclaimed here: Scripts
Migration to TORQUE queuing system on 10/10/05
The license for the PBSPro queuing system used on The Horseshoe is about to expire. Until now we have been granted a free academic license by Altair Engineering. However, Altair has changed its policy and is now asking us for a large fee to maintain the license.
This is not feasible for us, so on Monday 10/10 @ 13:00 we'll migrate to the open-source TORQUE queuing system in order to maintain a similar user interface. In that respect we are following in the footsteps of many large academic computing centers formerly using PBSPro. In fact the TORQUE versions of the 'qsub', 'qstat', and 'qdel' commands are (almost) identical to their PBSPro counterparts.
The plan for the migration is as follows:
  1. Running jobs at 13:00 Monday 10/10 will be allowed to finish. However, there will be no e-mails from the system upon job termination, if this was requested at time of job-submission.
  2. Jobs in the queued state at 13:00 Monday 10/10 CANNOT BE MIGRATED to TORQUE, since the format of the job-control files is too different. This unfortunately implies that queued jobs have to be resubmitted.
  3. When PBSPro jobs terminate the nodes allocated to them will be placed under TORQUE control, which implies that at about 15:00 Wednesday 12/10 all nodes in 'giga' and 'giga2' will have been migrated to TORQUE, and at the latest Tuesday 18/10 the last of the 'workq' nodes will be under TORQUE control.
  4. Our scheduler, MAUI, will migrate seamlessly to use TORQUE such that the fairshare information at the time of changeover will be preserved.
To help figure out which jobs are running or queued at the time PBSPro is "turned off", the "Expanded queue information" webpage (expanded_queue_info.php) will be saved here:
For each queued job we'll also copy the submitted jobscript to
where they will be listed as job-id.infra.SC. This will hopefully be of help to you when resubmitting jobs or submitting new jobs.
We realize this is a fairly short notice, and that the migration is not entirely smooth, due to the fact that we cannot migrate queued PBSPro jobs to TORQUE. However, we hope that the migration to TORQUE can happen without too many problems.
The upgrade of the raidtowers has been concluded successfully, albeit our timetable didn't hold (we finished about 3 hours later than planned). We hope that the new raidtowers will prove more stable, and that the additional diskspace will find its use.
Upgrade on 12/9/05 - deployment of new raidtowers
We have purchased new raidtowers to replace the 2 AT1600 systems, which have caused us quite a bit of headache, and one X3i system, as its 3-year warranty period has expired.
In order to perform the upgrade we need to bring all running jobs to a halt, to ensure consistent filesystems while the raidtowers are replaced. The bad news is that we need about 3 hours to complete the installation, the good news is that we can enlarge the following partitions:
/people/disk2 from 1,92TB to about 2,33TB
/people/disk1 from 1,50TB to about 2,33TB
/wkspace2 from 1,92TB to about 3,26TB
/wkspace1 will not receive an upgrade this time around as we'll continue to use the old X3i tower (and use the just replaced X3i as a spare parts repository, until a replacement can be funded - hopefully soon).
The plan is thus to shut down the cluster from 09:00 until approximately noon on 12/9/05. HOWEVER, ALL RUNNING JOBS ON 12/9/05 AT 9AM WILL BE FLUSHED OUT OF THE QUEUES. The queues will remain open to receive newly submitted jobs - but if you think that a queued job should NOT start until after the upgrade of the filesystems, it can be put on hold with the command:
rsh infra /usr/pbs/bin/qhold -h u job-id
and later released with:
rsh infra /usr/pbs/bin/qrls -h u job-id
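If you have several queued jobs, the same command can be generated for each of them. The sketch below only prints the commands so they can be checked before being run; pbs_hold_cmds is a hypothetical helper name and the job ids are placeholders, while the host and paths are the ones given above.

```shell
#!/bin/sh
# Dry-run sketch: print one hold or release command per job id.
# pbs_hold_cmds is a hypothetical helper; it runs nothing itself.
pbs_hold_cmds() {
    action="$1"          # "qhold" to hold, "qrls" to release
    shift
    for jobid in "$@"; do
        echo "rsh infra /usr/pbs/bin/$action -h u $jobid"
    done
}

# Placeholder job ids, for illustration only:
pbs_hold_cmds qhold 1001 1002
```

Once the printed lines look right they can be pasted into the shell, or the output piped to sh. Calling the helper with qrls instead of qhold produces the matching release commands.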
The giga2 is online again.
The situation is not resolved but will be further investigated when personnel return from summer vacation.
Again we lost a phase in one of our 63A feeds for the giga2 machines. About 33% of the machines were without power for so long that the power supplies gave up. The electricians have been informed.
The giga2 queue is suspended until this has been investigated. When that will happen is unknown as of this late hour (16:45).
To correct the "missing T-phase" issue all machines in giga2 will have to be shut off.
This will happen (still unconfirmed) Wednesday July 13 shortly after noon.

The giga2 queue is currently suspended and the currently running jobs will have sufficient time to terminate normally.
giga2 will be online again Wednesday afternoon.
It turns out that the "blown fuses" below is a missing T-phase on 14 out of 20 groups.
The repair necessitates a full shutdown of all electricity in the room covering giga2 including the servers, disk4-7, positioned in the same room.
The necessary spare-parts are not at hand but we have been informed that the electricians will be able to perform the repair Wednesday afternoon. This still has to be confirmed though.
Users will be notified by email.
Between 13:43:12 and 13:48:29 we had no power in the room holding the giga2 machines.
Not all machines are up due to blown fuses. This will be corrected as soon as possible.
Unfortunately the RAID-tower housing /wkspace1 has suffered the loss of 2 disks simultaneously - since the partition is a RAID-5 device, this implies the total loss of data.
We are in contact with the vendor to have the faulty disks replaced as soon as possible.
08:05: The RAID-tower housing /wkspace1 has crashed, due to a faulty disk during reconstruction of the RAID5 partition (brought on by another faulty disk). We'll initiate repairs as soon as possible.
23:05: The filesystem check of /people/disk2 has finally terminated. It appears that no files have been lost. The filesystem is thus again available for use.
09:55: Once again we lost one of the Fibrenetix RAID boxes due to a simple disk failure. In order to stabilize the SCSI bus we have restarted disk2, but unfortunately the boot process has seen some errors in the /people/disk2 filesystem and is therefore currently running the check program. This will take some hours.
Due to an incident where infra got unstable we had to perform a hard reboot. Unfortunately we lost all running jobs in the process.
Continuation of the cluster.
DCSC has approved our application for extending the lifetime of the initial pool of 512 machines to 1/6-2006. It should be noted, however, that the warranty expires 1/8-2005, and no hardware replacements will be done after that date. Machines that die after that date will be cannibalized to keep as many machines running as possible.
Odense went black for 20 minutes
but we only had a power surge between 7:21:17 and 7:21:28.
Nevertheless we lost all nodes. All servers stayed online due to UPSes.
The raidsystem housing /people/disk2 has lost a hard-disk, this triggered a SCSI error on the server (disk2), such that the partition is unavailable. The raidsystem is currently (at 17:00) rebuilding the raid device - we have to wait for this to complete before we can verify the filesystem, and release /people/disk2 for use. We expect this to finish at the latest 11:00 11/12/2004.
23:00: /people/disk2 is back on-line again. Currently the RAID-1 stabilization of /people/disk2 is being regenerated which takes some performance.
The Portland Group Compiler has been upgraded to version 5.2.
It has been arranged with SDU physical facilities that the faulty fuse will be changed Friday October 15 at 14:00.
Looks like we have been hit by a power failure again - a main circuit which supplies power to 14 power panels in giga2 has switched off. About 60 nodes are down due to this problem. We may have to drain giga2 in order to repair the circuitry.
Update on powerfailure situation. To repair the broken main fuse we have to cut power to all giga2 nodes. All giga2 nodes have thus been offlined - and the repairs will commence as soon as giga2 has been drained for running jobs.
There is a problem with some of the electrical installations. This evening the cluster lost its network connection since the media converter from fiber to ethernet had its power supply connected to one of the power strips with a bad fuse - this has been corrected. However nodes on switch01, switch02, switch03, switch06, and switch07 are affected by the power failure.
We have upgraded the NFS server software on disk2 - we hope this will solve the problem with the unstable NFS service. However, we still have to reboot about 200 nodes to get rid of hanging processes.
We have during the last 24 hours (or so) experienced problems with NFS services from disk2. We are in the process of determining the cause of the problem.
IBM strongly recommends that we upgrade the BIOS of machines in giga2 to avoid some erratic behaviour like network interfaces disappearing, sudden stalls and power-offs. We have seen all these things happen and have therefore decided to offline all giga2 machines Sunday morning so that any job will have finished Tuesday morning. We can do the upgrade simply by rebooting the machines so it should not be a whole-day operation.
One of our key fileservers (disk2) had ceased to service NFS earlier today - it has been rebooted, but many nodes are affected by hanging processes and they must also be rebooted. We have identified the affected nodes (approx. 780 in number). They have been offlined, and will be rebooted and released back to the PBS queues as soon as possible.
Users meeting
The users meeting is Thursday 9/9 at 13.15 in kollokvierummet at the Department of Chemistry, SDU (same place as last time).
  1. The technical setup of the new 304 nodes ("giga2").
  2. Job scheduling and queue setup (workq, giga and giga2).
  3. Future for the 512 "workq" nodes.
  4. Misc.
Regarding point 3): The first 512 nodes were made operational 1/9-2002 and the funding for running these will expire 1/9-2005. Without any active measures, these computing resources will disappear by 1/9-2005, and the research groups behind the original application will lose their access to the system. There are three basic options:
  1. The 512 nodes disappear by 1/9-2005 and are not replaced.
  2. An application is made for funding for running the 512 nodes for e.g. another year (i.e. electric power and manpower).
  3. The groups behind the original proposal make individual applications or a common application for new hardware according to their estimated computational needs for the next 3 years.
Those pursuing options 2) or 3) should be aware of the 1/11-2004 application deadline to DCSC, i.e. applications must be sent in by this date to ensure a continuing supply of computational resources after 1/9-2005.
Queue giga2 is now open. LAM has been upgraded to version 7.0.6, and now uses the GNU compilers (gcc, g77, c++) as default compilers. Please read LAM-MPI.
Horseshoe reopens
The cluster is now partially open for business - there are a couple more details on the new machines in the 'giga2' queue to deal with, but 'workq' and 'giga' are available (jobs can be submitted to the 'giga2' queue, but will not start).
The web pages have to be updated to accommodate the new 'giga2' queue; in particular, the 'expanded queue info' will not be correct for a little while. Other web pages in the 'Docs' hierarchy also have to be updated and a few new how-to's added.
Summary of new features:
  1. All nodes and frontends have been upgraded to a new version of Linux (Debian Sarge). This MAY break some of your applications, because the fundamental system libraries have been upgraded. However, the upgrade was (very) necessary! The problem has to be dealt with by recompiling the application.
  2. PBS now supports more than 128 nodes per job - the limit has been set to 256 nodes in 'workq' and 'giga2'.
  3. giga2 (when open for business) has a 50 hour wall-clock limit as is the case for 'giga'.
  4. We are introducing a procedure for alleviating switch-fragmentation when scheduling multi-cpu jobs.
Due to the addition of nodes, the fairshare targets for the groups have been adjusted to reflect the new allocations on the cluster. The statistics for the last five fairshare windows have been removed since they cannot be rescaled to accommodate the new number of nodes.
Please remember that there is a users meeting Thursday 9/9 at 13.15 at SDU in kollokvierummet at the Department of Chemistry, where the scheduling strategy and other issues can be discussed.
21:30 - UPDATE on the cluster upgrade. The installation of the nodes with an upgraded version of Linux (Debian Sarge) has been completed. However, before we can start the PBS queueing system a few checks have to be performed. As soon as the cluster is ready we'll send out an e-mail.
The Horseshoe is being enlarged with 304 additional nodes. We have now received all 304 PCs (3.2 GHz P4, 1 and 2 GB RAM, 80 GB disk, and Gb Ethernet). The timetable for the merging of the old and new cluster is as follows:
25/8: All running jobs in giga, workq, and express are purged. Queued jobs, however, remain in the PBS database ready for the reopening.
27/8: The Horseshoe will be open for business again, now with an additional queue 'giga2'. All queued jobs are released and will be executed according to their priorities as usual.
This implies that jobs submitted from now until 25/8 cannot expect to complete unless they start with a walltime request that allows them to finish before 25/8.
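A minimal sketch of the arithmetic behind that rule, assuming the job starts the moment it is submitted and that the purge happens at 00:00 on 25/8 (the announcement gives neither a time of day nor a year; both, and the function name, are hypothetical here):

```python
from datetime import datetime, timedelta


def max_walltime_hours(start_time: datetime, purge_time: datetime) -> int:
    """Largest whole-hour walltime that still ends no later than the purge."""
    remaining = purge_time - start_time
    if remaining <= timedelta(0):
        return 0  # the purge has already happened, nothing can finish
    return int(remaining.total_seconds() // 3600)


# Assumed purge time: 25/8 at 00:00 (year 2004 is also an assumption).
purge = datetime(2004, 8, 25, 0, 0)
print(max_walltime_hours(datetime(2004, 8, 23, 12, 0), purge))  # prints 36
```

A job that sits in the queue before starting has less time available than this, so the result is an upper bound rather than a safe value.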
If you want a job submitted between now and 25/8 to remain queued until after 27/8, you can force it to remain queued by issuing the command:
'rsh infra /usr/pbs/bin/qhold -h u job-id'
These 'suspended' jobs can be located by using the 'qstat' command - the jobs will be in state 'H'. A 'suspended' job can be released to state 'Q' by using the command:
'rsh infra /usr/pbs/bin/qrls -h u job-id'
We'll use the new nodes and the time between now and 25/8 to evaluate new versions of our queuing software (PBSPro and MAUI/Moab), test new versions of Linux, and tune a version of Linpack for a benchmark run on all cluster nodes between 25/8 and 27/8 in order to place the Horseshoe on the Top500 list of Supercomputer installations (http://www.top500.org).
Once the enlarged Horseshoe reopens, the target %-shares for each user group will be adjusted to reflect the added capacity and we'll announce some new procedures when submitting multi-cpu jobs in order to alleviate the phenomenon of "switch-fragmentation", i.e. the allocation of few nodes on many switches for the job.
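To illustrate what "switch-fragmentation" means, here is a small sketch that counts how many switches a job's nodes are spread over and compares that with the fewest switches the job could have used. The node names, the node-to-switch map, and the figure of 4 nodes per switch are all made up for the example; they do not describe the real cluster layout or the scheduler's actual procedure.

```python
from collections import Counter


def switch_spread(allocated_nodes, node_to_switch):
    """Count how many of the job's nodes sit on each switch."""
    return Counter(node_to_switch[node] for node in allocated_nodes)


def is_fragmented(allocated_nodes, node_to_switch, nodes_per_switch):
    """True if the job touches more switches than it strictly needs."""
    switches_used = len(switch_spread(allocated_nodes, node_to_switch))
    fewest_possible = -(-len(allocated_nodes) // nodes_per_switch)  # ceiling division
    return switches_used > fewest_possible


# Hypothetical layout: n01-n04 on switch01, n05-n08 on switch02.
NODE_TO_SWITCH = {f"n{i:02d}": f"switch{(i - 1) // 4 + 1:02d}" for i in range(1, 9)}

compact = ["n01", "n02", "n03", "n04"]
scattered = ["n01", "n02", "n05", "n06"]
print(is_fragmented(compact, NODE_TO_SWITCH, 4))    # False
print(is_fragmented(scattered, NODE_TO_SWITCH, 4))  # True
```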
We would like to remind you of the users meeting at SDU (kollokvierummet at the Department of Chemistry) Thursday 9/9 at 13.15. We'll announce an agenda at a later time.
In the last 24 hours we have witnessed 2 network related events affecting The Horseshoe:
  • At about 8:40 in the morning yesterday we saw a tremendous surge in network traffic on the internal cluster network, which caused a kernel crash on about half the nodes - they had to be rebooted. We are not sure what the cause of the network problem was.

  • At about 15:45 yesterday the SDU Campus Network Programmer was forced to shut down all network traffic in and out of campus due to attacks against the SDU Active Directory structure. The network was reopened at about 10:45 today.
Infra (PBS and web server) has to have its motherboard replaced. The server will be down for about 1/2 hour at some point before lunch. The commands 'qsub', 'qstat', and other PBS related commands will not be available. The web pages will also not be updated for that period of time. Running jobs will continue for the duration of the shutdown.
/wkspace1 is again available. As previously announced we could not save the original data on /wkspace1.
Update on the situation with /wkspace1. Unfortunately we have to accept that the data on the partition is lost. We've tried for 5 days to recover the data - but during the process several additional disks have crashed. We are in contact with the vendor to replace the faulty disks ASAP. We'll bring /wkspace1 online as soon as the disks have been replaced (but without the old data).
The RAID-Tower housing /wkspace1 experienced a disk failure yesterday evening. The RAID-Tower responded by activating the 'hot spare' disk and initiating a rebuild of the RAID5 array - however, during the rebuild errors on multiple other disks have been encountered. We'll contact the vendor to obtain replacement disks. We are not sure of the integrity of data on /wkspace1.
The switch to which all front ends are attached started rebooting continuously at around 6:30 this morning, making it impossible for users to log onto the system to check jobs and submit new ones. Another switch has been replaced.
1 node jobs in queue giga. We have observed that at times the machines in the 'giga' queue have been somewhat underutilized. This is in part a consequence of the fact that the 'giga' queue only has 140 machines, and we allow up to 128 nodes per job. The limit of 4 nodes per job also inhibits an effective backfilling with small jobs.

As a test we have lifted the requirement of 4 nodes per job in 'giga' to see if this will improve the utilization. The 50 hour runtime limit remains.
During the Christmas holidays many of you have experienced very slow access to the filesystems attached to the server 'disk2'. The filesystems in question are '/home/disk2' and '/wkspace2'. This was caused by excessive usage of these filesystems by the running jobs. It is important to realize that the connection between nodes and the file-server is a resource which has to be used sparingly.
Please note that all jobs are required to use the local disk(s) on the node(s) allocated to the jobs, except for initial and final transfer of data. The example PBS scripts available on the website all show how to do this.
Disk2 is down. Late yesterday afternoon there was an event on the SCSI bus connecting our RAID-towers to disk2. We are in the process of scanning the filesystems, a process which should be done later this afternoon.
14:00: Disk2 is up again, and the cluster is ready for 'business'.
As a follow-up on yesterday's news regarding The Current Backfill Window, there is a new "howto" describing a few tricks which might allow your queued jobs to start sooner: Please read Job resource request management.
The Current Backfill Window
The queue information web page now features a section on the current backfill window, which lists the resources (number of nodes and duration) available to run jobs right now, without disturbing the start-time of the currently highest prioritized (queued) job.
This information can be used to have jobs scheduled right away, e.g. for test jobs or jobs which have a flexible need for number of nodes and/or walltime consumption.
Please also read about the scheduling strategy.
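The backfill window described above boils down to a simple check: a job can start right away if it asks for no more nodes and no more walltime than the window offers. The sketch below shows only that check; the function name and the example numbers are hypothetical, and the real scheduler (MAUI) takes more into account than this.

```python
def fits_backfill_window(job_nodes, job_walltime_hours, window_nodes, window_hours):
    """True if the request fits inside the currently free nodes and duration."""
    return job_nodes <= window_nodes and job_walltime_hours <= window_hours


# Hypothetical window: 24 nodes free for the next 6 hours.
window_nodes, window_hours = 24, 6

print(fits_backfill_window(16, 4, window_nodes, window_hours))  # True: fits on both counts
print(fits_backfill_window(16, 8, window_nodes, window_hours))  # False: walltime too long
print(fits_backfill_window(32, 4, window_nodes, window_hours))  # False: too many nodes
```

A job with flexible needs can therefore be made to start sooner by trimming its node count or walltime request until the check passes.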
/wkspace2 and 40 giga-nodes are available again. We have finally completed the journey to recover from the collapse of the RAID towers late July. As of this morning /wkspace2 and the 40 'giga' nodes which had emergency-backups of /wkspace2 are now again available for Scientific Computing.
Thank you for your patience during this arduous process.
/home/disk1 and /wkspace1 are available again. Disk1 is now "open" for business again: /home/disk1 and /wkspace1 are now available. For those of you having files on these systems, please check that all is as expected.
Those who placed a (user) hold on jobs in the queue because the jobs needed access to the above mentioned filesystems: Use the command qrls to release the 'userhold' on these jobs:
rsh infra /usr/pbs/bin/qrls -h u job-id
Use the qstat command to locate these jobs, they will be in state H.
Unavailability of /home/disk1 and /wkspace1 26/9 until 3/10. We are now at the next to last step in recovering after the terrible crash of our RAID-Towers on disk2. We have to restructure the filesystems on disk1. We have to take /home/disk1 and /wkspace1 offline for about 5 working days to build in a new RAID-Tower and secure /home/disk1 on a mirrored RAID-Tower setup.
These filesystems will be unavailable starting Friday 26/9 08:00. disk1 should be fully operational the following Friday.
We ask that you do not submit jobs which ask for either /home/disk1 or /wkspace1 until Friday 3/10.
If you have jobs in queue (*not* running) which require the use of disk1 resources: Please place a hold on those jobs - use qstat to locate the job-id's of these jobs - and then use qhold to place a 'userhold' on these jobs:
rsh infra /usr/pbs/bin/qhold -h u job-id
The Horseshoe will reopen tomorrow September 4 at 14:00. However, we are not quite finished with the process of restructuring and recovering, thus the reopening tomorrow has the following caveats attached:
  • /wkspace2 will not be available for about two weeks.
  • 40 'giga' nodes will be offline until /wkspace2 is restored (they are used as buffer storage for the 1.4 TB of data originating from /wkspace2).
  • /home/disk1 and /wkspace1 have not been restructured.
These issues will be resolved during the next few weeks, as we receive refurbished RAID towers from the vendor. More information will be posted as the restructuring plan progresses.
Approximately 1.2 TB of data on /home/disk2 has been recovered. Most of the directory structure also seems to be intact. In other words, we feel confident to announce:

Most of /home/disk2 has been salvaged. In the meantime we have taken delivery of two additional RAID towers. These RAID towers have to be tested and installed before access and normal operations resume on the cluster. In addition the faulty hard disks in the existing RAID towers have to be exchanged. An ETA for the re-opening of The Horseshoe will be posted very soon.
The recovery efforts for /home/disk2 continue despite some setbacks. We will post more about the situation after the weekend. We have issued a purchase order for two additional RAID towers, which will be delivered next week. Using those new devices /home/disk1 and /home/disk2 will be reconfigured to be hosted on mirrored RAID towers, which will improve the ability to survive disk and RAID tower failures.
We are close to formulating a plan for:
  1. Recovery of data on /home/disk2.
  2. Resuming service on The Horseshoe.
The recovery operation is at the stage where we have a pretty good overview of the amount of low-level information we are able to extract from the RAID towers. How much of the filesystem we are able to reconstruct, we do not know at this time. The vendor will let us use a spare RAID tower for our attempts to rebuild the filesystem.
In order to return to normal operation on The Horseshoe, we need to restructure the layout of filesystems on RAID towers. We are in the process of negotiating with vendors regarding delivery of additional RAID systems to improve our ability to survive hardware failures.
Unfortunately it is too early to offer a realistic timeline for the tasks outlined above. We will continually post information on the website as work progresses.
The RAID system containing /home/disk2 and /wkspace2 has developed into a nightmare. Apparently we have been supplied with a system where all 32 disks were of very poor quality. Since the RAID system just remaps data from bad blocks on one disk onto another, these faults have gone unnoticed. On the 30th of July, however, a run-away user process created ~500 GB of files, completely filling up /home/disk2. This triggered a sequence of unrecoverable errors. We initially thought that we could recover by switching one disk at a time, and rebuilding the file system after each switch. This was unsuccessful. At this point we are faced with a total loss of all data on /home/disk2, but data on /wkspace2 has been salvaged and stored on compute nodes, which have been taken offline. We are in close contact with the supplier of the RAID system, trying to recover at least some of the data, but hopes are slim.
Needless to say, this is a major disaster. We have no estimates of when we will have a definitive answer for the status of /home/disk2, or when users residing on /home/disk2 can return to using the machine.
We are still hunting for "faulty" disks. A service representative from the vendor company will come and help us upgrade the firmware on the RAID towers.
The Horseshoe is still down. We are not at the end of the process of finding "faulty" disks in the RAID towers.
fe1 will allow logins - but only users with homedirectories on /home/disk1 can access files.
The Horseshoe is still down. We do not yet know when we return to normal service. The vendor of the RAID towers has supplied us with a batch of new hard disks. We are in the process of identifying "faulty" disks and replacing them.
It's a tedious process, as we can only replace one disk at a time: When a disk has been replaced, the RAID has to be rebuilt, a process which can take up to 6 hours (we can then replace another).
It is not at present known how many disks we have to exchange.
The hardware problems on the RAID systems attached to disk2 are continuing this morning. We have determined that the problems are of such a severe character that we have to cease our attempts to restart the system. We have contacted the manufacturer to inquire about the next appropriate steps.
At this stage we do not expect any data loss, but due to instabilities on multiple disks, we need to proceed carefully.
Users having their homedirectories on /home/disk1 should be able to continue using the cluster, as long as they do not require the use of /wkspace2.
We are experiencing severe RAID system problems where disks are dropping out one after the other. After one incident we had to reset the server, which crashed the filesystem; it is being repaired as this is written.
We do not expect disk2 to be up until tomorrow morning, if all goes well.
This part of Odense was blessed with a power failure. The cluster nodes are not protected by any UPS, hence all nodes lost power.
A new queue has been created on the cluster. It's called express and should be used for testing purposes only. The 4 frontends are used as execution hosts for this queue. Jobs submitted to this queue can only request 20 minutes of walltime. Since the frontends are used by interactive users, the local /scratch partition is not guaranteed to be as available as on the compute nodes.
A new webpage has been created, which presents a listing of queued jobs based on priority, rather than the chronology of when jobs were submitted to the queues offered by PBS's qstat. The priority based listing is used by MAUI to make scheduling decisions, i.e. the order in which jobs are selected from the queues and allowed to run.
As agreed at the users meeting 1/5/2003 there are now limits on the walltime a job can request:
-- 200 hours when submitted to workq
-- 50 hours when submitted to giga
Jobs running and queued as of noon 7/5/2003 will not be affected. Jobs submitted to the queues after this point in time are subject to removal. If a job is removed from the queue, the user will be notified by email. In case it is not possible to use the method outlined in PBS multipart jobs for running very long lasting jobs - please contact Frank Jensen to request an exception to the new queue policy.
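As a rough illustration of what the limits mean for a long computation, the sketch below works out how many parts a job must be split into so that each part stays within the walltime limit of its queue. It assumes the job can be stopped and restarted at arbitrary points, which is not a given; the function name is hypothetical, and this is not the method described in PBS multipart jobs.

```python
import math

# Walltime limits per queue, in hours, as stated in the announcement.
WALLTIME_LIMIT_HOURS = {"workq": 200, "giga": 50}


def parts_needed(total_hours, queue):
    """Fewest parts such that each part fits within the queue's walltime limit."""
    return math.ceil(total_hours / WALLTIME_LIMIT_HOURS[queue])


print(parts_needed(500, "workq"))  # prints 3
print(parts_needed(120, "giga"))   # prints 3
```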
We are experiencing severe problems with the queuing system - bugs have been exposed in the PBS software. Until the problem with PBS has been fixed there is a cap of 128 nodes per job in place.
All of the University of Southern Denmark in Odense lost main power at around 21:02 tonight.
As the computing nodes are not covered by UPS (by decision) all nodes have been restarted.
We are investigating strange problems with the MAUI scheduling system, which currently does not populate the cluster with jobs in a deterministic manner.
We hope to be on top of this as soon as possible.
We went on-line again today.
Current users have received information on change in use of the cluster. In particular there are some constraints on the new part of the cluster that will be enforced by the scheduling system.
New users can apply for their account.
The Overview page contains information on the new cluster setup.
We have severe stability problems with the disk systems on both disk servers.
There is no estimate on when we can be back online.
We are working on bringing the system online this minute.
Unfortunately the cooling system contractors are still working on the fan coil in the room.
As Claus also needs to be certain that the new version of PBSpro works satisfactorily we need some more time.
Please be patient while checking this page.
Unfortunately we will not be able to start this Friday either.
We hope to have the old machines up and running this Friday but we need to tune the queue system.
The new machines still need to have Linux installed but this can wait till after the old system is brought online again.
A cautious guess would be Monday afternoon, but we need to see our families too.
The cooling system contractor finished the installation work on Friday but did not have time to start the compressor and tune the new air-flow system.
Due to demands from the cooling system contractor we have not had a chance to install the hard disks and configure the BIOS of the new machines, which we hope to be able to do Wednesday.
We have had problems with the new disk systems (and partly still have), which is why we are only now, at this late stage, transferring user data.
We also planned to reinstall all the old machines but could not do so due to all the dust in the room which forced us to keep as much as possible off line.
Hence we are a long time from being able to start the cluster especially the extension part.
We anticipate being able to let users start their jobs Friday 2003-03-21 at the latest. This might only be on the old system for a start.
This has taken much more time than we had expected mainly due to design changes in the cooling system installation after 6 days downtime but also due to the disk system problems which were solved this Friday.
Changed rebuild schedule (until revised):
  • 2003-03-10: The cooling equipment is not built yet. It is about 50% finished.
  • 2003-03-17: New target date for restart (hopefully before then).
Changed rebuild schedule (until revised):
  • 2003-02-27: Machine will be shut down for work on the cooling system
  • 2003-03-06: The machine will be brought up for benchmarking - this will take 1-2 days.
  • 2003-03-10: This is the target for restarting the machine.
The main reason for the change of schedule is a wish from the company doing the installations needed for the extra cooling.
The application form has been revised to cover the new grant holders.
Applicants with CPU year allocations are listed by name. All other grant holders should use Other.
We are really low on remaining disk space. We have 60 GB left in the home directory store and 23 GB in scratch space.
Until the rebuild you can use /mount/tmp to store data, but be warned that /mount/tmp will be erased during the rebuild at the end of the month. You must therefore copy any data stored there off Horseshoe yourself before February 27th.
Changed rebuild schedule (until revised):
  • 2003-02-27: Machine will be shut down for work on the cooling system
  • 2003-03-04: The machine will be brought up for benchmarking - this will take 1-2 days.
  • 2003-03-06/07: This is the target for restarting the machine.
Current rebuild schedule (until revised):
  • 2003-02-27: Machine will be shut down for work on the cooling system
  • 2003-03-04: This is the target for restarting the machine.
We have now finalized the plans for extending the Horseshoe cluster computer. The extension will consist of 140 nodes (2.66 GHz Pentium-4, 1 GB DDR RAM, 120 GB disk) connected by Gb Ethernet. The installation is expected to take place in week 9 (Feb 27-Mar 3) and will require taking the machine down for at least a couple of days. The exact schedule will depend on when the nodes are delivered, arrival of the extra cooling system, additional installation of power, etc.
In addition to the computing upgrade, the disk system for permanent files will be extended by ~5 TB. We are aware that /home is running 90+% full, but release of more disk space requires taking the server offline, thereby interrupting all running jobs. Since we will be forced to take the machine down within one month anyway, we will postpone the disk repartition until week 9. If any users are in desperate need of more disk space before that, contact admin@dcsc.sdu.dk. Otherwise I urge all users to do a little disk cleaning and remove unused files.
We plan a users meeting immediately after installation of the additional nodes, hopefully in early March, where details of the installation and guidelines for using the new nodes will be discussed.
Web site amended with Benchmarks and deep link for statistics.
Cluster-Computing Course October 15 and 16.
Info on course is posted here: Course
Today we will be connecting the Odense and Lyngby installations. We will use the dark fiber parallel to the Forskningsnettet to establish a 1 Gbps line connecting a total of 992 nodes together on one single LAN.
While the connection is present we will perform a LINPACK benchmark on both clusters combined.
During the benchmark the PBS queue will be suspended and running jobs paused. They will be resumed after the benchmark.
Today we will shut down the cluster for the last tweaking.
Another article has been spotted: Hardware-test.dk.
Furthermore there is an article in the paper version of IT2U from Dansk Metal.
Due to a further delay in the delivery of the last machines and the server, we do not expect to be able to start running grant jobs before August 1st. Let us hope that things change.
Article in Computer World and Fyens Stiftstidende (login required)
Grand Press Day
Reporters and camera teams have been visiting us today. Keep an eye on TV2/Fyn (19:30), TV2 and DR1 (18:30) tonight.
Compaq is not particularly clear in their information on arrival of machines.
Until now we have received 120 machines and we expect a further 260 machines today, even though all machines should have been here Thursday last week (according to Compaq's own information).
Compaq started the build of the last 140 machines Friday last week.
You can now start requesting your accounts using the account application form.
Due to a postponed delivery of the machines we now anticipate a start date of July 15.
We expect to be able to start the whole system around July 1.