Checkpointing

Thyme Lawn
 
Joined: 11 Apr 10
Posts: 54
Credit: 382,341
RAC: 0
Total hours: 639,907
Message 238 - Posted: 20 May 2010, 17:13:10 UTC

Data collect 1.25 (NCI) normally checkpoints every 5 minutes on my systems but occasionally it runs for longer periods without checkpointing. The current task on my Q6600 XP system has just had the most extreme example of this behaviour that I've seen so far, running for 5 hours without performing a checkpoint:

20/05/2010 12:50:35 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed
20/05/2010 17:50:36 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed

The output file shows that a cycle was still being performed every 5 minutes, with the checkpoints corresponding to cycles 43 and 103:

12:50:35 (3584): cycle43
17:50:35 (3584): cycle103
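As a quick sanity check on the numbers above, the two output-file entries can be turned into an average cycle length. This is only a sketch: the function name `cycle_interval_minutes` is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def cycle_interval_minutes(start, end, first_cycle, last_cycle, fmt="%H:%M:%S"):
    """Average minutes per cycle between two output-file entries.

    Hypothetical helper; assumes both timestamps are on the same day.
    """
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    cycles = last_cycle - first_cycle
    return elapsed.total_seconds() / 60 / cycles

# Values taken from the two output-file lines quoted above.
interval = cycle_interval_minutes("12:50:35", "17:50:35", 43, 103)
print(interval)  # 60 cycles over 300 minutes -> 5.0
```

So the 60 cycles between cycle43 and cycle103 average exactly 5 minutes each, which matches the normal cycle length even though only two checkpoints were logged.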

[AF>WildWildWest] Sebastien
     
Dictator
Joined: 28 Mar 10
Posts: 2678
Credit: 513,759
RAC: 95
Total hours: 1,427,586
Message 239 - Posted: 20 May 2010, 19:48:06 UTC - in response to Message 238.

The application checkpoints every 5 minutes.
Not all of the checkpoints are reported in the Messages tab, probably because the application is not CPU-intensive.
Take a look at the contents of the checkpoint file: you will see that it is modified every 5 minutes.
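One way to check the checkpoint file without relying on the Messages tab is to sample its modification time at intervals. The sketch below is an illustration only: the helper names `sample_mtimes` and `count_modifications` are invented here, and the file path in the usage comment is a placeholder, since the actual location of the checkpoint file in the slot directory is not given in this thread.

```python
import os
import time

def sample_mtimes(path, samples, interval_s, sleep=time.sleep):
    """Record the file's modification time `samples` times, `interval_s` seconds apart."""
    mtimes = []
    for i in range(samples):
        mtimes.append(os.path.getmtime(path))
        if i < samples - 1:
            sleep(interval_s)
    return mtimes

def count_modifications(mtimes):
    """Count how many consecutive samples show a changed modification time."""
    return sum(1 for before, after in zip(mtimes, mtimes[1:]) if before != after)

# Hypothetical usage (placeholder path, 4 samples taken 5 minutes apart):
#   mtimes = sample_mtimes("path/to/slot/checkpoint", 4, 300)
#   print(count_modifications(mtimes))
```

If the file really is rewritten every 5 minutes, each new sample should show a different modification time from the previous one.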
____________

Neil Polson
   
Joined: 20 Apr 10
Posts: 20
Credit: 81,989
RAC: 0
Total hours: 33,181
Message 616 - Posted: 27 Sep 2011, 6:16:44 UTC

It appears that the new app, Data collect version 3 v3.25 (nci), is failing to checkpoint properly. On a restart the task starts from the beginning. My Windows host reports "Erreur assignation project_name (node project)" and then "checkpoint failed File exists" every minute in the stderr_txt output.
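The "checkpoint failed File exists" message would be consistent with a checkpoint being written to a temporary file and then renamed over the old one, since a plain rename on Windows fails when the destination already exists. That is only a guess; nothing in this thread shows how the application actually writes its checkpoint. As a sketch of the general pattern (not the project's code, and with a hypothetical function name `write_checkpoint`), Python's `os.replace` overwrites an existing destination on both Windows and POSIX:

```python
import os
import tempfile

def write_checkpoint(path, data):
    """Write `data` to `path` via a temporary file, replacing any existing checkpoint.

    Hypothetical helper for illustration; not taken from the application.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites an existing destination, whereas
        # os.rename raises FileExistsError on Windows in that case.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `write_checkpoint` repeatedly with the same path leaves only the most recent contents in place.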
____________

