Checkpointing

Message boards : Number crunching : Checkpointing

Thyme Lawn
Joined: 11 Apr 10
Posts: 54
Credit: 382,341
RAC: 0
Message 238 - Posted: 20 May 2010, 17:13:10 UTC

Data collect 1.25 (NCI) normally checkpoints every 5 minutes on my systems but occasionally it runs for longer periods without checkpointing. The current task on my Q6600 XP system has just had the most extreme example of this behaviour that I've seen so far, running for 5 hours without performing a checkpoint:

20/05/2010 12:50:35	WUProp@Home	[checkpoint_debug] result wu_1274213743_6997_0 checkpointed
20/05/2010 17:50:36	WUProp@Home	[checkpoint_debug] result wu_1274213743_6997_0 checkpointed

The output file shows that a cycle was still being performed every 5 minutes, with the checkpoints corresponding to cycles 43 and 103:

12:50:35 (3584): cycle43
17:50:35 (3584): cycle103
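
The gap between the two reported checkpoints can be measured directly from the log lines quoted above. This is only a minimal sketch: it assumes the Messages tab text is tab-separated exactly as pasted in this post, and the helper names (`checkpoint_times`, `long_gaps`) are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# The two lines quoted in the post (timestamp, project, message; tab-separated).
LOG = (
    "20/05/2010 12:50:35\tWUProp@Home\t[checkpoint_debug] result wu_1274213743_6997_0 checkpointed\n"
    "20/05/2010 17:50:36\tWUProp@Home\t[checkpoint_debug] result wu_1274213743_6997_0 checkpointed\n"
)

PATTERN = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"\[checkpoint_debug\] result (\S+) checkpointed$"
)

def checkpoint_times(log_text):
    """Return {result_name: [datetime, ...]} for every checkpoint_debug line."""
    times = {}
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
            times.setdefault(match.group(2), []).append(stamp)
    return times

def long_gaps(stamps, expected=timedelta(minutes=5), slack=timedelta(seconds=30)):
    """Return the gaps between consecutive reported checkpoints that exceed expected + slack."""
    ordered = sorted(stamps)
    return [b - a for a, b in zip(ordered, ordered[1:]) if b - a > expected + slack]

times = checkpoint_times(LOG)
gaps = long_gaps(times["wu_1274213743_6997_0"])
print(gaps[0])  # 5:00:01

# Cross-check against the output file: cycles 43 and 103, one cycle every 5 minutes.
print((103 - 43) * timedelta(minutes=5))  # 5:00:00
```

The 60 cycles between cycle43 and cycle103 account for the full 5 hours, which matches the observation that cycles kept running while no checkpoint message was logged.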

[AF>WildWildWest] Sébastie...
Project administrator
Joined: 28 Mar 10
Posts: 2869
Credit: 538,377
RAC: 135
Message 239 - Posted: 20 May 2010, 19:48:06 UTC - in response to Message 238.  

The application checkpoints every 5 minutes.
Not all checkpoints are reported in the Messages tab, probably because the application is not CPU-intensive.
Take a look at the contents of the checkpoint file: you will notice that it is modified every 5 minutes.
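
One way to check this without opening the file by hand is to poll its modification time. The sketch below is an illustration only: `watch_mtime` is a made-up helper, and the location of the checkpoint file on your host is an assumption you need to fill in yourself.

```python
import os
import time
from datetime import datetime

def watch_mtime(path, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Yield a datetime every time the modification time of `path` changes.

    The first observed modification time is yielded too. Polling stops after
    `max_polls` polls, or never if `max_polls` is None.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            mtime = os.path.getmtime(path)
        except FileNotFoundError:
            mtime = None  # the file may not have been written yet
        if mtime is not None and mtime != last:
            last = mtime
            yield datetime.fromtimestamp(mtime)
        polls += 1
        sleep(poll_seconds)

# Example use (runs until interrupted; the path is a placeholder, not a real location):
# for changed in watch_mtime("path/to/checkpoint"):
#     print("checkpoint file modified at", changed.strftime("%H:%M:%S"))
```

If the application behaves as described, the printed times should be roughly 5 minutes apart even when the Messages tab shows no checkpoint entry.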

Neil Polson
Joined: 20 Apr 10
Posts: 20
Credit: 81,989
RAC: 0
Message 616 - Posted: 27 Sep 2011, 6:16:44 UTC

It appears that the new app, Data collect version 3 v3.25 (nci), is failing to checkpoint properly: on a restart, the task starts from the beginning. My Windows host reports "Erreur assignation project_name (node project)" and then "checkpoint failed File exists" every minute in the stderr output (stderr_txt).
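
The thread does not say what causes "checkpoint failed File exists". One pattern known to produce this kind of error on Windows is renaming a temporary file over an existing checkpoint, because a plain rename fails there when the destination already exists. Assuming that is what happens here (an unverified guess, not something the project has confirmed), a checkpoint write that overwrites safely could look roughly like this. The function names and file layout are hypothetical and are not taken from the Data collect application.

```python
import os
import tempfile

def write_checkpoint(path, data):
    """Write `data` to `path` so that a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites an existing destination on POSIX and Windows;
        # os.rename raises FileExistsError on Windows in that situation.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def read_checkpoint(path, default=""):
    """Return the saved checkpoint text, or `default` if none exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        return default
```

With a write like this, the second and later checkpoints replace the first instead of failing, so a restarted task could resume from the last saved state rather than from the beginning.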

©2024 Sébastien