Checkpointing
Author | Message |
---|---|
Joined: 11 Apr 10 Posts: 54 Credit: 382,341 RAC: 0 |
Data collect 1.25 (NCI) normally checkpoints every 5 minutes on my systems, but occasionally it runs for longer periods without checkpointing. The current task on my Q6600 XP system has just had the most extreme example of this behaviour that I've seen so far, running for 5 hours without performing a checkpoint:

20/05/2010 12:50:35 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed
20/05/2010 17:50:36 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed

The output file shows that a cycle was still being performed every 5 minutes, with the checkpoints corresponding to cycles 43 and 103:

12:50:35 (3584): cycle43
17:50:35 (3584): cycle103 |
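For anyone wanting to check their own logs the same way, here is a minimal Python sketch that measures the gap between the two [checkpoint_debug] messages quoted above and compares it with the cycle count. It assumes the message format is exactly as in those two lines (date and time as the first two fields); the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# The two client messages quoted in the post above.
LOG_LINES = [
    "20/05/2010 12:50:35 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed",
    "20/05/2010 17:50:36 WUProp@Home [checkpoint_debug] result wu_1274213743_6997_0 checkpointed",
]

def checkpoint_times(lines):
    """Return the timestamps of lines carrying the [checkpoint_debug] flag."""
    times = []
    for line in lines:
        if "[checkpoint_debug]" in line:
            # Assumption: the first two whitespace-separated fields are date and time.
            stamp = " ".join(line.split()[:2])
            times.append(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"))
    return times

def gaps_in_seconds(times):
    """Seconds between each pair of consecutive reported checkpoints."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

times = checkpoint_times(LOG_LINES)
gaps = gaps_in_seconds(times)

# Cycles 43 and 103 are 60 cycles apart; at one cycle per 5 minutes that is 5 hours.
cycles_elapsed = 103 - 43
expected_seconds = cycles_elapsed * 5 * 60

print(gaps, expected_seconds)
```

The reported gap and the cycle count agree to within a second, which fits the observation that the cycles kept running on schedule while only the reported checkpoints went missing.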
Joined: 28 Mar 10 Posts: 2869 Credit: 538,363 RAC: 138 |
The application checkpoints every 5 minutes, but not all of the checkpoints are reported in the Messages tab, probably because the application is not CPU-intensive. Take a look at the contents of the checkpoint file: you will see that it is modified every 5 minutes. |
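One way to do that check without opening the file by hand is to look at its modification time. The sketch below is only an illustration: the helper names, the two-interval slack, and the demo on a temporary file are assumptions, and you would need to point it at the actual checkpoint file in the task's slot directory yourself.

```python
import os
import tempfile
import time

def seconds_since_modified(path, now=None):
    """Seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)

def looks_stalled(path, interval_seconds=300, slack=2.0, now=None):
    """True if the file is older than `slack` checkpoint intervals.

    interval_seconds=300 matches the 5-minute cycle mentioned in this thread;
    the slack factor is an arbitrary choice for this sketch.
    """
    return seconds_since_modified(path, now) > interval_seconds * slack

# Demo on a throwaway file standing in for the real checkpoint file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

mtime = os.path.getmtime(demo_path)
fresh = looks_stalled(demo_path, now=mtime + 60)    # modified 1 minute ago
stale = looks_stalled(demo_path, now=mtime + 3600)  # modified 1 hour ago
os.remove(demo_path)

print(fresh, stale)
```

If the file keeps getting a new modification time every 5 minutes while the Messages tab stays quiet, the checkpoints are being written and only the reporting is missing.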
Joined: 20 Apr 10 Posts: 20 Credit: 81,989 RAC: 0 |
It appears that the new app, Data collect version 3 v3.25 (NCI), is failing to checkpoint properly: on a restart, the task starts again from the beginning. My Windows host reports "Erreur assignation project_name (node project)" (roughly, "error assigning project_name (project node)") and then "checkpoint failed File exists" every minute in the stderr output (stderr_txt). |
©2024 Sébastien