Problem with v4.xx WUs
Author | Message |
---|---|
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
Hello forum, I have a problem with v4.xx WUs on this host. On Sep 18 I noticed a v4.00 WU that had been running for more than 44 hours at 100% progress. I aborted it. Yesterday morning I again found a v4.01 WU that had been running for more than 3 hours at 100%. Again, I aborted it. The same happened last night with this v4.01 WU. The next WU to start was again a v4.01 WU. This time I thought I would take a closer look at what was happening. The WU started running, and after about 7.5 min it reached 100% and the remaining time was zero. However, the WU kept running. Right now the elapsed time is about 11 hours. I have another host with the same motherboard (ASUS A8V DeLuxe), same CPU (AMD Opteron 180) and same OS (Windows 8), which does not have this problem. Yet another host with the same hardware, only running Windows XP, also does not have this problem. What is going on here? |
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
Hello again, I am still having the same problem on this host, but now with a v4.03 WU. It has been running for 11.5 hours now at 100%. All my other hosts, including the ones I mentioned previously, are running v4.04 WUs and are doing fine. I will abort this v4.03 WU, so that the host will probably get a v4.04 WU, and see what happens. Regards, Ruud |
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
The v4.04 WU ran without a problem (2 of them have finished now). At the start of the WU I observed some strange behavior: the remaining time decreased slowly (from an unknown, but supposedly low value) to about 30 sec. But after the WU had run for about 10 min, the remaining time started to increase again. In the end we can conclude that I have no idea what the problem was, but it seems to have been solved. As we say in Dutch: de techniek lost alles op. This could be translated as: With technology you can solve anything. But it can also be translated as: Technology can solve anything (on its own). Right now I see a v4.08 WU is running. Ruud |
Send message Joined: 20 Jun 12 Posts: 63 Credit: 94,685 RAC: 0 |
WUProp apps are designed to run longer if the network is disabled in BOINC. Long-running tasks give correspondingly more credit. - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
I was a little premature. The v4.04 WUs were running fine, but the new v4.08 WUs show the same problem, see here. And I am running my systems 24/7, and they are on the network 24/7. |
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
Another update: The last (now aborted) v4.08 WU seemed to run normally until it reached an elapsed time of about 2.5 hours. At that point the remaining time had gone gracefully to zero and progress was at 100%. However, it kept running until I aborted it (at 4h20m). |
Send message Joined: 20 Jun 12 Posts: 63 Credit: 94,685 RAC: 0 |
Next time, don't abort. Instead, check whether the files are still being updated. Go to ...\projects\wuprop.boinc-af.org\ and see whether these files are updated every minute: cache and wu_v4_{some-numbers}. Then go to ...\slots\ and find the slot for data_collect_v4_4.08_windows_intelx86__nci. Check these files: stderr.txt; boinc_task_state.xml (and <fraction_done>0.xxxx</fraction_done> inside it); checkpoint (the numbers inside it change every minute). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
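For anyone who would rather script this check than watch the folders by hand, here is a minimal Python sketch. It only compares each file's last-modified time against the clock. The function names and the 2-minute threshold are illustrative assumptions, not part of BOINC or WUProp, and the directory is passed in by the caller because the actual path depends on the installation.

```python
import os
import time
from pathlib import Path


def minutes_since_update(path, now=None):
    """Return how many minutes ago `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def report_freshness(directory, names, max_age_minutes=2.0, now=None):
    """Map each file name to 'fresh', 'stale' or 'missing'.

    `max_age_minutes` is an assumed threshold: the files are expected to
    change about once a minute, so 2 minutes leaves some slack.
    """
    result = {}
    for name in names:
        path = Path(directory) / name
        if not path.exists():
            result[name] = "missing"
        elif minutes_since_update(path, now) <= max_age_minutes:
            result[name] = "fresh"
        else:
            result[name] = "stale"
    return result


# Example call (the slot directory is whatever your own installation uses):
# report_freshness(slot_dir, ["stderr.txt", "boinc_task_state.xml", "checkpoint"])
```

A "stale" result is only a hint: the time stamp shown by the operating system can lag behind the real writes for a file that is still open, so it is worth opening the file as well before concluding that the app has stopped.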
Send message Joined: 27 May 10 Posts: 15 Credit: 1,598,733 RAC: 274 |
Thank you for your advice. We are talking about this WU on this host. The WU showed the same behaviour as the previous one, and has been running for 20+ hours now. I checked the files in ...\projects\wuprop.boinc-af.org\: cache is updated every minute; wu_v4_1379769294_106959_0_0 is not updated every minute, but has a time stamp of about 30 min. ago. I also looked into the ...\slots\ directory: checkpoint is updated every minute; boinc_task_state.xml is updated almost every minute, though sometimes an update is skipped for 1 or 2 minutes; stderr.txt has a time stamp of earlier this morning. If I open the file, I see that the messages ((4916): facteur correction: 0.500000) are added every minute, and the last message has the current time stamp. When I close the file, it has the current time stamp. (I think this is one of those strange Windows behaviors.) All other files in ...\slots\ have a time stamp of earlier this morning, about an hour ago (the same as the first message in the stderr.txt file). Hope this gives you some more information. Thanks again, Ruud |
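Besides the time stamps, the progress value stored in boinc_task_state.xml can be read directly. The sketch below is an assumption-laden illustration: it relies only on the `<fraction_done>` tag quoted earlier in the thread and uses a regular expression, so it makes no claim about the rest of the file's layout. The function name is made up for this example.

```python
import re

# Matches the tag quoted in the thread: <fraction_done>0.xxxx</fraction_done>
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.eE+-]+)\s*</fraction_done>")


def read_fraction_done(xml_text):
    """Return the fraction_done value as a float, or None if the tag is absent."""
    match = FRACTION_RE.search(xml_text)
    return float(match.group(1)) if match else None


# Example with a made-up snippet (not copied from a real task):
snippet = "<fraction_done>0.2500</fraction_done>"
print(read_fraction_done(snippet))
```

Reading the value twice, a few minutes apart, shows whether the reported progress is still changing or has settled at 1.0 while the task keeps running.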
Send message Joined: 20 Jun 12 Posts: 63 Credit: 94,685 RAC: 0 |
If the files are being updated, the app (data_collect_v4_4.08_windows_intelx86__nci.exe) is not stuck or hung. Copy those files in case the project admin wants to know what is in them. (All are text files and can be opened with Notepad. If they look strange (no line breaks), try WordPad. I use F3 View in Total Commander.) Check the Activity menu for the Network activity setting (try setting it to 'Network activity always available'): http://boinc.berkeley.edu/wiki/Advanced_view#BOINC_Manager_Menus Look in the Event Log: do you see the following messages appear every hour? 24/09/2013 14:59:14 WUProp@Home Sending scheduler request: Requested by project. 24/09/2013 14:59:14 WUProp@Home Not reporting or requesting tasks 24/09/2013 14:59:16 WUProp@Home Scheduler request completed If the Network activity is set correctly, I am out of ideas. We may need the project admin to look at this case. - ALF - "Find out what you don't do well ..... then don't do it!" :) |
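If the Event Log has been copied into a text file, the "every hour" check can be done with a few lines of Python. This is a rough sketch that assumes each line starts with a `dd/mm/yyyy hh:mm:ss` time stamp, as in the lines quoted above; other locales print dates differently, so the format string may need changing. The function names are illustrative only.

```python
from datetime import datetime

# Assumed time stamp layout, taken from the lines quoted in the post.
STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def scheduler_request_times(lines, project="WUProp@Home"):
    """Return the time of every 'Sending scheduler request' line for `project`."""
    times = []
    for line in lines:
        if project not in line or "Sending scheduler request" not in line:
            continue
        parts = line.split()
        stamp = " ".join(parts[:2])  # date and time are the first two fields
        times.append(datetime.strptime(stamp, STAMP_FORMAT))
    return times


def gaps_in_minutes(times):
    """Return the gap in minutes between consecutive scheduler requests."""
    return [(b - a).total_seconds() / 60.0 for a, b in zip(times, times[1:])]


# The three lines quoted in the post contain a single request, so there is
# no gap to measure yet; a longer log is needed for a real check.
log_from_post = [
    "24/09/2013 14:59:14 WUProp@Home Sending scheduler request: Requested by project.",
    "24/09/2013 14:59:14 WUProp@Home Not reporting or requesting tasks",
    "24/09/2013 14:59:16 WUProp@Home Scheduler request completed",
]
print(gaps_in_minutes(scheduler_request_times(log_from_post)))
```

Gaps of roughly 60 minutes match what the post describes; much longer gaps, or no requests at all, would point towards the network setting.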
Send message Joined: 28 Mar 10 Posts: 2869 Credit: 538,367 RAC: 137 |
Thank you for your advice. If the problem occurs again, can you post the content of the file named wu_v4_{some-numbers}? |
Send message Joined: 20 Jun 12 Posts: 63 Credit: 94,685 RAC: 0 |
He aborted again! http://wuprop.boinc-af.org/result.php?resultid=33135107 Can you save the uploaded wu_v4_1379769294_73193_0 from the server, or has it been purged already? - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Send message Joined: 17 Mar 13 Posts: 1 Credit: 40,328 RAC: 0 |
I've got another problem. The download was OK and the task began to work: 16.09.2013 01:25:20 | WUProp@Home | Scheduler request completed: got 1 new tasks 16.09.2013 01:25:22 | WUProp@Home | Started download of data_collect_v4_4.00_windows_intelx86__nci.exe 16.09.2013 01:25:24 | WUProp@Home | Finished download of data_collect_v4_4.00_windows_intelx86__nci.exe 16.09.2013 01:25:24 | WUProp@Home | Starting task wu_v4_1379273943_1285_0 using data_collect_v4 version 400 (nci) in slot 7 28.09.2013 07:19:22 | WUProp@Home | Result wu_v4_1379273943_1285_0 is no longer usable 28.09.2013 07:19:24 | WUProp@Home | Computation for task wu_v4_1379273943_1285_0 finished 28.09.2013 07:19:32 | WUProp@Home | Sending scheduler request: To report completed tasks. [Runtime was ~294 h, but not a minute has been counted on my account.] 28.09.2013 07:19:32 | WUProp@Home | Reporting 1 completed tasks 28.09.2013 07:19:32 | WUProp@Home | Requesting new tasks for CPU 28.09.2013 07:19:37 | WUProp@Home | Scheduler request completed: got 1 new tasks 28.09.2013 07:19:39 | WUProp@Home | Started download of data_collect_v4_4.08_windows_x86_64__nci.exe 28.09.2013 07:19:41 | WUProp@Home | Finished download of data_collect_v4_4.08_windows_x86_64__nci.exe 28.09.2013 07:19:41 | WUProp@Home | Starting task wu_v4_1379769294_270958_0 using data_collect_v4 version 408 (nci) in slot 8 The new task has now been running for ~180 h. |
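The ~294 h figure can be checked from the "Starting task" and "Computation ... finished" time stamps in the log above. The short Python sketch below does just that arithmetic; it assumes the `dd.mm.yyyy hh:mm:ss` layout shown in this log, and the helper name is made up for the example.

```python
from datetime import datetime

# Time stamp layout as it appears in the quoted log (an assumption for other hosts).
STAMP_FORMAT = "%d.%m.%Y %H:%M:%S"


def elapsed_hours(start_stamp, end_stamp):
    """Return the wall-clock hours between two log time stamps."""
    start = datetime.strptime(start_stamp, STAMP_FORMAT)
    end = datetime.strptime(end_stamp, STAMP_FORMAT)
    return (end - start).total_seconds() / 3600.0


# "Starting task ..." and "Computation for task ... finished" from the log.
hours = elapsed_hours("16.09.2013 01:25:24", "28.09.2013 07:19:24")
print(round(hours, 1))  # prints 293.9
```

That is 12 days plus 5 h 54 min, which agrees with the runtime quoted in the post.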
Send message Joined: 20 Jun 12 Posts: 63 Credit: 94,685 RAC: 0 |
What you can do to help is already posted. - ALF - "Find out what you don't do well ..... then don't do it!" :) |
©2024 Sébastien