Problem with v4.xx WUs

Message boards : Number crunching : Problem with v4.xx WUs

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1685 - Posted: 21 Sep 2013, 9:31:19 UTC

Hello forum,

I have a problem with v4.xx WUs on this host.
On Sep 18 I noticed a v4.00 WU running with an elapsed time of more than 44 hours at 100% progress. I aborted it.
Yesterday morning I again found a WU, this time v4.01, that had been running for more than 3 hours at 100%. Again, I aborted it. The same happened last night with this v4.01 WU. The WU that started next was again a v4.01 WU.
This time I thought I would take a closer look at what is happening.
The WU started running, and after about 7.5 min it reached 100% and the remaining time was zero. However, the WU kept running. Right now the elapsed time is about 11 hours.
I have another host with the same MB (ASUS A8V DeLuxe), same CPU (AMD Opteron 180) and same O/S (Windows 8), which does not have this problem.
Yet another host with the same HW, but running Windows XP, also does not have this problem.
What is going on here?

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1701 - Posted: 22 Sep 2013, 8:18:53 UTC

Hello again,

I am still having the same problem on this host, but now it is a v4.03 WU.
It has been running at 100% for 11.5 hours now. All my other hosts, including the ones I mentioned previously, are running v4.04 WUs and are doing fine.
I will abort this v4.03 WU, so the host will probably get a v4.04 WU, and see what happens.

Regards,
Ruud

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1702 - Posted: 22 Sep 2013, 14:30:23 UTC

The v4.04 WU ran without a problem. (2 of them have finished now)
At the start of the WU I observed some strange behavior: The remaining time decreased (from an unknown, but supposedly low value) slowly to about 30 sec.
But after the WU had run for about 10 min. the remaining time started to increase again.
In the end, all I can conclude is that I have no idea what the problem was, but it seems to have been solved.
As we say in Dutch: de techniek lost alles op. Which could be translated as: With technology you can solve anything. But it can also be translated as: Technology can solve anything (on its own).
Right now I see a v4.08 WU is running.

Ruud

Profile BilBg
Joined: 20 Jun 12
Posts: 63
Credit: 94,685
RAC: 0
Total hours: 108,788
Message 1704 - Posted: 23 Sep 2013, 3:19:16 UTC - in response to Message 1685.


WUProp apps are designed to run longer if the network is disabled in BOINC.
Longer-running tasks give correspondingly more credit.


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1705 - Posted: 23 Sep 2013, 5:55:48 UTC

I was a little premature.
The v4.04 WUs were running fine, but the new v4.08 WUs show me the same problem, see here.

And I am running my systems 24/7, and they are on the network 24/7.

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1706 - Posted: 23 Sep 2013, 10:13:52 UTC

Another update:
The last (now aborted) v4.08 WU seemed to run normally until it reached an elapsed time of about 2.5 hours.
At that point remaining time had gone gracefully to zero and progress was at 100%.
However, up until the point where I aborted it (4h20m) it kept running.

Profile BilBg
Joined: 20 Jun 12
Posts: 63
Credit: 94,685
RAC: 0
Total hours: 108,788
Message 1708 - Posted: 23 Sep 2013, 19:49:52 UTC - in response to Message 1706.


Next time don't abort.
Instead check if the files are still updating:

Go to ...\projects\wuprop.boinc-af.org\
and see if files are updated every minute:
cache
wu_v4_{some-numbers}

Go to ...\slots\ and find the slot for data_collect_v4_4.08_windows_intelx86__nci
Check the files:
stderr.txt
boinc_task_state.xml (and <fraction_done>0.xxxx</fraction_done> inside it)
checkpoint (numbers inside it change every minute)
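
The file checks above can also be scripted. What follows is only a rough Python sketch, not something provided by the project: the function names and the 120-second threshold are assumptions chosen for illustration, and you pass in the full paths yourself, since they depend on where your BOINC data directory is.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def stale_files(paths, max_age_seconds=120, now=None):
    """Return the paths that were NOT modified within `max_age_seconds`.

    120 s is an assumed tolerance for files expected to change every minute.
    """
    return [
        p for p in paths
        if seconds_since_modified(p, now) > max_age_seconds
    ]
```

Call `stale_files([...])` with the paths of `cache`, `checkpoint` and `boinc_task_state.xml`; an empty list means all of them changed recently. A time stamp may lag behind the content while the app still holds a file open, so a "stale" result on its own does not prove the app is stuck.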


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Ruud van der Kroef
 
Joined: 27 May 10
Posts: 15
Credit: 1,543,577
RAC: 317
Total hours: 2,758,011
Message 1709 - Posted: 24 Sep 2013, 7:14:06 UTC - in response to Message 1708.

Thank you for your advice.

We are talking about this WU on this host. The WU showed the same behaviour as the previous one, and has been running for 20+ hours now.

I checked the files in ...\projects\wuprop.boinc-af.org\:
cache is updated every minute;
wu_v4_1379769294_106959_0_0 is not updated every minute, but has a time stamp of about 30 min. ago.

I also looked into the ...\slots\ directory:
checkpoint is updated every minute;
boinc_task_state.xml is updated almost every minute. Sometimes it is skipped for 1 or 2 minutes;
stderr.txt has a time stamp of earlier this morning. If I open the file, I see the messages ((4916): facteur correction: 0.500000) are updated every minute, and the last message has the current time stamp. When I close the file, it has the current time stamp. (I think this is one of those strange Windows behaviors).
All other files in ...\slots\ have a time stamp of earlier this morning, about an hour ago. (Same as the first message in the stderr.txt file.)

Hope this gives you some more information.
Thanks again,
Ruud

Profile BilBg
Joined: 20 Jun 12
Posts: 63
Credit: 94,685
RAC: 0
Total hours: 108,788
Message 1710 - Posted: 24 Sep 2013, 13:02:27 UTC - in response to Message 1709.


If the files are being updated, the app (data_collect_v4_4.08_windows_intelx86__nci.exe) is not stuck or hung.

Copy those files in case the project admin wants to know what is in them.
(all are text files and can be opened with Notepad. If they look strange (no line-breaks) try WordPad. I use F3 View in Total Commander)


Check the Network activity setting in the Activity menu (try setting it to 'Network activity always available')
http://boinc.berkeley.edu/wiki/Advanced_view#BOINC_Manager_Menus

Look in the Event Log: do you see the following messages appear every hour?
24/09/2013 14:59:14 WUProp@Home Sending scheduler request: Requested by project.
24/09/2013 14:59:14 WUProp@Home Not reporting or requesting tasks
24/09/2013 14:59:16 WUProp@Home Scheduler request completed
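
If you save the Event Log to a text file, the hourly check can be done roughly like this. It is only a sketch under assumptions: it expects lines that start with a `dd/mm/yyyy hh:mm:ss` time stamp as in the sample above (the date format in your log may differ by locale), and the function names are made up for illustration.

```python
from datetime import datetime


def scheduler_request_times(log_lines, project="WUProp@Home"):
    """Collect the time stamps of 'Sending scheduler request' lines."""
    times = []
    for line in log_lines:
        if project in line and "Sending scheduler request" in line:
            # Assumed layout: date and time are the first two fields.
            stamp = " ".join(line.split()[:2])
            times.append(datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"))
    return times


def gaps_in_minutes(times):
    """Minutes between consecutive scheduler requests."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


sample = [
    "24/09/2013 14:59:14 WUProp@Home Sending scheduler request: Requested by project.",
    "24/09/2013 14:59:14 WUProp@Home Not reporting or requesting tasks",
    "24/09/2013 14:59:16 WUProp@Home Scheduler request completed",
]
print(scheduler_request_times(sample))
```

With a longer log, gaps of about 60 minutes from `gaps_in_minutes` would suggest the hourly requests are going out.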


If the Network activity is set correctly I am out of ideas.
We may need the project admin to look at this case.


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Profile [AF>WildWildWest] Sebastien
     
Dictator
Joined: 28 Mar 10
Posts: 2692
Credit: 516,927
RAC: 94
Total hours: 1,446,872
Message 1711 - Posted: 24 Sep 2013, 16:47:19 UTC - in response to Message 1709.

If the problem occurs again, can you post the contents of the file named wu_v4_{some-numbers}?

____________

Profile BilBg
Joined: 20 Jun 12
Posts: 63
Credit: 94,685
RAC: 0
Total hours: 108,788
Message 1713 - Posted: 25 Sep 2013, 20:13:46 UTC - in response to Message 1711.


He aborted again!
http://wuprop.boinc-af.org/result.php?resultid=33135107

Can you save the uploaded wu_v4_1379769294_73193_0 from the server, or has it already been purged?


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

p3d-cluster
 
Joined: 17 Mar 13
Posts: 1
Credit: 40,328
RAC: 0
Total hours: 135,507
Message 1733 - Posted: 5 Oct 2013, 17:34:14 UTC

I've got another problem.
The download was OK and the task began to run:
16.09.2013 01:25:20 | WUProp@Home | Scheduler request completed: got 1 new tasks
16.09.2013 01:25:22 | WUProp@Home | Started download of data_collect_v4_4.00_windows_intelx86__nci.exe
16.09.2013 01:25:24 | WUProp@Home | Finished download of data_collect_v4_4.00_windows_intelx86__nci.exe
16.09.2013 01:25:24 | WUProp@Home | Starting task wu_v4_1379273943_1285_0 using data_collect_v4 version 400 (nci) in slot 7

28.09.2013 07:19:22 | WUProp@Home | Result wu_v4_1379273943_1285_0 is no longer usable
28.09.2013 07:19:24 | WUProp@Home | Computation for task wu_v4_1379273943_1285_0 finished
28.09.2013 07:19:32 | WUProp@Home | Sending scheduler request: To report completed tasks.

Runtime: ~294 h,
but not a single minute was credited to my account.
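
For reference, the ~294 h figure can be reproduced from the start and finish time stamps in the log lines above. A small Python sketch (the helper name `hours_between` is made up for illustration; the format string assumes the `dd.mm.yyyy hh:mm:ss` stamps shown in this log):

```python
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"  # time stamp layout used in the log above


def hours_between(start, end, fmt=FMT):
    """Elapsed hours between two log time stamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


# "Starting task" line vs. "Computation for task ... finished" line
runtime = hours_between("16.09.2013 01:25:24", "28.09.2013 07:19:24")
print(f"{runtime:.1f} h")  # prints "293.9 h"
```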


28.09.2013 07:19:32 | WUProp@Home | Reporting 1 completed tasks
28.09.2013 07:19:32 | WUProp@Home | Requesting new tasks for CPU
28.09.2013 07:19:37 | WUProp@Home | Scheduler request completed: got 1 new tasks
28.09.2013 07:19:39 | WUProp@Home | Started download of data_collect_v4_4.08_windows_x86_64__nci.exe
28.09.2013 07:19:41 | WUProp@Home | Finished download of data_collect_v4_4.08_windows_x86_64__nci.exe
28.09.2013 07:19:41 | WUProp@Home | Starting task wu_v4_1379769294_270958_0 using data_collect_v4 version 408 (nci) in slot 8

The new task has now been running for ~180 h.

Profile BilBg
Joined: 20 Jun 12
Posts: 63
Credit: 94,685
RAC: 0
Total hours: 108,788
Message 1738 - Posted: 6 Oct 2013, 5:38:28 UTC - in response to Message 1733.


What you can do to help is already posted.


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)




Copyright © 2024 Sebastien