Run on tasks
Author | Message |
---|---|
Joined: 13 Apr 10 Posts: 4 Credit: 395,722 RAC: 0 |
All of the tasks on one set of machines seem to have run well past the 12 hours that has been typical recently. All of them are over 2.5 days of wall time, and some are over 6 days. BOINC will not upload and report until they stop running, and it will not contact the project web site until they stop running. Somewhere the internal stop signal seems to have gone missing. |
Joined: 28 Mar 10 Posts: 2869 Credit: 538,385 RAC: 134 |
It seems there is a problem with version 6.11.x. I'm on holiday, so I can't compile a new application. I will release new applications in August. |
Joined: 7 Apr 10 Posts: 1 Credit: 319,330 RAC: 12 |
Hello, I have the same problem with 6.11.4 on Win XP (32-bit). The progress bar stays at 0% the whole time. |
Joined: 25 Jul 10 Posts: 13 Credit: 33,946 RAC: 0 |
Sebastien, I would think that, as 6.11.x is still in beta testing, it might be prudent to continue with the current apps until 6.11.x is released to the public. Designing apps to run on the beta may be a waste of time, since many things will change as the bugs are worked out. I know many users like to keep up with the current upgrades, but you may be asking for trouble (as is stated in the disclaimer). I would prefer to have work for my computer and wait until the beta is released to the public to see if any changes are needed. Just my $.02 worth |
Joined: 6 Apr 10 Posts: 41 Credit: 471,539 RAC: 0 |
"It seems there is a problem with version 6.11.x." Or you could stop sending work just to the 6.11.x clients... |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
I've just hit a similar problem with the DC 2.31 task wu_1295715071_80772_0. Its elapsed time is 68:25:24, and its progress is 34.482% and does not seem to increase. It is consuming (some) CPU cycles: it has used 9 CPU seconds (other tasks on the machine used around 23-25 seconds for 100%, so that seems to correspond to the progress). It has network connections open, but it does not seem willing to go home after the usual 12 hours... Peter |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
No progress in the next 15-16 hours... I tried to abort the task (simplified log; the new task's download and start are removed):
12:01:01 | task wu_1295715071_80772_0 aborted by user
12:01:01 | [task] task_state=ABORT_PENDING for wu_1295715071_80772_0 from abort_task
12:01:01 | [sched_op] Reason: Unrecoverable error for task wu_1295715071_80772_0 (aborted by user)
12:01:01 | [task] result state=COMPUTE_ERROR for wu_1295715071_80772_0 from CS::report_result_error
12:01:01 | [task] result state=ABORTED for wu_1295715071_80772_0 from abort_task
Upon updating the project, something went wrong:
12:01:28 | update requested by user
12:01:31 | Sending scheduler request: Requested by user.
12:01:31 | Reporting 1 completed tasks
12:01:33 | [sched_op] Server version 611
12:01:33 | [sched_op] handle_scheduler_reply(): got ack for task wu_1295715071_80772_0
12:01:33 | [error] garbage_collect(); still have active task for acked result wu_1295715071_80772_0; state 5
12:01:33 | [task] task_state=ABORTED for wu_1295715071_80772_0 from abort_task
12:01:33 | [task] result state=ABORTED for wu_1295715071_80772_0 from abort_task
12:01:34 | Computation for task wu_1295715071_80772_0 finished
12:01:34 | [task] result state=COMPUTE_ERROR for wu_1295715071_80772_0 from CS::app_finished
12:01:37 | Started upload of wu_1295715071_80772_0_0
12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772
12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772_0_0
The task data still exists. I found out that it could not be deleted because the task's process still exists and holds open handles on the files. But the files were already gone from client_state.xml, so it was enough to kill the process and delete the files. Only a reboot would heal it without manual intervention :-( Peter |
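The leftover files named in the log above can be picked out mechanically. Here is a minimal Python sketch, assuming the client log contains "[error] Couldn't delete file" lines in the form quoted in this post. The function name `find_undeleted_files` is made up for this example and is not part of BOINC.

```python
import re

# Matches the "[error] Couldn't delete file <path>" lines quoted in the post.
DELETE_ERROR = re.compile(r"\[error\] Couldn't delete file (\S+)")


def find_undeleted_files(log_lines):
    """Return the file paths the client reported it could not delete."""
    paths = []
    for line in log_lines:
        match = DELETE_ERROR.search(line)
        if match:
            paths.append(match.group(1))
    return paths


# Three lines copied from the log excerpt in the post.
sample_log = [
    "12:01:37 | Started upload of wu_1295715071_80772_0_0",
    "12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772",
    "12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772_0_0",
]

for path in find_undeleted_files(sample_log):
    print(path)
```

The sketch only lists the paths. Whether a file is safe to remove still depends on the stuck process having been killed first, as described in the post.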
Joined: 6 Oct 10 Posts: 1 Credit: 113,850 RAC: 0 |
Same for me. I had a task going for over 34 hours that I aborted.
2/4/2011 11:27:22 AM WUProp@Home task wu_1295715071_102159_0 aborted by user
2/4/2011 11:27:43 AM WUProp@Home update requested by user
2/4/2011 11:27:47 AM WUProp@Home Sending scheduler request: Requested by user.
2/4/2011 11:27:47 AM WUProp@Home Reporting 1 completed tasks, requesting new tasks for CPU
2/4/2011 11:27:51 AM WUProp@Home Scheduler request completed: got 1 new tasks
2/4/2011 11:27:51 AM WUProp@Home [error] garbage_collect(); still have active task for acked result wu_1295715071_102159_0; state 5
2/4/2011 11:27:52 AM WUProp@Home Computation for task wu_1295715071_102159_0 finished
2/4/2011 11:27:54 AM WUProp@Home Started download of wu_1295715071_115172
2/4/2011 11:27:55 AM WUProp@Home Finished download of wu_1295715071_115172
2/4/2011 11:27:55 AM WUProp@Home Started upload of wu_1295715071_102159_0_0
2/4/2011 11:27:55 AM WUProp@Home Starting wu_1295715071_115172_0
2/4/2011 11:27:55 AM WUProp@Home Starting task wu_1295715071_115172_0 using data_collect version 231
2/4/2011 11:27:56 AM WUProp@Home Finished upload of wu_1295715071_102159_0_0
2/4/2011 11:28:02 AM WUProp@Home [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_102159_0_0
2/4/2011 11:28:07 AM WUProp@Home [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_102159 |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
A few minutes ago my system was so busy that all BOINC tasks restarted. Well, nearly all: a DC 2.31 task, wu_1295715071_124441_0 (I cannot add its URL /wuprop.boinc-af.org/result.php?resultid=2048362 because Akismet kicks in), stayed. Then I noticed it was already more than two days old and was not progressing at all (stuck at 53.793%). After it started on 06.02.2011 at 0:03:55, its last sign of life was on 06.02.2011 at 6:29:12. After that nothing more: no use of CPU cycles, no context switches...
06.02.2011 0:03:55 | WUProp@Home | Starting task wu_1295715071_124441_0 using data_collect version 231
06.02.2011 0:03:58 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
......
06.02.2011 6:29:12 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
I had to kill and restart it manually. Peter |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
One more DC 2.31 task, wu_1295715071_147548_0 (I cannot add its URL /wuprop.boinc-af.org/result.php?resultid=2072478 because of Akismet), has lost its sense of time. It is more than one day old and is not progressing (stuck at 43.448%).
08.02.2011 15:36:53 | WUProp@Home | Starting task wu_1295715071_147548_0 using data_collect version 231
08.02.2011 15:36:57 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
......
08.02.2011 20:47:01 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
Killed. It seems to me that this happens more than "occasionally"... Peter |
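In both of these cases the task simply stopped writing checkpoint lines, so a stall can be spotted from the log alone. Here is a rough Python sketch, assuming the client logs "[task] result ... checkpointed" lines with dd.mm.yyyy timestamps as in the excerpts above. The function names, the 12-hour threshold (taken from the "usual 12 hours" mentioned earlier in the thread) and the explicit `now` argument are choices made for this example, not anything BOINC provides.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 08.02.2011 20:47:01 | WUProp@Home | [task] result <name> checkpointed
CHECKPOINT = re.compile(
    r"^(\d{2}\.\d{2}\.\d{4} \d{1,2}:\d{2}:\d{2}) \| .*? \| "
    r"\[task\] result (\S+) checkpointed"
)


def last_checkpoints(log_lines):
    """Map each task name to the time of its most recent checkpoint line."""
    latest = {}
    for line in log_lines:
        match = CHECKPOINT.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%d.%m.%Y %H:%M:%S")
            task = match.group(2)
            if task not in latest or stamp > latest[task]:
                latest[task] = stamp
    return latest


def stalled_tasks(log_lines, now, max_silence=timedelta(hours=12)):
    """Return the tasks whose last checkpoint is older than max_silence."""
    return sorted(
        task
        for task, stamp in last_checkpoints(log_lines).items()
        if now - stamp > max_silence
    )


# Lines copied from the log excerpt in the post.
sample_log = [
    "08.02.2011 15:36:53 | WUProp@Home | Starting task wu_1295715071_147548_0 using data_collect version 231",
    "08.02.2011 15:36:57 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed",
    "08.02.2011 20:47:01 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed",
]

# Hypothetical "current time": one day after the last checkpoint.
now = datetime(2011, 2, 9, 20, 47, 1)
print(stalled_tasks(sample_log, now))
```

A task that has finished normally also stops checkpointing, so in practice the result would need to be cross-checked against the list of tasks that are still running.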
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
I've noticed that when I abort such a "sleepy" task, its output merely says "aborted by user". For the last two tasks I killed the processes instead, and their existing output remained untouched... voilà! Yesterday's one: wu_1295715071_124441_0 Workunit 1994204 Task 2048362 wrote:
and today's one: wu_1295715071_147548_0 Workunit 2017311 Task 2072478 wrote:
In both cases the code apparently crashed at the same instruction. Devs?? Peter |
Joined: 28 Mar 10 Posts: 2869 Credit: 538,385 RAC: 134 |
I released a new application which should solve the problem. |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
"I released a new application which should solve the problem." Thanks, I'm looking forward to it! I'll keep an eye on it ,-) Peter |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
"I released a new application which should solve the problem." In the evening, after your post, I noticed that my other host had already grabbed the newer 2.33, but the host where I've been observing the problems still had a 2.31 task, which had to lock up (at 79.310%, 20:27:40 elapsed) to bid me farewell :-D OK, I've thrown it away and grabbed a shiny new 2.33 one. I wish it all the best... Peter |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
"I released a new application which should solve the problem." Unfortunately... no, it does not seem to have helped much; the 2.33 task locked up too (at 51.724%, 25:40:31 elapsed): wu_1295715071_147548_0 Workunit 2017311 Task 2072478 wrote:
Note that the reason for AccVio seems to be different now... Peter |
Joined: 28 Mar 10 Posts: 2869 Credit: 538,385 RAC: 134 |
I released version 2.34. The workunit should finish when the error "erreur chargement file_transfer" occurs. |
Joined: 28 Mar 10 Posts: 12 Credit: 333,647 RAC: 45 |
My PC can access the internet only once a day. Is it possible to add a project preference for how long WUs should run, e.g. 6, 12, 18, or 24 hours? |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
"I released version 2.34." Possibly... But what about "Start-end tags mismatch"? The 2.34 task locked up too (at 43.448%, now 46:43:48 elapsed): wu_1297456788_5943_0 Workunit 2051979 Task 2108419 wrote:
Note that the reason for AccVio seems to match the 2.33 version. Peter |
Joined: 10 May 10 Posts: 15 Credit: 55,797 RAC: 0 |
"I released version 2.34." Now a 2.35 task has locked up (at 11.034%, 10:09:11 elapsed). The <stderr_txt> seems to be terminated by a large block of character data: an unterminated list of other projects' transfers, from <boinc_gui_rpc_reply>. Some unterminated string or memory overflow just at the crash? (The block seems to be just a bit more than 4 kB, the usual file transfer block...) wu_1297456788_26894_0 Workunit 2072930 Task 2130214 wrote:
The reason for the AccVio still matches the 2.33 and 2.34 versions. (Because of Akismet, I had to modify all URLs.) Peter |
©2024 Sébastien