Author |
Message |
|
All of the tasks on one set of machines seem to have run well past the 12 hours that has been typical recently. All of them are at over 2.5 days of wall time, and some are at over 6 days.
BOINC will not upload and report until they stop running. BOINC will not contact the project web site until they stop running. Somewhere the internal stop signal seems to have gone missing. |
|
|
|
It seems there is a problem with version 6.11.x.
I'm on holiday, so I can't compile a new application.
I will release new applications in August.
____________
|
|
|
|
Hello, I have the same problem with 6.11.4 on Win XP (32-bit). The progress bar stays at 0% all the time. |
|
|
|
Sebastien,
I would think that, as 6.11.x is in beta testing, it might be prudent to continue with the current apps until 6.11.x is released to the public. Designing apps to run on the beta may be a waste of time, since many things will change as the bugs are worked out. I know many users like to keep up with the current upgrades, but doing so can be asking for trouble (as stated in the disclaimer). I would prefer to have work for my computer and wait until the beta is released to the public to see if any changes are needed. Just my $.02 worth |
|
|
|
It seems there is a problem with version 6.11.x.
Or you could stop sending work just to the 6.11.x clients...
____________
|
|
|
|
I've just had a similar problem with DC 2.31 task wu_1295715071_80772_0. Its elapsed time is 68:25:24, its progress is 34.482%, and it does not seem to increment. It is consuming (some) CPU cycles: it has used 9 CPU seconds (other tasks on the machine used around 23-25 seconds for 100%, which seems to correspond to the progress). It has network connections open, but does not seem willing to go home after the usual 12 hours...
Peter |
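The CPU-time comparison above can be checked with a little arithmetic. This is only a sketch in Python; it assumes CPU use grows linearly with progress, which the post suggests but does not confirm, and `expected_cpu_seconds` is a name made up for the example.

```python
def expected_cpu_seconds(progress_percent, full_run_cpu_seconds):
    """CPU time a task would have used at this progress,
    ASSUMING CPU use grows linearly with reported progress."""
    return full_run_cpu_seconds * progress_percent / 100.0

# Figures from the post: 34.482% progress, 23-25 CPU seconds for a full run.
low = expected_cpu_seconds(34.482, 23)
high = expected_cpu_seconds(34.482, 25)
print(f"expected {low:.1f}-{high:.1f} s, observed 9 s")
# prints: expected 7.9-8.6 s, observed 9 s
```

So the 9 CPU seconds observed are roughly what a task at about one third of its run would have used, which fits the idea that the task stopped working at that point rather than spinning.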
|
|
|
No progress in the next 15-16 hours... I tried to abort the task (simplified log; I removed the new task's download and start):
12:01:01 | task wu_1295715071_80772_0 aborted by user
12:01:01 | [task] task_state=ABORT_PENDING for wu_1295715071_80772_0 from abort_task
12:01:01 | [sched_op] Reason: Unrecoverable error for task wu_1295715071_80772_0 (aborted by user)
12:01:01 | [task] result state=COMPUTE_ERROR for wu_1295715071_80772_0 from CS::report_result_error
12:01:01 | [task] result state=ABORTED for wu_1295715071_80772_0 from abort_task
Upon updating the project, something went wrong:
12:01:28 | update requested by user
12:01:31 | Sending scheduler request: Requested by user.
12:01:31 | Reporting 1 completed tasks
12:01:33 | [sched_op] Server version 611
12:01:33 | [sched_op] handle_scheduler_reply(): got ack for task wu_1295715071_80772_0
12:01:33 | [error] garbage_collect(); still have active task for acked result wu_1295715071_80772_0; state 5
12:01:33 | [task] task_state=ABORTED for wu_1295715071_80772_0 from abort_task
12:01:33 | [task] result state=ABORTED for wu_1295715071_80772_0 from abort_task
12:01:34 | Computation for task wu_1295715071_80772_0 finished
12:01:34 | [task] result state=COMPUTE_ERROR for wu_1295715071_80772_0 from CS::app_finished
12:01:37 | Started upload of wu_1295715071_80772_0_0
12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772
12:01:38 | [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_80772_0_0
The task data still existed. I found out that it could not be deleted because the task's process also still existed and held open handles on the files. But the files were already gone from client_state.xml, so it was enough to kill the process and delete the files.
Otherwise only a reboot would have cleaned it up without manual intervention :-(
Peter |
|
|
|
Same for me. I had a task going for over 34 hours that I aborted.
2/4/2011 11:27:22 AM WUProp@Home task wu_1295715071_102159_0 aborted by user
2/4/2011 11:27:43 AM WUProp@Home update requested by user
2/4/2011 11:27:47 AM WUProp@Home Sending scheduler request: Requested by user.
2/4/2011 11:27:47 AM WUProp@Home Reporting 1 completed tasks, requesting new tasks for CPU
2/4/2011 11:27:51 AM WUProp@Home Scheduler request completed: got 1 new tasks
2/4/2011 11:27:51 AM WUProp@Home [error] garbage_collect(); still have active task for acked result wu_1295715071_102159_0; state 5
2/4/2011 11:27:52 AM WUProp@Home Computation for task wu_1295715071_102159_0 finished
2/4/2011 11:27:54 AM WUProp@Home Started download of wu_1295715071_115172
2/4/2011 11:27:55 AM WUProp@Home Finished download of wu_1295715071_115172
2/4/2011 11:27:55 AM WUProp@Home Started upload of wu_1295715071_102159_0_0
2/4/2011 11:27:55 AM WUProp@Home Starting wu_1295715071_115172_0
2/4/2011 11:27:55 AM WUProp@Home Starting task wu_1295715071_115172_0 using data_collect version 231
2/4/2011 11:27:56 AM WUProp@Home Finished upload of wu_1295715071_102159_0_0
2/4/2011 11:28:02 AM WUProp@Home [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_102159_0_0
2/4/2011 11:28:07 AM WUProp@Home [error] Couldn't delete file projects/wuprop.boinc-af.org/wu_1295715071_102159
|
|
|
|
A few minutes ago my system was so busy that all BOINC tasks restarted. Well, nearly all: a DC 2.31 task wu_1295715071_124441_0 (I cannot add its URL /wuprop.boinc-af.org/result.php?resultid=2048362 because Akismet kicks in) stayed. Then I noticed it was already more than two days old and was not progressing at all (stuck at 53.793%).
After being started on 06.02.2011 at 0:03:55, it showed its last sign of life on 06.02.2011 at 6:29:12. And nothing more, no use of CPU cycles or context switches...
06.02.2011 0:03:55 | WUProp@Home | Starting task wu_1295715071_124441_0 using data_collect version 231
06.02.2011 0:03:58 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
......
06.02.2011 6:29:12 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
I had to kill and restart it manually.
Peter |
|
|
|
One more DC 2.31 task (wu_1295715071_147548_0; I cannot add its URL /wuprop.boinc-af.org/result.php?resultid=2072478 because of Akismet) has lost its sense of time. It is more than one day old and is not progressing (stuck at 43.448%).
08.02.2011 15:36:53 | WUProp@Home | Starting task wu_1295715071_147548_0 using data_collect version 231
08.02.2011 15:36:57 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
......
08.02.2011 20:47:01 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
Killed.
It seems to me that it happens more than "occasionally"...
Peter |
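Stalled tasks like the ones above could be spotted from the event log by looking for tasks whose last "checkpointed" message is old. The following Python is a rough sketch, not part of BOINC or WUProp: it assumes the `[task]` log flag is enabled, that timestamps are day.month.year as in the excerpts above, and that one hour without a checkpoint is suspicious. The function names and the `now` value are made up for the example.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 08.02.2011 20:47:01 | WUProp@Home | [task] result wu_..._0 checkpointed
CHECKPOINT_RE = re.compile(
    r"^(\d{1,2}\.\d{1,2}\.\d{4} \d{1,2}:\d{2}:\d{2}) \| [^|]+ \| "
    r"\[task\] result (\S+) checkpointed$"
)

def last_checkpoints(log_lines):
    """Return {task_name: datetime of its most recent checkpoint message}."""
    latest = {}
    for line in log_lines:
        match = CHECKPOINT_RE.match(line.strip())
        if not match:
            continue  # not a checkpoint message
        stamp = datetime.strptime(match.group(1), "%d.%m.%Y %H:%M:%S")
        name = match.group(2)
        if name not in latest or stamp > latest[name]:
            latest[name] = stamp
    return latest

def stalled_tasks(log_lines, now, max_silence=timedelta(hours=1)):
    """Names of tasks whose last checkpoint is older than max_silence (assumed threshold)."""
    return sorted(
        name
        for name, stamp in last_checkpoints(log_lines).items()
        if now - stamp > max_silence
    )

sample = [
    "08.02.2011 15:36:53 | WUProp@Home | Starting task wu_1295715071_147548_0 using data_collect version 231",
    "08.02.2011 15:36:57 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed",
    "08.02.2011 20:47:01 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed",
]
now = datetime(2011, 2, 9, 16, 0, 0)  # hypothetical "current time" for the example
print(stalled_tasks(sample, now))
```

A task that has already finished would also show an old last checkpoint, so in practice the result would need to be cross-checked against the list of tasks that are still running.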
|
|
|
I've noticed that when I abort such "sleepy" tasks, their output merely says "aborted by user".
But for the last two tasks I killed the processes instead, and their existing output remained untouched... voilà!
Yesterday's one:
wu_1295715071_124441_0 Workunit 1994204 Task 2048362 wrote:
06.02.2011 0:03:55 | WUProp@Home | Starting task wu_1295715071_124441_0 using data_collect version 231
06.02.2011 0:03:58 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
......
06.02.2011 6:29:12 | WUProp@Home | [task] result wu_1295715071_124441_0 checkpointed
A hard crash a few seconds later:
<message>
Nesprávna funkcia. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Initialisation
00:03:56 (16912): connection socket active_results
00:03:56 (16912): connection socket state
00:03:56 (16912): connection socket file_transfer
06:29:10 (16912): Timeout reception.
06:29:10 (16912): data incomplete.
06:29:21 (16912): erreur chargement file_transfer
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004373C1 read attempt to address 0x05BE8A25
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
and today's one:
wu_1295715071_147548_0 Workunit 2017311 Task 2072478 wrote:
08.02.2011 15:36:53 | WUProp@Home | Starting task wu_1295715071_147548_0 using data_collect version 231
08.02.2011 15:36:57 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
......
08.02.2011 20:47:01 | WUProp@Home | [task] result wu_1295715071_147548_0 checkpointed
A hard crash a few minutes later:
<message>
Nesprávna funkcia. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Initialisation
15:36:54 (18752): connection socket active_results
15:36:54 (18752): connection socket state
15:36:54 (18752): connection socket file_transfer
16:54:21 (18752): Timeout reception.
16:54:21 (18752): data incomplete.
18:47:16 (18752): Timeout reception.
18:47:16 (18752): data incomplete.
20:40:27 (18752): Timeout reception.
20:40:27 (18752): data incomplete.
20:49:39 (18752): Timeout reception.
20:49:39 (18752): data incomplete.
20:50:34 (18752): erreur chargement file_transfer
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004373C1 read attempt to address 0x89EE6166
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
In both cases the code apparently crashed at the same instruction. Devs??
Peter |
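The "same instruction" observation can be checked mechanically by pulling the two addresses out of each crash record and grouping reports by the faulting instruction address. This Python sketch only handles the "read attempt" wording seen in the reports above; the function name is made up for the example.

```python
import re
from collections import defaultdict

# Matches the record line printed in stderr_txt, for example:
# Reason: Access Violation (0xc0000005) at address 0x004373C1 read attempt to address 0x05BE8A25
AV_RE = re.compile(
    r"Access Violation \(0x[0-9A-Fa-f]+\) at address (0x[0-9A-Fa-f]+) "
    r"read attempt to address (0x[0-9A-Fa-f]+)"
)

def group_by_instruction(stderr_by_task):
    """Map faulting instruction address -> list of (task, address it tried to read)."""
    groups = defaultdict(list)
    for task, text in stderr_by_task.items():
        for instruction, target in AV_RE.findall(text):
            groups[instruction.lower()].append((task, target.lower()))
    return dict(groups)

reports = {
    "wu_1295715071_124441_0": "Reason: Access Violation (0xc0000005) at address 0x004373C1 read attempt to address 0x05BE8A25",
    "wu_1295715071_147548_0": "Reason: Access Violation (0xc0000005) at address 0x004373C1 read attempt to address 0x89EE6166",
}
groups = group_by_instruction(reports)
for instruction, hits in groups.items():
    print(instruction, len(hits), "crash(es)")
```

Both reports land in a single group for 0x004373C1, while the addresses being read differ, which is consistent with one code location dereferencing a bad pointer. Instruction addresses are only comparable between tasks run by the same application build.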
|
|
|
I released a new application which should solve the problem.
____________
|
|
|
|
I released a new application which should solve the problem.
Thanks, I'm looking forward to it!
I'll keep an eye on it ,-)
Peter |
|
|
|
I released a new application which should solve the problem.
Thanks, I'm looking forward to it!
I'll keep an eye on it ,-)
In the evening, after your post, I noticed that my other host had already grabbed the newer 2.33, but the host where I had been observing the problems still had a 2.31 task, which had to lock up (at 79.310%, 20:27:40 elapsed) to bid me farewell :-D
OK, I've thrown it away and grabbed a shiny new 2.33 one. I wish it all the best...
Peter |
|
|
|
I released a new application which should solve the problem.
Thanks, I'm looking forward to it!
I'll keep an eye on it ,-)
OK, I've [...] grabbed a shiny new 2.33 one. I wish it all the best...
Unfortunately... no, it does not seem to have helped much; the 2.33 task locked up too (at 51.724%, 25:40:31 elapsed):
wu_1295715071_147548_0 Workunit 2017311 Task 2072478 wrote:
10.02.2011 12:47:55 | WUProp@Home | Starting task wu_1295715071_163902_0 using data_collect version 233
10.02.2011 12:47:57 | WUProp@Home | [task] result wu_1295715071_163902_0 checkpointed
......
10.02.2011 18:58:57 | WUProp@Home | [task] result wu_1295715071_163902_0 checkpointed
A hard crash a few minutes later:
<message>
Nesprávna funkcia. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Initialisation
12:47:56 (16420): connection socket active_results
12:47:56 (16420): connection socket state
12:47:56 (16420): connection socket file_transfer
12:47:56 (16420): connection socket host_info
18:53:55 (16420): Timeout reception.
18:53:55 (16420): data incomplete.
19:01:49 (16420): erreur chargement file_transfer Start-end tags mismatch
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004209CD read attempt to address 0x160FA9D2
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
Note that the reason for AccVio seems to be different now...
Peter |
|
|
|
I released version 2.34.
The workunit should finish when the error "erreur chargement file_transfer" occurs.
____________
|
|
|
|
My PC can access the internet only once a day.
Is it possible to add a project preference for how long WUs should run, e.g. 6, 12, 18, or 24 hours?
|
|
|
|
I released version 2.34.
The workunit should finish when the error "erreur chargement file_transfer" occurs.
Possibly... But what about "Start-end tags mismatch"? The 2.34 task locked up too (at 43.448%, now 46:43:48 elapsed):
wu_1297456788_5943_0 Workunit 2051979 Task 2108419 wrote:
12.02.2011 14:38:45 | WUProp@Home | Starting task wu_1297456788_5943_0 using data_collect version 234
12.02.2011 14:38:48 | WUProp@Home | [task] result wu_1297456788_5943_0 checkpointed
......
12.02.2011 19:48:51 | WUProp@Home | [task] result wu_1297456788_5943_0 checkpointed
A hard crash a few minutes later:
<message>
Nesprávna funkcia. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Initialisation
14:38:46 (22488): connection socket active_results
14:38:46 (22488): connection socket state
14:38:46 (22488): connection socket file_transfer
14:38:46 (22488): connection socket host_info
16:01:57 (22488): Timeout reception.
16:01:57 (22488): data incomplete.
19:49:47 (22488): Timeout reception.
19:49:47 (22488): data incomplete.
19:49:51 (22488): erreur chargement file_transfer Start-end tags mismatch
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004209CD read attempt to address 0xA2100065
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
Note that the reason for AccVio seems to match the 2.33 version.
Peter |
|
|
|
I released version 2.34.
The workunit should finish when the error "erreur chargement file_transfer" occurs.
Possibly... But what about "Start-end tags mismatch"? The 2.34 task locked up too (at 43.448%, now 46:43:48 elapsed).
Now a 2.35 task has locked up (at 11.034%, 10:09:11 elapsed). The <stderr_txt> seems to end with a large block of character data (an unterminated list of other projects' transfers from <boinc_gui_rpc_reply>). Some unterminated string or memory overflow just at the crash? (The block seems to be just a bit more than 4 kB, the usual file transfer block...)
wu_1297456788_26894_0 Workunit 2072930 Task 2130214 wrote:
15.02.2011 1:25:47 | WUProp@Home | Starting task wu_1297456788_26894_0 using data_collect version 235
15.02.2011 1:26:03 | WUProp@Home | [task] result wu_1297456788_26894_0 checkpointed
15.02.2011 2:36:08 | WUProp@Home | [task] result wu_1297456788_26894_0 checkpointed
......
15.02.2011 2:41:10 | WUProp@Home | [task] result wu_1297456788_26894_0 checkpointed
A hard crash a few seconds later:
<message>
Nesprávna funkcia. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Initialisation
01:26:02 (25160): connection socket active_results
01:26:02 (25160): connection socket state
01:26:02 (25160): connection socket file_transfer
01:26:02 (25160): connection socket host_info
02:41:06 (25160): Timeout reception.
02:41:06 (25160): data incomplete.
02:41:35 (25160): erreur chargement file_transfer Start-end tags mismatch <boinc_gui_rpc_reply>
<file_transfers>
<file_transfer>
<project_url>http.//setiathome.berkeley.edu/</project_url>
<project_name>SETI@home</project_name>
<name>31my10ab.19370.391277.14.10.150</name>
<nbytes>375337.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>0</status>
<persistent_file_xfer>
<num_retries>44</num_retries>
<first_request_time>1296830886.620766</first_request_time>
<next_request_time>1297772809.959487</next_request_time>
<time_so_far>317.368349</time_so_far>
<last_bytes_xferred>52397.000000</last_bytes_xferred>
</persistent_file_xfer>
</file_transfer>
<file_transfer>
<project_url>http.//setiweb.ssl.berkeley.edu/beta/</project_url>
<project_name>SETI@home Beta Test</project_name>
<name>ap_06no10ad_B3_P1_00144_20110123_03759.wu</name>
<nbytes>8392046.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>0</status>
<persistent_file_xfer>
<num_retries>39</num_retries>
<first_request_time>1296921440.453398</first_request_time>
<next_request_time>1297734925.966553</next_request_time>
<time_so_far>313.892300</time_so_far>
<last_bytes_xferred>552109.000000</last_bytes_xferred>
</persistent_file_xfer>
</file_transfer>
<file_transfer>
<project_url>http.//www.worldcommunitygrid.org/</project_url>
<project_name>World Community Grid</project_name>
<name>E201216_364_A.28.C23H13NS3Se.78.0.set1d06_1_4</name>
<nbytes>29968736.000000</nbytes>
<max_nbytes>152428800.000000</max_nbytes>
<status>1</status>
<generated_locally/>
<upload_when_present/>
<persistent_file_xfer>
<num_retries>0</num_retries>
<first_request_time>1297734069.663286</first_request_time>
<next_request_time>1297734069.663286</next_request_time>
<time_so_far>19.809982</time_so_far>
<last_bytes_xferred>2605056.000000</last_bytes_xferred>
</persistent_file_xfer>
<file_xfer>
<bytes_xferred>2637824.000000</bytes_xferred>
<file_offset>0.000000</file_offset>
<xfer_speed>151784.550684</xfer_speed>
<url>https.//cleanenergy.worldcommunitygrid.org/prod/cep2/file_upload_handler</url>
</file_xfer>
</file_transfer>
</file_transfers>
</boinc_gui_rpc_reply>
<boinc_gui_rpc_reply>
<file_transfers>
<file_transfer>
<project_url>http.//setiathome.berkeley.edu/</project_url>
<project_name>SETI@home</project_name>
<name>31my10ab.19370.391277.14.10.150</name>
<nbytes>375337.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>0</status>
<persistent_file_xfer>
<num_retries>44</num_retries>
<first_request_time>1296830886.620766</first_request_time>
<next_request_time>1297772809.959487</next_request_time>
<time_so_far>317.368349</time_so_far>
<last_bytes_xferred>52397.000000</last_bytes_xferred>
</persistent_file_xfer>
</file_transfer>
<file_transfer>
<project_url>http.//setiweb.ssl.berkeley.edu/beta/</project_url>
<project_name>SETI@home Beta Test</project_name>
<name>ap_06no10ad_B3_P1_00144_20110123_03759.wu</name>
<nbytes>8392046.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>0</status>
<persistent_file_xfer>
<num_retries>39</num_retries>
<first_request_time>1296921440.453398</first_request_time>
<next_request_time>1297734925.966553</next_request_time>
<time_so_far>313.892300</time_so_far>
<last_bytes_xferred>552109.000000</last_bytes_xferred>
</persistent_file_xfer>
</file_transfer>
<file_transfer>
<project_url>http.//www.worldcommunitygrid.org/</project_url>
<project_name>World Community Grid</project_name>
<name>E201216_364_A.28.C23H13NS3Se.78.0.set1d06_1_4</name>
<nbytes>29968736.000000</nbytes>
<max_nbytes>152428800.000000</max_nbytes>
<status>1</status>
<generated_locally/>
<upload_when_present/>
<persistent_file_xfer>
<num_retries>0</num_retries>
<first_request_time>1297734069.663286</first_r
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00412DC7 read attempt to address 0xF8C3A793
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
The reason for the AccVio still matches the 2.33 and 2.34 versions.
(Because of Akismet, I had to modify all URLs.)
Peter |
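The dump above shows one complete <boinc_gui_rpc_reply> followed by a second one cut off mid-tag, which would explain "Start-end tags mismatch" if the whole buffer was handed to the parser. Here is a minimal Python sketch of a more defensive approach, which parses only replies that are fully terminated and sets the unterminated tail aside. It is not how data_collect actually reads its socket (that code is not shown in this thread), and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

CLOSE_TAG = "</boinc_gui_rpc_reply>"

def split_complete_replies(buffer):
    """Return (complete_replies, leftover); leftover is the unterminated tail."""
    replies = []
    rest = buffer
    while True:
        end = rest.find(CLOSE_TAG)
        if end == -1:
            break  # no further complete reply in the buffer
        end += len(CLOSE_TAG)
        replies.append(rest[:end].strip())
        rest = rest[end:]
    return replies, rest.strip()

def transfer_names(reply_xml):
    """File names in one reply, or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(reply_xml)
    except ET.ParseError:
        return None  # report the problem instead of crashing
    return [ft.findtext("name") for ft in root.iter("file_transfer")]

# Shortened stand-in for the buffer in the stderr output above.
buffer = (
    "<boinc_gui_rpc_reply><file_transfers><file_transfer>"
    "<name>31my10ab.19370.391277.14.10.150</name>"
    "</file_transfer></file_transfers></boinc_gui_rpc_reply>\n"
    "<boinc_gui_rpc_reply><file_transfers><file_transfer>"
    "<name>31my10ab.19370.391277.14.10.150</name><first_r"
)
replies, leftover = split_complete_replies(buffer)
names = [transfer_names(reply) for reply in replies]
print(names, "| leftover bytes:", len(leftover))
```

The leftover tail would be kept and prepended to the next chunk read from the socket, so a reply that arrives in two pieces is parsed once it is complete rather than rejected.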
|
|