Why has the project been down for multiple consecutive hours in last number of days?

Message boards : Number crunching : Why has the project been down for multiple consecutive hours in last number of days?

Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9869 - Posted: 26 Mar 2023, 5:50:22 UTC

This was the 4th time the project was down in less than a week for multiple consecutive hours without any explanation.

And now it's back online with all the projects I have worked on in the last day reporting negative hours.

Did the project have another database problem or something else break?

ID: 9869 · Rating: 0
Paul
   

Joined: 9 Feb 14
Posts: 26
Credit: 1,154,782
RAC: 96
Message 9870 - Posted: 26 Mar 2023, 15:28:37 UTC - in response to Message 9869.  
Last modified: 26 Mar 2023, 16:24:30 UTC

My hours are not increasing but no negatives this time.
Edit: Now incrementing.

Better to return to 6 hour work units to reduce server load.

Paul.
ID: 9870 · Rating: 0
Link
 
Joined: 20 Jun 12
Posts: 141
Credit: 342,004
RAC: 64
Message 9905 - Posted: 16 Apr 2023, 14:15:10 UTC
Last modified: 16 Apr 2023, 14:20:17 UTC

Seems to have been down for about 7 hours and this time I have negative hours. I agree, better to return to the 6 hour WUs (or even better: let us choose like Rosetta). That will not only reduce the server load, we will also get a better chance to make it thru the outage without losing any hours.
ID: 9905 · Rating: 0
mikey
     
Joined: 20 May 10
Posts: 552
Credit: 1,901,961
RAC: 792
Message 9906 - Posted: 16 Apr 2023, 20:09:45 UTC - in response to Message 9905.  

Seems to have been down for about 7 hours and this time I have negative hours. I agree, better to return to the 6 hour WUs (or even better: let us choose like Rosetta). That will not only reduce the server load, we will also get a better chance to make it thru the outage without losing any hours.


I like this idea as it might make it a lot better for Android devices with the smaller tasks, while desktops, servers and laptops can easily handle the 6 hour tasks with a lot less bandwidth and wear and tear on the wuprop hardware.

BUT what I'd really like to know is if there is anything we users can do to help alleviate any future unplanned outages, i.e. help with newer and bigger hard drives, more memory for the Server or pc(s) it runs on, more bandwidth etc.
ID: 9906 · Rating: 0
Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9907 - Posted: 19 Apr 2023, 7:59:33 UTC - in response to Message 9906.  
Last modified: 19 Apr 2023, 8:02:29 UTC

Another 6 to 8 hour outage today. The multi-hour outages appear to come in waves, approx. two days apart within a week.
Losing many hours on apps due to the extended outages.
Agreed, time to go back to a 6 hour task across the board.

*** EDIT TO ADD >>> AND THE NEGATIVE HOURS ARE BACK ALSO <<< ***

ID: 9907 · Rating: 0
mikey
     
Joined: 20 May 10
Posts: 552
Credit: 1,901,961
RAC: 792
Message 9908 - Posted: 19 Apr 2023, 11:15:20 UTC - in response to Message 9907.  

Another 6 to 8 hour outage today. The multi-hour outages appear to come in waves, approx. two days apart within a week.
Losing many hours on apps due to the extended outages.
Agreed, time to go back to a 6 hour task across the board.

*** EDIT TO ADD >>> AND THE NEGATIVE HOURS ARE BACK ALSO <<< ***


Another thought I had was to treat the tasks like other Boinc Projects and just send out several at a time, that way if the Server is down we just move to the next task, and then the task after that etc. Boinc uses a first in first out basis for its tasks, which can then be affected by return times, but instead of banging the Server we would just hold the completed tasks and not lose any time until the Server is ready to take them all back again.
ID: 9908 · Rating: 0
Link
 
Joined: 20 Jun 12
Posts: 141
Credit: 342,004
RAC: 64
Message 9909 - Posted: 19 Apr 2023, 16:08:01 UTC - in response to Message 9908.  

Boinc uses a first in first out basis for its tasks, which can then be affected by return times, but instead of banging the Server we would just hold the completed tasks and not lose any time until the Server is ready to take them all back again.

Won't it run them all at once, since they are NCI? Goofyxgrid also sent just one task per app and they were all running concurrently.
ID: 9909 · Rating: 0
Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9910 - Posted: 19 Apr 2023, 16:27:29 UTC - in response to Message 9909.  

Boinc uses a first in first out basis for its tasks, which can then be affected by return times, but instead of banging the Server we would just hold the completed tasks and not lose any time until the Server is ready to take them all back again.

Won't it run them all at once, since they are NCI? Goofyxgrid also sent just one task per app and they were all running concurrently.

More than likely. I don't think BOINC NCI works that way - all or nothing run.

Every NCI project I have attached to and completed work on has only sent one task per computer, except the old GoofyGrid would send multiples at times, and occasionally I get multiple tasks running at once on iThena's main NCI project.
ID: 9910 · Rating: 0
Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9912 - Posted: 20 Apr 2023, 5:03:42 UTC

And we're back from another multi hour outage along with negative hours for the third day now.....
ID: 9912 · Rating: 0
Werinbert
     

Joined: 9 May 13
Posts: 98
Credit: 762,759
RAC: 280
Message 9913 - Posted: 20 Apr 2023, 5:36:46 UTC

I am happy that we have this project even if it is occasionally intermittent. Complaining each and every time that the server hiccups doesn't do anyone any good. Sebastien needs sleep just like the rest of us, cut him some slack.
ID: 9913 · Rating: 0
mikey
     
Joined: 20 May 10
Posts: 552
Credit: 1,901,961
RAC: 792
Message 9914 - Posted: 20 Apr 2023, 10:24:42 UTC - in response to Message 9913.  

I am happy that we have this project even if it is occasionally intermittent. Complaining each and every time that the server hiccups doesn't do anyone any good. Sebastien needs sleep just like the rest of us, cut him some slack.


I don't think it's the complaining so much as that there's been no explanation of why, or of whether any of us can help solve the problem. I know Sebastien got some helpers a while back when the Project was on the verge of collapsing, does he need more? Does he need some hardware that keeps failing? Is it a software problem? Is it an ISP problem? In short, people are trying to figure out if they can help, but with no word coming from 'the Team', it's kinda hard.
ID: 9914 · Rating: 0
Link
 
Joined: 20 Jun 12
Posts: 141
Credit: 342,004
RAC: 64
Message 9915 - Posted: 20 Apr 2023, 14:27:26 UTC - in response to Message 9913.  
Last modified: 20 Apr 2023, 14:29:22 UTC

Complaining each and every time that the server hiccups doesn't do anyone any good.

We are not complaining, at least I'm not, just reporting an issue, which he might not even notice otherwise if it's always fixing itself after a couple of hours. I don't see anything wrong with reporting bugs or other issues to the admin/developer. In most cases that's how admins/devs get to know there's something wrong at all with their servers/software/whatever; they might not notice the issue from their end without the reports.
ID: 9915 · Rating: 0
Steve Dodd
       

Joined: 28 Jan 13
Posts: 40
Credit: 1,408,400
RAC: 552
Message 9916 - Posted: 20 Apr 2023, 16:09:12 UTC - in response to Message 9869.  

Getting a couple of projects reporting negative hours myself.
ID: 9916 · Rating: 0
mikey
     
Joined: 20 May 10
Posts: 552
Credit: 1,901,961
RAC: 792
Message 9918 - Posted: 20 Apr 2023, 23:18:49 UTC - in response to Message 9916.  

Getting a couple of projects reporting negative hours myself.


I just got a huge update on some of my projects!!
ID: 9918 · Rating: 0
Steve Dodd
       

Joined: 28 Jan 13
Posts: 40
Credit: 1,408,400
RAC: 552
Message 9919 - Posted: 20 Apr 2023, 23:25:43 UTC - in response to Message 9918.  

Same, Mikey
ID: 9919 · Rating: 0
Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9920 - Posted: 21 Apr 2023, 0:25:35 UTC - in response to Message 9918.  

Same here.
It appears the "extra" hours cover about 2.5 to 3 calendar days of back-reported work, based on my "last 24 hours per device" page.
ID: 9920 · Rating: 0
WezH
   

Joined: 8 Oct 12
Posts: 33
Credit: 1,900,596
RAC: 846
Message 9922 - Posted: 21 Apr 2023, 12:19:45 UTC
Last modified: 21 Apr 2023, 12:20:58 UTC

Now it is "Server error: feeder not running"

EDIT: and back online again
ID: 9922 · Rating: 0
Link
 
Joined: 20 Jun 12
Posts: 141
Credit: 342,004
RAC: 64
Message 9940 - Posted: 29 Apr 2023, 11:14:15 UTC - in response to Message 9907.  

Agreed, time to go back to a 6 hour task across the board.

If server load from too many clients connecting to the scheduler is the issue, then perhaps increasing <next_rpc_delay> might help at least a bit. Currently I see this stupid behavior for each WU:

29/04/2023 12:38:19 | WUProp@Home | Sending scheduler request: Requested by project.
29/04/2023 12:38:19 | WUProp@Home | Not requesting tasks: non CPU intensive
29/04/2023 12:38:20 | WUProp@Home | Scheduler request completed
29/04/2023 12:38:31 | WUProp@Home | Computation for task data_collect_v4_1682702101_52217_0 finished
29/04/2023 12:38:33 | WUProp@Home | Started upload of data_collect_v4_1682702101_52217_0_0
29/04/2023 12:38:35 | WUProp@Home | Finished upload of data_collect_v4_1682702101_52217_0_0
29/04/2023 12:38:35 | WUProp@Home | Sending scheduler request: To report completed tasks.
29/04/2023 12:38:35 | WUProp@Home | Reporting 1 completed tasks
29/04/2023 12:38:35 | WUProp@Home | Requesting new tasks for CPU
29/04/2023 12:38:36 | WUProp@Home | Scheduler request completed: got 1 new tasks
29/04/2023 12:38:38 | WUProp@Home | Started download of data_collect_v4_1682702101_43298
29/04/2023 12:38:39 | WUProp@Home | Finished download of data_collect_v4_1682702101_43298
29/04/2023 12:38:39 | WUProp@Home | Starting task data_collect_v4_1682702101_43298_1

The first request is completely unnecessary, and without <report_results_immediately/> in app_config.xml it slows down getting a new WU by a few seconds. To avoid this while still keeping the function of forced scheduler requests, <next_rpc_delay> should be increased from the current 3600 to 3700 seconds.
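
A rough way to see why 3700 would help. This is only a sketch of the reasoning, not a description of how the BOINC client or scheduler is actually implemented. It assumes that every scheduler request restarts the <next_rpc_delay> timer, and that a finished task is reported about 3616 seconds after the previous request (3600 plus the 16 seconds between the forced request at 12:38:19 and the report at 12:38:35 in the log above). The function name and the 3616 figure are illustrative assumptions.

```python
def count_requests_per_work_unit(next_rpc_delay: int, report_after: int) -> int:
    """Count scheduler requests made during one work-unit cycle.

    Assumptions (not taken from BOINC source code):
      - the timer starts at the request that reported the previous task,
      - a forced "Requested by project" contact fires whenever the timer
        reaches next_rpc_delay, and that contact restarts the timer,
      - the finished task is reported report_after seconds into the cycle.
    """
    requests = 0
    next_forced = next_rpc_delay
    # Every forced contact that falls before the report is an extra request.
    while next_forced < report_after:
        requests += 1
        next_forced += next_rpc_delay
    # The request that reports the finished task and fetches the next one.
    return requests + 1


# Assumed cycle length derived from the log timestamps (3600 s + 16 s).
CYCLE_SECONDS = 3616

print(count_requests_per_work_unit(3600, CYCLE_SECONDS))  # 2
print(count_requests_per_work_unit(3700, CYCLE_SECONDS))  # 1
```

Under these assumptions a delay of 3600 gives two contacts per WU and 3700 gives one. If a host took longer than 3700 seconds between reports, the forced request would still fire, so it stays useful as a fallback.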
ID: 9940 · Rating: 0
Dr Who Fan
     
Joined: 29 Jul 11
Posts: 334
Credit: 1,240,257
RAC: 321
Message 9941 - Posted: 29 Apr 2023, 16:26:20 UTC

I see we had two "short" outages in about 48 hours.
WuProp database is back up along with the usual NEGATIVE HOURS
ID: 9941 · Rating: 0
marmot
     
Joined: 13 Dec 15
Posts: 174
Credit: 2,269,998
RAC: 304
Message 9952 - Posted: 2 May 2023, 10:32:38 UTC

I'm just glad the project is still with us.

Is there a Patreon donation link?
ID: 9952 · Rating: 0

©2024 Sébastien