Problem with Linux locking up machines???

Tex1954
 

Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Message 736 - Posted: 17 Nov 2012, 17:16:56 UTC
Last modified: 17 Nov 2012, 17:18:26 UTC

I have a problem running WUProp on a Linux box: it generates compute errors and LOCKS UP my router, clobbering all my boxes on it. I've had this problem for months and blamed Linux, but discovered it doesn't seem to be a generic LINUX problem after all.

Since I took WUProp off the Linux machines, I've had no more problems.

Is something buggy with the Linux version??? It seems related to uncontrolled polling of the LAN, affects all machines at once, and locks them up, generating a Compute Error at the same time (seen after the machines are reset).

This is much more repeatable when running a SINGLE PROJECT with long WUs on one multi-core CPU. In other words, it seems to happen less often when running several different projects with shorter/longer tasks mixed in.

Doesn't matter if one Linux box is running or six... it will lock up my Linksys router, generate a compute error and clobber my LAN, requiring me to reset all the machines. It does this with 2 different routers...

I have no idea if it is something inside Linux that WUProp causes to happen or vice versa...

Who knows, hard to tell when the boxes and LAN are hung up and no way to talk/access their current state without a reset... but after months, removing WUProp cured the problem...

HELP!!!

8-)

Tex1954
Message 737 - Posted: 18 Nov 2012, 18:56:33 UTC
Last modified: 18 Nov 2012, 19:01:36 UTC

I was reading other posts looking for similar problems and wonder if the long run bug or whatever could also be a factor...

In all the other WUs I run, when a WU finishes, it stops running and enters the upload step. If it can't upload or the network is down, what happens in Linux?

As I stated before, when this happens, it clobbers the entire lan and locks up every machine AND I run this on every box... well, not the linux boxes as of a few days ago. So far, everything fine.

But, I can verify that in the past, I dedicated ALL my Linux boxes to ONE project and this lockup problem ALWAYS happened... (Running Optima@home at the time)

I stopped running that project AND Linux on those boxes for a while because of this highly repeatable problem, and only recently, with the new kernel updates, tried Linux again..

I'm running only Asteroids@home now on Linux 64b boxes and it works perfectly for days and days so long as WUProp isn't running! Buggers up if I let WUProp run on those Linux boxes... but NEVER had a problem on Windows 7 64b machines and still don't.


8-)

[AF>WildWildWest] Sébastie...
Project administrator
Joined: 28 Mar 10
Posts: 2875
Credit: 539,231
RAC: 136
Message 738 - Posted: 19 Nov 2012, 17:41:22 UTC

Could you test this application?

For testing the application:

  • Stop BOINC
  • Extract archive in BOINC directory (archive contains application and app_info.xml)
  • Run BOINC
  • Allow work for WuProp



Tex1954
Message 740 - Posted: 21 Nov 2012, 0:53:16 UTC - in response to Message 738.  

I would be happy to test it... but not sure how??? I see it includes an app_info file as well...

Can you give this Linux novice some detailed help on testing it?

Thanks!

8-)

Tex1954
Message 741 - Posted: 21 Nov 2012, 2:54:46 UTC - in response to Message 740.  
Last modified: 21 Nov 2012, 2:55:57 UTC

Umm, I got it figured out... just installed it in the /var/lib/boinc-client/projects/wuprop.boinc-af.org directory...

I'm testing it now... It's on a machine as before, runs one project on all 6 cores and nothing else..

I must admit, took me a bit to chmod the permissions to get access... but no problemo after that.

I'll let you know if I experience more probs...

8-)

Tex1954
Message 742 - Posted: 21 Nov 2012, 5:15:16 UTC - in response to Message 741.  

Normally by now, I would have had some sort of error, but nothing has happened yet.

I installed the files on a second Linux box also running a single project on 6 cores... so far so good.

If it runs well the next couple of days, I would call it good. I can Remote desktop into the computers fine, do LAN I/O fine, all running fine so far.

Crossing fingers!

8-)

Tex1954
Message 743 - Posted: 22 Nov 2012, 13:23:54 UTC - in response to Message 742.  

I don't know what you did, but I haven't had a single problem at all so far and I am positive I would have...

In this instance, maybe it is safe to call it good????

Working perfectly, no hangups or weird LAN polling or anything...

THANKS!


8-)

Tex1954
Message 1083 - Posted: 21 Mar 2013, 18:32:38 UTC

Early this morning, all eight of my Linux boxes crashed taking out several tasks in progress and locking up the network.

It is definitely something that has to do with internet interruptions. The cable service I have goes out once in a while late at night for maintenance, or else just dies...

When the internet connection is lost, WUProp goes nuts, locks up BOINC and then I observed tasks starting and stopping... Nothing else could run until the net came back up, then every box had a WUProp computation error and also many crashed WU's from projects.
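
For anyone trying to catch this in the act, here is a minimal sketch (not part of WUProp or BOINC) that polls DNS from a Linux box and timestamps each result, so a lockup can be lined up against the moment name resolution drops. The hostname is taken from the project directory name mentioned earlier in the thread; the function names `can_resolve` and `watch`, the polling interval, and the use of port 80 are assumptions made for illustration only.

```python
import socket
import time
from datetime import datetime


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves, False on a DNS failure."""
    try:
        resolver(hostname, 80)  # port 80 is an arbitrary choice for the lookup
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def watch(hostname, interval_s=30, checks=3,
          resolver=socket.getaddrinfo, sleep=time.sleep):
    """Poll DNS a few times and return a list of (timestamp, ok) samples."""
    samples = []
    for i in range(checks):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, can_resolve(hostname, resolver)))
        if i < checks - 1:
            sleep(interval_s)  # wait between polls, but not after the last one
    return samples
```

Running `watch("wuprop.boinc-af.org")` in a loop and appending the samples to a file would give a record to compare against the BOINC event log after a crash. It only shows whether name resolution worked at each moment; it says nothing about what caused the outage.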

It didn't matter if the box was on the LAN or wireless and only happened on Linux systems, not the Windows 7 boxes...

This is very annoying... has been happening less often with the last update, but early this morning it really clobbered everything badly.

I thought this was fixed... guess not..

Conan
Joined: 28 Mar 10
Posts: 588
Credit: 1,221,647
RAC: 237
Message 1087 - Posted: 21 Mar 2013, 20:59:58 UTC

G'Day Tex,
Just wondering if updating your kernel to a later version may be of assistance to you working out your problem?
I run Linux (Fedora 16 64 bit) with a kernel of 3.6, whereas yours is version 3.0.
Just a thought as I am not having the issues you are having, I use a Netgear router which seems to work fine with both my Linux and Windows computers.

Conan

Tex1954
Message 1089 - Posted: 21 Mar 2013, 23:00:10 UTC - in response to Message 1087.  

Possibly that may work. The problem is, I'm a total noob to Linux and if it don't install by itself, I have no clue what to do.

To date, I've tried about 25 other different distros and the only one that works on all my boxes is ubuntu... much as I hate it.

I tried Fedora 18 in a virtual box and tried to get their 7.x.x version of the BOINC client going, and no luck there either... but it was close!!!

Arch is one I've been holding back on because it's all manual... but I may try that.

The only common thing on the Linux boxes is the version of Linux and the BOINC client... so maybe it's something there..

8-)

Tex1954
Message 1091 - Posted: 23 Mar 2013, 4:09:37 UTC

I tried several things and was able to get the kernel updated a couple of times, but in both instances it ended up breaking the Nvidia driver, with no cure working.

I tried Fedora 18 and couldn't get any Boinc running.

I tried Arch and it's a major pain in a VM, so gave up on that.

I've tried 25 flavors of Linux 64b and can't find a single one that will work properly...

Soo, I suppose I'm stuck for now... But I did like the new kernels! They fix a lot of bugs, especially with regard to lm-sensors operation and such...

Thing is, I am only TESTING GPUs on Linux for now... so I could update the kernel later maybe... if it will drive the onboard video properly...

Anyways, Linux burns me out... I would PAY someone to make me a custom version in the future if I can't get what I want going properly...

8-)

Tex1954
Message 1093 - Posted: 23 Mar 2013, 19:14:24 UTC
Last modified: 23 Mar 2013, 19:18:46 UTC

Had it happen in front of me today... first one system, then slowly it started to kill the other systems one by one.

Seems there was a project update request that started the trouble, but can't be certain... scenario looks like this...

I'm not sure at this point which machine errored first, I think it was the first set... right after a server update request, it jams up... meanwhile, all other projects working fine until the LAN locks up...

6143 WUProp@Home 3/23/2013 1:05:44 PM Sending scheduler request: Requested by project.
6144 WUProp@Home 3/23/2013 1:05:44 PM Not reporting or requesting tasks
6145 WUProp@Home 3/23/2013 1:05:46 PM Scheduler request completed
6303 WUProp@Home 3/23/2013 1:33:40 PM Computation for task wu_v3_1363211664_405728_0 finished
6310 WUProp@Home 3/23/2013 1:33:42 PM Started upload of wu_v3_1363211664_405728_0_0
6314 WUProp@Home 3/23/2013 1:35:03 PM Temporarily failed upload of wu_v3_1363211664_405728_0_0: can't resolve hostname
6315 WUProp@Home 3/23/2013 1:35:03 PM Backing off 14 min 23 sec on upload of wu_v3_1363211664_405728_0_0

3975 WUProp@Home 3/23/2013 1:28:24 PM Computation for task wu_v3_1363211664_405860_0 finished
3985 WUProp@Home 3/23/2013 1:29:48 PM Temporarily failed upload of wu_v3_1363211664_405860_0_0: can't resolve hostname
3986 WUProp@Home 3/23/2013 1:29:48 PM Backing off 13 min 57 sec on upload of wu_v3_1363211664_405860_0_0
3993 WUProp@Home 3/23/2013 1:31:09 PM Scheduler request failed: Couldn't resolve host name
4012 3/23/2013 1:31:39 PM Project communication failed: attempting access to reference site
4015 3/23/2013 1:33:01 PM BOINC can't access Internet - check network connection or proxy configuration.

Notice the second machine was supposed to back off for 13 minutes but did not! It's like the app itself is forcing communications or something... then it locks up the LAN somehow...
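
The pasted lines can be checked mechanically. The sketch below parses event-log lines in the format shown above and lists any later event from the same project that was logged before an announced upload back-off would have expired. It assumes the "sequence, project, date, time, message" layout and the "Backing off N min M sec" wording seen in this post; `parse` and `events_inside_backoff` are hypothetical helper names, not BOINC functions. It only shows that something was logged inside the window; whether a scheduler request is meant to honor an upload back-off is not something these lines can settle.

```python
import re
from datetime import datetime, timedelta

# seq, optional project name, timestamp, message text
LINE = re.compile(
    r"^(\d+)\s+(?:(\S+)\s+)?"
    r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+(.*)$"
)
BACKOFF = re.compile(r"Backing off (\d+) min (\d+) sec on upload of (\S+)")


def parse(line):
    """Split one log line into (seq, project, when, text), or None if it does not match."""
    m = LINE.match(line.strip())
    if not m:
        return None
    seq, project, stamp, text = m.groups()
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return int(seq), project, when, text


def events_inside_backoff(lines):
    """Return (file, when, text) for same-project events logged inside a back-off window."""
    events = [e for e in map(parse, lines) if e]
    hits = []
    for seq, project, start, text in events:
        m = BACKOFF.search(text)
        if not m:
            continue
        window_end = start + timedelta(minutes=int(m.group(1)), seconds=int(m.group(2)))
        for seq2, project2, when2, text2 in events:
            if project2 == project and seq2 > seq and when2 < window_end:
                hits.append((m.group(3), when2, text2))
    return hits


# Lines copied from the second machine's log above.
SECOND_MACHINE = [
    "3975 WUProp@Home 3/23/2013 1:28:24 PM Computation for task wu_v3_1363211664_405860_0 finished",
    "3985 WUProp@Home 3/23/2013 1:29:48 PM Temporarily failed upload of wu_v3_1363211664_405860_0_0: can't resolve hostname",
    "3986 WUProp@Home 3/23/2013 1:29:48 PM Backing off 13 min 57 sec on upload of wu_v3_1363211664_405860_0_0",
    "3993 WUProp@Home 3/23/2013 1:31:09 PM Scheduler request failed: Couldn't resolve host name",
    "4012 3/23/2013 1:31:39 PM Project communication failed: attempting access to reference site",
    "4015 3/23/2013 1:33:01 PM BOINC can't access Internet - check network connection or proxy configuration.",
]
```

Calling `events_inside_backoff(SECOND_MACHINE)` flags the scheduler request logged at 1:31:09 PM, which falls inside the window opened at 1:29:48 PM.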

Look how they all get computation errors at the same time!

WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405860_0 02:15:55 (00:00:03) 3/23/2013 1:40:38 PM 3/23/2013 1:41:22 PM 0.04 Reported: Computation error (11,) Linux-2600K
WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405850_0 02:20:24 (00:00:06) 3/23/2013 1:40:38 PM 3/23/2013 1:47:54 PM 0.07 Reported: Computation error (11,) Linux-F1
WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405825_0 02:20:55 (00:00:00) 3/23/2013 1:40:38 PM 3/23/2013 1:41:22 PM 0.00 Reported: Computation error (11,) Linux-F12
WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405728_0 02:26:36 (00:00:06) 3/23/2013 1:40:38 PM 3/23/2013 1:42:12 PM 0.07 Reported: Computation error (11,) Linux-F13
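
To make the "same time" observation easy to verify, here is a small sketch that pulls the timestamps out of the four result rows above and groups the hosts by the first timestamp in each row. The rows are copied verbatim from this post; what each column means is not stated in the thread, so the code makes no assumption beyond "first timestamp" and "last token is the host name". `group_by_first_stamp` is a hypothetical helper name.

```python
import re
from collections import defaultdict

STAMP = re.compile(r"\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M")

# Result rows copied from the post above.
ROWS = [
    "WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405860_0 02:15:55 (00:00:03) 3/23/2013 1:40:38 PM 3/23/2013 1:41:22 PM 0.04 Reported: Computation error (11,) Linux-2600K",
    "WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405850_0 02:20:24 (00:00:06) 3/23/2013 1:40:38 PM 3/23/2013 1:47:54 PM 0.07 Reported: Computation error (11,) Linux-F1",
    "WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405825_0 02:20:55 (00:00:00) 3/23/2013 1:40:38 PM 3/23/2013 1:41:22 PM 0.00 Reported: Computation error (11,) Linux-F12",
    "WUProp@Home 3.42 Data collect version 3 (nci) wu_v3_1363211664_405728_0 02:26:36 (00:00:06) 3/23/2013 1:40:38 PM 3/23/2013 1:42:12 PM 0.07 Reported: Computation error (11,) Linux-F13",
]


def group_by_first_stamp(rows):
    """Map the first timestamp found in each row to the list of hosts sharing it."""
    groups = defaultdict(list)
    for row in rows:
        stamps = STAMP.findall(row)
        host = row.split()[-1]  # host name is the last whitespace-separated token
        if stamps:
            groups[stamps[0]].append(host)
    return dict(groups)
```

For these rows, `group_by_first_stamp(ROWS)` puts all four hosts under the single key `3/23/2013 1:40:38 PM`.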

I'll keep checking, and I don't think it's my kernel now, since others run kernels from 2.6 up to 3.4...

8-)

Tex1954
Message 1095 - Posted: 23 Mar 2013, 19:35:21 UTC
Last modified: 23 Mar 2013, 19:36:33 UTC

Not certain if this is the first, but this is how BOINC crashes, restarting tasks..

Notice the network is fine until the WUProp WU completes (it actually crashed with computational error). After the crash, the network is killed... then BOINC goes crazy restarting tasks...

Linux-F13

6296 OPTIMA@HOME 3/23/2013 1:23:03 PM Sending scheduler request: To fetch work.
6297 OPTIMA@HOME 3/23/2013 1:23:03 PM Requesting new tasks for CPU
6298 OPTIMA@HOME 3/23/2013 1:23:05 PM Scheduler request completed: got 0 new tasks
6299 OPTIMA@HOME 3/23/2013 1:23:05 PM No work sent
6300 OPTIMA@HOME 3/23/2013 1:23:05 PM (reached limit of 40 tasks)
6301 OPTIMA@HOME 3/23/2013 1:32:18 PM Sending scheduler request: To fetch work.
6302 OPTIMA@HOME 3/23/2013 1:32:18 PM Requesting new tasks for CPU
6303 WUProp@Home 3/23/2013 1:33:40 PM Computation for task wu_v3_1363211664_405728_0 finished
6304 OPTIMA@HOME 3/23/2013 1:33:40 PM Scheduler request failed: Couldn't resolve host name
6305 Einstein@Home 3/23/2013 1:33:41 PM Task p2030.20121223.G202.81-01.04.C.b5s0g0.00000_3128_1 exited with zero status but no 'finished' file
6306 Einstein@Home 3/23/2013 1:33:41 PM If this happens repeatedly you may need to reset the project.
6307 Einstein@Home 3/23/2013 1:33:41 PM Restarting task p2030.20121223.G202.81-01.04.C.b5s0g0.00000_3128_1 using einsteinbinary_BRP4 version 133
6308 Einstein@Home 3/23/2013 1:33:42 PM Task p2030.20121223.G202.81-01.04.C.b4s0g0.00000_3152_1 exited with zero status but no 'finished' file
6309 Einstein@Home 3/23/2013 1:33:42 PM If this happens repeatedly you may need to reset the project.
6310 WUProp@Home 3/23/2013 1:33:42 PM Started upload of wu_v3_1363211664_405728_0_0
6311 Einstein@Home 3/23/2013 1:33:42 PM Restarting task p2030.20121223.G202.81-01.04.C.b4s0g0.00000_3152_1 using einsteinbinary_BRP4 version 133
6312 OPTIMA@HOME 3/23/2013 1:35:03 PM Task smallexp_s1_ss3_120_2_n58973_0 exited with zero status but no 'finished' file
6313 OPTIMA@HOME 3/23/2013 1:35:03 PM If this happens repeatedly you may need to reset the project.
6314 WUProp@Home 3/23/2013 1:35:03 PM Temporarily failed upload of wu_v3_1363211664_405728_0_0: can't resolve hostname
6315 WUProp@Home 3/23/2013 1:35:03 PM Backing off 14 min 23 sec on upload of wu_v3_1363211664_405728_0_0
6316 OPTIMA@HOME 3/23/2013 1:35:03 PM Restarting task smallexp_s1_ss3_120_2_n58973_0 using smallexp version 103
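
A rough way to see which projects were hit by the restarts is to tally the "exited with zero status but no 'finished' file" messages per project. The sketch below does that for lines in the format of the Linux-F13 log above; it assumes every line of interest carries a project name in the second column, and `restarts_by_project` is a hypothetical helper name. The sample is a subset of the log lines above.

```python
import re
from collections import Counter

# seq, project, date, time, AM/PM, message text
LINE = re.compile(r"^\d+\s+(\S+)\s+\d{1,2}/\d{1,2}/\d{4}\s+\S+\s+[AP]M\s+(.*)$")
MARKER = "exited with zero status but no 'finished' file"


def restarts_by_project(lines):
    """Count unexpected task exits per project name."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if m and MARKER in m.group(2):
            counts[m.group(1)] += 1
    return counts


# A subset of the Linux-F13 lines from the post above.
SAMPLE = [
    "6305 Einstein@Home 3/23/2013 1:33:41 PM Task p2030.20121223.G202.81-01.04.C.b5s0g0.00000_3128_1 exited with zero status but no 'finished' file",
    "6308 Einstein@Home 3/23/2013 1:33:42 PM Task p2030.20121223.G202.81-01.04.C.b4s0g0.00000_3152_1 exited with zero status but no 'finished' file",
    "6310 WUProp@Home 3/23/2013 1:33:42 PM Started upload of wu_v3_1363211664_405728_0_0",
    "6312 OPTIMA@HOME 3/23/2013 1:35:03 PM Task smallexp_s1_ss3_120_2_n58973_0 exited with zero status but no 'finished' file",
]
```

On this sample, `restarts_by_project(SAMPLE)` counts two unexpected exits for Einstein@Home and one for OPTIMA@HOME.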

STE\/E
Joined: 28 Mar 10
Posts: 672
Credit: 3,991,805
RAC: 698
Message 1096 - Posted: 23 Mar 2013, 21:54:08 UTC

Don't know if this will work but try running this in your terminal >>> sudo apt-get install openssh-server+ gdebi+ libwxgtk2.8-0+ libXss1+ freeglut3+ gnome-applets+ cpufrequtils+ ia32-libs+

It was given to me by someone else & I use it for every install of Linux & BOINC; it will download & install everything needed to run BOINC properly for Ubuntu 12.10 or .04 ...

Tex1954
Message 1097 - Posted: 23 Mar 2013, 23:28:31 UTC - in response to Message 1096.  

I'll try that in a bit... Thanks!

I am trying Ubuntu Server 12.04 now. I got it running in a VM with BOINC sorta running. Problem is, graphic windows are messed up with the Gnome desktop I installed. Possibly it's vnc4server conflicting with desktop sharing??? Don't know yet...

BUT, I like 12.04 server!! No crap, basic, works...

I'll let you know. Next is to uninstall the 7.0.27 and try the 7.0.28!

8-)

Tex1954
Message 1099 - Posted: 24 Mar 2013, 8:25:21 UTC - in response to Message 1097.  
Last modified: 24 Mar 2013, 8:27:09 UTC

STE\/E [BADger] You are a lifesaver!!! BLESS YOU!!!!

I've tried for a YEAR to make BOINC work on something other than Ubuntu because I wanted to run updated versions for OpenCL support and other obvious reasons. NOBODY could help me with the library problem thing and weeks of Google searching turned up nothing.

For the first time ever, I've got it working on something besides Ubuntu Desktop!!!

Fifty cheers and 75 virgins to you!!!!

GAWD I can't believe it!

IT WORKS!!!

I'm playing now with a couple of things; the server version may be my easiest method... but I want an updated kernel too, so maybe Fedora or Arch or something... in any case, phase one is to use what works, which is Ubuntu Server 12.04 with the Gnome GUI...

THANK YOU A MILLION TIMES!!!

8-)

PS: And this may solve wuprop problems!!!

Tex1954
Message 1100 - Posted: 25 Mar 2013, 4:10:29 UTC
Last modified: 25 Mar 2013, 4:29:43 UTC

Okay, Ubuntu Broke the server releases in that they won't detect/use my Asus WiFi USB Dongle... so, tried Linux Mint Mate and that was bad... Tried a few others and they had various problems...

Another thing is that Ubuntu 11.10 and lower are missing a certain library that has to be compiled against the kernel, and that's not an easy task... so something like kernel 3.3 or higher is needed..

I loaded Linux Mint 14 "Nadia" Cinnamon because it HAD the proper features (like desktop sharing etc.) that were broken in the "MATE" version, and that works great... late-version kernel and all!!!

I have various flavors of BOINC 7.0.28,56,58 running on 4 boxes now under Mint...

We will see how things go... if no problems, I'll upgrade the other 4 boxes...

Thanks for the help! So far so good!

9-)

STE\/E
Message 1101 - Posted: 25 Mar 2013, 9:28:19 UTC

:) ... Glad it worked for you, the line I gave you is a Life Saver for me too, Thanks goes to Zombie for that ...

Tex1954
Message 1104 - Posted: 26 Mar 2013, 7:37:40 UTC - in response to Message 1101.  

All the Linux boxes are now upgraded and so far not a hint of any problem...

Sooo, it would appear the combination of an older kernel and older BOINC client may have caused the problem...

Anyways, so far so good...

Thanks again!

8-)

Tex1954
Message 1108 - Posted: 26 Mar 2013, 21:00:22 UTC

I consider the problem solved. Everything is working perfectly, the remote desktop is 10 times faster connecting and updating than with anything Ubuntu... All the 7.x.xx clients I have tried work!

My post about this experience is HERE.

Thanks again to Ste\/e and I suggest everyone having problems with a BOINC targeted Linux install use Linux Mint "Nadia" 64b version!

Enjoy!

8-)

©2024 Sébastien