Problem with Linux locking up machines???

STE\/E
Joined: 28 Mar 10
Posts: 642
Credit: 3,866,603
RAC: 463
Total hours: 20,097,003
Message 1111 - Posted: 27 Mar 2013, 8:34:39 UTC

lol ... One of my Linux boxes locked up several times this morning because of Internet issues, and I lost about 150 hours of SLinCA WUs to computation errors when it locked up. I moved the wireless pickup on the box and it seems okay for now ...
____________
https://signature.statseb.fr/sig-1323.png
https://stats.free-dc.org/badgesbanner.php?cpid=13a87c3a303bcdca4ba0ed600daebb6b

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1112 - Posted: 27 Mar 2013, 15:10:27 UTC - in response to Message 1111.
Last modified: 27 Mar 2013, 15:24:27 UTC

I had internet issues as well just a while ago... several WUProp WUs errored out and locked up the network again...

I thought it was fixed, guess not.

NO OTHER PROJECT WORK UNITS LOCK UP OR ERROR OUT!!!!!!

WUProp has GOT to fix this... I'm letting what is there run out... but I've set NNT (No New Tasks) on all machines now...

Sigh... something like 16 compute errors today alone...

http://wuprop.boinc-af.org/results.php?userid=4388&offset=0&show_names=0&state=5


99% of the time it's the Linux boxes that have the problem, but today one Windows 7 box did it too.

I've done everything I can on my end; it's up to the folks here to fix it now. I can't imagine WHY an internet interruption would cause a WU failure... it doesn't make sense!

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1113 - Posted: 28 Mar 2013, 0:17:06 UTC

Okay, new experiment... The router I've been using is a new ASUS RT-N66U with stock firmware. I went back to my Linksys WRT320N with DD-WRT firmware, restarted all the computer links, restarted all the WUProp stuff, then set about trying to MAKE IT screw up.

I've reset the router, reset the modem, changed the global MTU, and changed the channel bandwidth from 20 to 40 and back to 20, and NOTHING is messing up!

So now I'm going to see about putting DD-WRT on the new router and see if that makes a difference...

I'll report back.

8-)

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1114 - Posted: 28 Mar 2013, 1:18:23 UTC - in response to Message 1113.

Update:

ONE system had an error with a WUProp task during my continued fiddling:

http://wuprop.boinc-af.org/result.php?resultid=25668779

BUT, it didn't lock up the network as before and no other failures... so DD-WRT seems to fix something in the WRT320N router.

I'm in the process of upgrading the ASUS RT-N66U router to DD-WRT now...

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1118 - Posted: 29 Mar 2013, 1:08:14 UTC
Last modified: 29 Mar 2013, 1:10:25 UTC

I give up...

It is still generating errors that clobber other projects' WUs, and it seems related to the internet speed.

Two routers, each with stock and DD-WRT firmware, and I get the same problems.

There is something going on that may be time related... maybe a response is required within a certain time frame, either from the WU or from the website it reports to, or something. Whatever it is, it surely HATES WiFi connections.
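
To illustrate the timing theory, here is a minimal Python sketch of how a client can treat a slow or dropped connection as a recoverable event: each attempt is bounded by a timeout, and failures are retried with a growing delay. This is not WUProp's code and says nothing about what the app actually does; the names `http_get` and `fetch_with_retry`, and the attempt count and delays, are assumptions made up for the example.

```python
import time
import urllib.request


def http_get(url, timeout):
    """Fetch a URL, giving up after `timeout` seconds of network silence."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetch=http_get, attempts=4, timeout=10.0,
                     base_delay=1.0, sleep=time.sleep):
    """Try `fetch` up to `attempts` times, doubling the wait after each failure.

    Returns the fetched bytes, or None if every attempt failed, so the
    caller can carry on working and try again later.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url, timeout)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == attempts:
                return None
            sleep(delay)
            delay *= 2
    return None
```

The point of the design is that a WiFi dropout ends in a `None` return after a bounded wait, not in a hung or crashed process.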

I have 7 systems connected with WiFi and have no other problems with any other project... just WUProp...

I give up now and set NNT on all the systems. Maybe later when I get them rack mounted on a switch it will work better, but for now, I just give up.

In a nutshell, this is what I have done:

Installed/upgraded 7 systems to a 3.5.x kernel
Installed several versions of BOINC from 7.0.28 to 7.0.58 on 8 systems
Tried every good WiFi channel (1, 6, 11) plus various bandwidth and power settings
Tried 2 different WiFi routers, each with stock and DD-WRT firmware
Rebooted 7 systems 60+ times at least... by hand with my ONE portable monitor.

In all this, it also randomly fails on the 2 Windows boxes, but that seems to happen only when the network gets clobbered by another box.

WUProp is for sure identified as the nasty WU that clobbers my systems.

I won't run it anymore until a positive reason for this behavior is deduced and a positive fix implemented.

8-)

STE\/E
Joined: 28 Mar 10
Posts: 642
Credit: 3,866,603
RAC: 463
Total hours: 20,097,003
Message 1120 - Posted: 29 Mar 2013, 11:17:37 UTC
Last modified: 29 Mar 2013, 11:24:32 UTC

Same here Tex >>> http://wuprop.boinc-af.org/forum_thread.php?id=176

I've lost over 300 hours of SLinCA WUs this morning due to the FUBAR WUProp WU erroring and taking every running WU with it on my Linux boxes ... It's like I have to run 1000 hours of WUs and hope 100 hours of them make it without the WUProp WU freaking everything up. I'm getting to the point where it's not worth running this project any more, with all the lost work it creates on the other projects ...
____________
https://signature.statseb.fr/sig-1323.png
https://stats.free-dc.org/badgesbanner.php?cpid=13a87c3a303bcdca4ba0ed600daebb6b

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1124 - Posted: 29 Mar 2013, 21:08:19 UTC
Last modified: 29 Mar 2013, 21:11:31 UTC

I'm not running it anymore and will not.

If you trace the IPs it connects to when it runs, it goes to the statistics website, where I am sure there is some code to talk to it. But even so, even if the net crashes or the ISP goes down or the router breaks, no other projects crash! They just keep on crunching, with uploads/downloads (UL/DL) delayed as WUs are completed.
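
One low-effort way to see which servers the client talks to is to pull them out of the BOINC event log instead of sniffing traffic. Below is a minimal Python sketch, assuming the log contains libcurl `[http]` lines of the form "Info: Connected to <host> (<ip>) port <n>", as in the log excerpt posted elsewhere in this thread. The function name `connected_endpoints` is made up for the example, and it only sees connections that the BOINC client itself logs.

```python
import re

# Matches e.g. "Connected to wuprop.boinc-af.org (46.105.102.130) port 80"
CONNECTED = re.compile(
    r"Connected to (?P<host>\S+) "
    r"\((?P<ip>\d{1,3}(?:\.\d{1,3}){3})\) port (?P<port>\d+)"
)


def connected_endpoints(log_lines):
    """Return the distinct (host, ip, port) triples found, in first-seen order."""
    seen = []
    for line in log_lines:
        match = CONNECTED.search(line)
        if match:
            endpoint = (match["host"], match["ip"], int(match["port"]))
            if endpoint not in seen:
                seen.append(endpoint)
    return seen
```

Feeding it the lines of a saved event log (for example `connected_endpoints(open(path))`, where `path` is wherever you saved the log) gives one entry per server contacted.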

Not so with WUProp... it clobbers BOINC and the LAN for some reason whenever a real-time network glitch happens or something times out. It clobbers things so badly that I bet there are some coding errors in there, like references to && or *p without external defines or something. It really stinks of a memory leak, a bad execution path, or stack corruption...

I won't tolerate that anymore at all... some LONG WUs, like those from The Lattice Project and RNA, can take a week or more to complete, and WUProp trashes them.

Not a happy camper anymore with this project on Linux.

8-(

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1125 - Posted: 31 Mar 2013, 2:34:20 UTC

ZERO problems, ZERO glitches, ZERO weird things happening since I stopped running WUProp tasks..

What a breath of fresh air... no more random WiFi dropouts, no more random errors... everything running so perfectly and stable that I can't believe it...

I can finally stop monitoring the systems so much... they all work flawlessly now.

8-)

Dr Who Fan
     
Joined: 29 Jul 11
Posts: 316
Credit: 1,154,933
RAC: 362
Total hours: 1,596,241
Message 1131 - Posted: 1 Apr 2013, 1:23:21 UTC

I am also seeing some STRANGE problems with this project trashing work-in-progress and TURNING BACK THE CLOCK.

It is *only* happening on Windows PCs. EXAMPLE:

3/31/2013 7:59:58 PM | WUProp@Home | Sending scheduler request: To fetch work.
3/31/2013 7:59:58 PM | WUProp@Home | Requesting new tasks for CPU
3/31/2013 7:59:58 PM | | [http] HTTP_OP::init_post(): http://wuprop.boinc-af.org/wuproj_cgi/cgi
3/31/2013 7:59:58 PM | | [http] HTTP_OP::libcurl_exec(): ca-bundle set
3/31/2013 8:00:26 PM | | [http] [ID#1] Info: Connection #0 seems to be dead!
3/31/2013 8:00:26 PM | | [http] [ID#1] Info: Closing connection #0
3/31/2013 8:00:28 PM | | [http] [ID#1] Info: About to connect() to wuprop.boinc-af.org port 80 (#0)
3/31/2013 8:00:28 PM | | [http] [ID#1] Info: Trying 46.105.102.130...
3/31/2013 8:00:29 PM | | [http] [ID#1] Info: Connected to wuprop.boinc-af.org (46.105.102.130) port 80 (#0)
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: POST /wuproj_cgi/cgi HTTP/1.1
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: User-Agent: BOINC client (windows_intelx86 6.12.34)
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Host: wuprop.boinc-af.org
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Accept: */*
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Accept-Encoding: deflate, gzip
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Content-Type: application/x-www-form-urlencoded
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Content-Length: 8048
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server: Expect: 100-continue
3/31/2013 8:00:29 PM | | [http] [ID#1] Sent header to server:
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: HTTP/1.1 100 Continue
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: HTTP/1.1 200 OK
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Date: Mon, 01 Apr 2013 00:59:37 GMT
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Server: Apache/2.2.16 (Debian)
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Vary: Accept-Encoding
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Content-Encoding: gzip
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Content-Length: 1513
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server: Content-Type: text/xml
3/31/2013 8:00:30 PM | | [http] [ID#1] Received header from server:
3/31/2013 8:00:31 PM | | [http] [ID#1] Info: Connection #0 to host wuprop.boinc-af.org left intact
3/31/2013 8:00:32 PM | WUProp@Home | Scheduler request completed: got 1 new tasks
3/31/2013 8:00:35 PM | WUProp@Home | Starting task wu_v3_1364636537_69424_0 using data_collect_v3 version 342
3/31/2013 8:02:01 PM | FreeHAL@home | Task freehal_wu_nci_7717_input-00460-large_0 exited with zero status but no 'finished' file
3/31/2013 8:02:01 PM | FreeHAL@home | If this happens repeatedly you may need to reset the project.
3/31/2013 8:02:01 PM | correlizer | Task rc_9809187_1 exited with zero status but no 'finished' file
3/31/2013 8:02:01 PM | correlizer | If this happens repeatedly you may need to reset the project.
3/31/2013 8:02:01 PM | WUProp@Home | Task wu_v3_1364636537_69424_0 exited with zero status but no 'finished' file
3/31/2013 8:02:01 PM | WUProp@Home | If this happens repeatedly you may need to reset the project.
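
For anyone who wants to check their own log for the same pattern (tasks from several projects exiting at the same instant), here is a minimal Python sketch. It assumes event log lines in the "timestamp | project | message" layout shown above; the function name `simultaneous_exits` is made up for the example, and it only reports the coincidence, it does not prove which task caused it.

```python
from collections import defaultdict

MARKER = "exited with zero status but no 'finished' file"


def simultaneous_exits(log_lines):
    """Map each timestamp to the projects whose tasks hit MARKER at that moment.

    Only timestamps where more than one project was affected are returned.
    """
    by_time = defaultdict(list)
    for line in log_lines:
        # Split into at most three fields: timestamp, project, message.
        parts = [field.strip() for field in line.split("|", 2)]
        if len(parts) == 3 and MARKER in parts[2]:
            timestamp, project = parts[0], parts[1]
            if project not in by_time[timestamp]:
                by_time[timestamp].append(project)
    return {ts: projects for ts, projects in by_time.items() if len(projects) > 1}
```

Run on the excerpt above, it would group FreeHAL@home, correlizer and WUProp@Home under the single timestamp 3/31/2013 8:02:01 PM.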

____________

nanoprobe
   
Joined: 20 Feb 13
Posts: 34
Credit: 653,713
RAC: 0
Total hours: 3,329,673
Message 1133 - Posted: 1 Apr 2013, 15:20:38 UTC - in response to Message 1131.

Same problem here on WCG tasks.

Tex1954
 
Joined: 1 Jul 11
Posts: 29
Credit: 126,167
RAC: 0
Total hours: 35,245
Message 1539 - Posted: 31 Aug 2013, 3:45:00 UTC - in response to Message 736.

As of today, I still do not run WUProp anymore and have experienced ZERO LAN/Internet/system crashes...

So not running this app cures all my problems, BUT, I see others still have a LOT of problems...

Guess I'm done on this... Until I see zero problems...

8-)


Copyright © 2024 Sebastien