Posts by marmot

1) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5875)
Posted 2 Jan 2018 by marmot
A couple of the new WUs that showed up in VirtualBox Manager were actually 1-core instead of 2-core, so it was a reporting error in BoincMgr, as you suggested. Although I can't be certain it's the same reporting bug, it appears to be.

It's too annoying, and not acceptable behavior, while I'm trying to home in on the optimal settings for these machines.

Keeping the already-optimized work units in a separate client from the work units still under evaluation is yet another reason to keep the data directories split.
2) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5873)
Posted 31 Dec 2017 by marmot
Encountered another situation today.

I was moving from 2-core to 1-core WUs on the LHC VMs.

In BoincMgr, the core count is only nominal until the actual VM instance is created and started.

I changed the core count and RAM size in app_config.xml, told BoincMgr to read the config files, and even restarted BoincMgr without stopping the boinc.exe client, but BoincMgr continued to treat the unstarted LHC WUs as 2-core instances.

Only after shutting down the boinc.exe client and restarting it did BoincMgr register the correct core count for the VM WUs.
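For reference, here is a minimal sketch of the kind of app_config.xml change described above, generated with Python's xml.etree.ElementTree. The app name ATLAS, the plan class vbox64_mt_mcore_atlas, the 2500 MB memory figure, and the vboxwrapper options --nthreads / --memory_size_mb are assumptions for illustration, so check the project's own instructions for the real values. As the post says, tasks that are already downloaded may keep showing the old core count until boinc.exe itself is restarted.

```python
import xml.etree.ElementTree as ET


def vbox_app_config(app_name: str, plan_class: str, ncpus: int, memory_mb: int) -> str:
    """Build an app_config.xml body that pins a VirtualBox app to `ncpus` cores.

    app_name, plan_class and memory_mb are placeholders (assumptions); take the
    real values from the project's documentation or from client_state.xml.
    """
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # How many cores the client counts each task as when scheduling.
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # What gets passed on to the VM (assumed vboxwrapper options, see lead-in).
    ET.SubElement(version, "cmdline").text = f"--nthreads {ncpus} --memory_size_mb {memory_mb}"
    return ET.tostring(root, encoding="unicode")


print(vbox_app_config("ATLAS", "vbox64_mt_mcore_atlas", 1, 2500))
```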
3) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5872)
Posted 31 Dec 2017 by marmot
I still don't get why you're shutting down a client.


but it doesn't clear up the "Environment needs to be cleaned up" and "Postponed" error messages of the VirtualBox WUs, where I have to suspend all work and kill the vboxservice to clear the errors.

It's a known defect in the LHC WUs that they won't address.


I have to shut down each client and restart VirtualBox at least every 3 days.

Message discussion here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4526
4) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5870)
Posted 30 Dec 2017 by marmot
the number of processors on each instance so the total number of running tasks does not exceed the number of actual processors.
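The rule in that quote is just a sum: the CPU limits of all clients on the box should add up to no more than the physical cores, with a multicore VirtualBox task counting as however many cores it uses. A minimal sketch of the check; the client names are hypothetical, and setting each client's limit (for example through `<ncpus>` in that client's cc_config.xml, or the "use at most N% of the CPUs" preference) is assumed to be done separately.

```python
def fits_core_budget(client_ncpus: dict[str, int], physical_cores: int) -> bool:
    """True if the CPU limits of all clients add up to no more than the real cores."""
    return sum(client_ncpus.values()) <= physical_cores


# Hypothetical split matching the 4 + 2 cores on a 6-core machine used later in this thread.
print(fits_core_budget({"client_4core": 4, "client_2core": 2}, 6))  # True
```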


Is WUProp sophisticated enough to count the number of cores used by the multicore VirtualBox VMs?

Instead of restarting BOINC, you can have it reread the config files to pick up any changes.
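With two clients, that reread has to be sent to each one separately. A minimal sketch that builds the boinccmd calls, one per GUI RPC port; `--read_cc_config` is, as far as I know, the same action as BoincMgr's "Read config files" and also picks up app_config.xml in 7.x clients. The ports 31416 and 9000 are the ones mentioned in this thread, and the passwords are placeholders for the contents of each client's gui_rpc_auth.cfg.

```python
import subprocess


def reread_config_cmd(port: int, password: str, host: str = "localhost") -> list[str]:
    """boinccmd call asking the client on host:port to reread its config files."""
    return ["boinccmd", "--host", f"{host}:{port}", "--passwd", password, "--read_cc_config"]


def reread_all(clients: dict[int, str], run: bool = False) -> list[list[str]]:
    """Build (and optionally run) one reread command per client, keyed by GUI RPC port."""
    cmds = [reread_config_cmd(port, pw) for port, pw in clients.items()]
    if run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # raises if boinccmd reports an error
    return cmds


for cmd in reread_all({31416: "password-client-1", 9000: "password-client-2"}):
    print(" ".join(cmd))
```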


That's nice to know (I'm forgetting things; I used to know that), but it doesn't clear up the "Environment needs to be cleaned up" and "Postponed" error messages of the VirtualBox WUs, where I have to suspend all work and kill the vboxservice to clear the errors.

It's a known defect in the LHC WUs that they won't address.

Keeping the clients separate is still the best solution for this reason.
5) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5868)
Posted 30 Dec 2017 by marmot
Which project is this?

If you want to run this project that "doesn't make save points for 12 to 18 hours" when there is work available, and to keep the CPUs doing something else the rest of the day, then you can do this with resource share. Set the 2nd project to 0% resource share and it will only download work when there is nothing else to do.



A save point is where the WU saves its progress while performing the calculations. The CPUs are used 99.99% of the time. If you interrupt the WU before it makes a save point, then 4, 8 or up to 18 hours of 16 or 32 cores' worth of CPU cycles are just thrown into the void.
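Rough arithmetic for that loss, using the post's own figures (4 to 18 hours since the last save point, 16 or 32 cores). It assumes the worst case, where every core is running a task that started at the same time and none has checkpointed yet.

```python
def lost_core_hours(hours_since_checkpoint: float, cores: int) -> float:
    """CPU time discarded when tasks are stopped before their next checkpoint."""
    return hours_since_checkpoint * cores


for hours in (4, 8, 18):
    for cores in (16, 32):
        print(f"{hours:>2} h x {cores} cores = {lost_core_hours(hours, cores):.0f} core-hours")
```

At the top end that is 576 core-hours from a single shutdown.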

The split into two clients is very necessary to avoid lost work.


Oh, checkpoints. I read it wrong. Not sure a 2nd client is needed for that.


I'm sure, unless you know of a way to shut down boinc.exe that forces all WUs to create checkpoints even though the project doesn't support them.

I tried running the project in a VM and its RAC was 1/4 of what it gets running natively, so a separate client is the best solution. The project stays in RAM and keeps calculating while maintenance is performed on the other projects, and 96 to 192 CPU hours are not wasted.
6) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5865)
Posted 30 Dec 2017 by marmot
Which project is this?

If you want to run this project that "doesn't make save points for 12 to 18 hours" when there is work available, and to keep the CPUs doing something else the rest of the day, then you can do this with resource share. Set the 2nd project to 0% resource share and it will only download work when there is nothing else to do.



A save point is where the WU saves its progress while performing the calculations. The CPUs are used 99.99% of the time. If you interrupt the WU before it makes a save point, then 4, 8 or up to 18 hours of 16 or 32 cores' worth of CPU cycles are just thrown into the void.

The split into two clients is very necessary to avoid lost work.
7) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5863)
Posted 30 Dec 2017 by marmot
(accidental double post)
8) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5862)
Posted 30 Dec 2017 by marmot
The client is split because that particular project doesn't make save points for 12 to 18 hours, and I lose 1/2 to 3/4 of a day of work whenever I shut down BOINC to deal with the other projects' technical difficulties or to change my app_config files.


Guess I need to ask [AF>WildWildWest]Sebastien for help?
9) Message boards : Number crunching : I have two boincs clients on the same computer (Message 5860)
Posted 30 Dec 2017 by marmot
I'm trying this again but it's not working this time.

Errors:
"12/29/2017 6:10:40 PM | WUProp@Home | Your app_config.xml file refers to an unknown application 'data_collect_v4'. Known applications: None"


And:
"12/29/2017 6:10:56 PM | WUProp@Home | Not accepting requests from this host"
10) Message boards : Number crunching : How to add this project on BOINC for Android (video) (Message 4011)
Posted 14 Jul 2016 by marmot
Here's a video I found on how to add this project to BOINC for Android.

(Yeah, I didn't find the 'add by URL' ability for 20 minutes... doh...)

https://youtu.be/VLYCWGH20po

Oh, and shouldn't there be a Q&A section for Android now?
11) Message boards : Number crunching : WU on one machine never ends. (Message 3650)
Posted 8 Feb 2016 by marmot
Removing the project and re-adding it seemed to get WUs reporting again, but there are still 3 old WUs that show as In Progress.

There is no way to abort them, either client side or here.
12) Message boards : Number crunching : WU on one machine never ends. (Message 3647)
Posted 7 Feb 2016 by marmot
I have reset the project twice now and aborted the WU. The first one had been running for 12 days and kept saying it had about 11 seconds left. Today, on the same machine, there was one that had been running for 26 hours and was still showing about 1 minute 11 seconds left.
It's as if the work units are in an asymptotic countdown that will take at least a month to finally tick over the last second.

What other methods can I try to repair this?

The machine used to work, and used to have two clients to split the workload. I tried merging the two IDs, but the database says they are incompatible. Is this part of the problem? Should I try to force a new cross ID?

I only see one WU, but the last contact was today and it shows three WUs in progress, and that is after resetting the project.

    Task     | Work unit | Sent                     | Time reported or deadline | Status      | Run time (sec) | CPU time (sec) | Credit | Application
    55703061 | 54492885  | 7 Feb 2016, 5:48:07 UTC  | 17 Feb 2016, 5:48:07 UTC  | In progress | ---            | ---            | ---    | Data collect version 4 v4.20 (nci)
    55673139 | 54465380  | 5 Feb 2016, 22:37:22 UTC | 15 Feb 2016, 22:37:22 UTC | In progress | ---            | ---            | ---    | Data collect version 4 v4.20 (nci)
    55616038 | 54412637  | 3 Feb 2016, 22:11:02 UTC | 13 Feb 2016, 22:11:02 UTC | In progress | ---            | ---            | ---    | Data collect version 4 v4.20 (nci)
13) Message boards : Number crunching : Validate error's (Message 3560)
Posted 30 Dec 2015 by marmot
Thanks a lot!

Here's the solution from that post (I don't understand why quote or code tags strip whitespace on these forums):

    <app_config>
        <app_version>
            <app_name>data_collect_v4</app_name>
            <plan_class>nci</plan_class>
            <avg_ncpus>0.01</avg_ncpus>
            <cmdline>-p xxxx</cmdline>
        </app_version>
    </app_config>



where xxxx is your client's non-standard GUI RPC port (anything other than the default 31416).
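Since the forum eats the indentation, one way to get a clean copy is to generate the same file. A minimal sketch with xml.etree.ElementTree; the element values are exactly the ones from the post above, and port 9000 is the second client's port mentioned further down. The output would go in an app_config.xml inside the WUProp folder under that client's projects directory.

```python
import xml.etree.ElementTree as ET


def wuprop_app_config(port: int) -> str:
    """app_config.xml body that passes a non-default GUI RPC port to WUProp's data_collect_v4 app."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "data_collect_v4"
    ET.SubElement(version, "plan_class").text = "nci"
    ET.SubElement(version, "avg_ncpus").text = "0.01"
    ET.SubElement(version, "cmdline").text = f"-p {port}"
    ET.indent(root, space="    ")  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


print(wuprop_app_config(9000))
```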

14) Message boards : Number crunching : mult clients (Message 3559)
Posted 30 Dec 2015 by marmot

Is it still necessary to use app_info.xml? I'm not running multiple clients, but am running boinc on a port other than 31416 and found it only needed the following in an app_config.xml:

    <app_config>
        <app_version>
            <app_name>data_collect_v4</app_name>
            <plan_class>nci</plan_class>
            <avg_ncpus>0.01</avg_ncpus>
            <cmdline>-p 31417</cmdline>
        </app_version>
    </app_config>


This requires a boinc client version of 7.2.39 or higher.


Thanks!
This is just what I needed for splitting into two clients on my 6-core machine and getting WUProp to read the second client.

The work fetch algorithm doesn't read app_config files, so it doesn't take into account the <project_max_concurrent>2</project_max_concurrent> limit for ATLAS on an 8 GB RAM machine, or the 2-core maximum that the vLHC project dictates. Nor does work fetch seem to account for a high-resource-share project that is out of work for days, which is much more common than I expected: DENIS is reworking their servers, Citizen's grid seems to have been off for the last month, etc.

Looks like the federal funding for BOINC is gone, so the three dedicated coders (Anderson and others) are off while volunteer coders take over. I'm hoping they go back to the days of BOINC versions 5 and 6, when it was quite easy to run multiple clients, the Windows registry wasn't relied upon, and each client had its own data directory along with a command-line switch telling the client where to find its data directory. At the least, the Windows installer should be fully aware of multiple installed copies and allow multiple managers to start up if they continue to rely on the Windows registry.

Oh, I dream of a BOINCMGR.exe GUI with commands on the left that let you assign a project to particular cores of your computer; a resource allocation system that doesn't care about RAC, only about the hours a project gets on its assigned cores; and a work fetch that never idles a core when a dedicated project has no work, but instead suspends that project, goes down the list from highest resource share to lowest, and grabs 1 temporary WU from the next project with work available to keep the cores from idling.
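Purely as an illustration of that wished-for policy (this is not how BOINC's actual work fetch works), a toy per-core planner: each core keeps its dedicated project while that project has work, otherwise it borrows one temporary WU from the highest-resource-share project that does. The project names come from this thread; the resource shares and core assignments are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    name: str
    resource_share: float
    has_work: bool


def plan_cores(assignments: dict[int, str], projects: list[Project]) -> dict[int, Optional[str]]:
    """Map each core to the project it should run next (None if nothing has work)."""
    by_name = {p.name: p for p in projects}
    ranked = sorted(projects, key=lambda p: p.resource_share, reverse=True)
    plan: dict[int, Optional[str]] = {}
    for core, name in assignments.items():
        if by_name[name].has_work:
            plan[core] = name  # dedicated project still has work
        else:
            # Walk down from highest to lowest resource share for a temporary WU.
            plan[core] = next((p.name for p in ranked if p.has_work), None)
    return plan


projects = [Project("DENIS", 100, False), Project("vLHC", 50, True), Project("ATLAS", 25, True)]
print(plan_cores({0: "DENIS", 1: "DENIS", 2: "vLHC", 3: "ATLAS"}, projects))
```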
15) Message boards : Number crunching : Validate error's (Message 3557)
Posted 30 Dec 2015 by marmot
Jord (Ageless) suggested I move some of my WUs (those that aren't VBox themselves) into a BOINC virtual machine, because the client doesn't handle app_config.xml limits on WUs properly.

Instead of using a virtual machine, I split BOINC into two clients running 4 and 2 cores on a 6-core machine, to solve the work fetch problem of cores sitting idle when high-resource-share projects on that machine run out of work (it works really well now). However, WUProp, which monitors one of the clients (I think the 4-core one doing most of the work), gives 90% validation errors even though 30-minute WUs are being completed.



I am not sure which of these is the 2-core client and which is the 4-core one, because the two clients are now nameless, and I'm not sure how to fix that. So the two descriptions might actually be swapped.

Client in the installation directory: 2 cores plus the GPU
2nd client: 4 cores, copied to a new directory, running boinc.exe on a new port, with a second boincmgr.exe process connecting to it manually
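A minimal sketch of the command line for that second client, built in Python. `--allow_multiple_clients`, `--dir` and `--gui_rpc_port` are standard BOINC client options; the two paths are hypothetical, and 9000 is the port from this post. The second BoincMgr then has to be pointed at localhost on that port, using the password from gui_rpc_auth.cfg in the new data directory.

```python
from pathlib import Path


def second_client_cmd(boinc_exe: Path, data_dir: Path, rpc_port: int) -> list[str]:
    """Command line for an extra BOINC client with its own data directory and GUI RPC port."""
    return [
        str(boinc_exe),
        "--allow_multiple_clients",        # don't refuse to start while another client runs
        "--dir", str(data_dir),            # separate data directory (projects, slots, configs)
        "--gui_rpc_port", str(rpc_port),   # BoincMgr, boinccmd and WUProp must connect here
    ]


# Hypothetical install and data paths.
cmd = second_client_cmd(Path(r"C:\Program Files\BOINC\boinc.exe"), Path(r"D:\BOINC2"), 9000)
print(cmd)
```

To actually start it, pass `cmd` to `subprocess.Popen`.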

If I understand correctly, WUProp connects to the client through the standard port (like a telnet session) instead of actually monitoring the machine's RAM. Since the second client is on port 9000, is this the problem? Can I tell WUProp the new port number?
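To see which ports actually have a client listening behind them (the default 31416 and the second client's 9000), a quick TCP check is enough. This sketch only confirms that something accepts connections on the port; it does not verify that it is a BOINC client.

```python
import socket


def client_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


for port in (31416, 9000):
    print(port, client_listening(port))
```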






Copyright © 2018 Sebastien