Eugene Stemple wrote:...
It's only these last few words that I believe are incorrect :-). You certainly do lose 'wall-clock time' but my understanding is that the values reported back to the project don't show that. When a saved checkpoint is reloaded, so too are the CPU time and elapsed time values that were current when the checkpoint was written to disk. You could be restarting days later but those previous values will be used and incremented further to give the final totals.
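If anyone wants to see that state on their own box, here's a rough sketch that reads it back. The slot path and the exact tag names (checkpoint_cpu_time / checkpoint_elapsed_time) are assumptions on my part - check them against a real boinc_task_state.xml in one of your slot directories:

import xml.etree.ElementTree as ET

# Assumed location - the BOINC data directory differs between distros, and the
# slot number depends on which running task you want to look at.
STATE_FILE = "/var/lib/boinc-client/slots/0/boinc_task_state.xml"

root = ET.parse(STATE_FILE).getroot()
for tag in ("checkpoint_cpu_time", "checkpoint_elapsed_time", "fraction_done"):
    node = root.find(tag)
    if node is not None:
        # These are the values the client restores when a task restarts from its checkpoint.
        print(f"{tag}: {float(node.text):.1f}")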
Before making my previous response, I had checked one of George's results for comparison with one of Alex's. It was obvious that the Alex results were stopping and starting repeatedly and the George result wasn't. However, that difference can't be the cause of the extraordinarily long times being reported.
I used this George result where 60 checkpoints had been used in a total time of around 12hrs 23mins. That's just over 12 mins per checkpoint. I then found this Alex result which had many stops and restarts. However, the key thing I was looking for was the final restart and the run to the finish, since that would have timestamps that would allow the elapsed time for a checkpoint to be measured. I've included below the actual part of the log that shows these timestamps. The first snip is the restart line with its timestamp:-
2020-12-31 18:57:04.8502 (14881) [debug]: Successfully read checkpoint:53
After that, there is a single line of dots for the final checkpoint and then the finish timestamp:-
2020-12-31 20:31:15.1325 (14881) [normal]: Finished main analysis.
The difference between the two timestamps is more than 1.5 hours compared to 12 mins for a George result.
This is not time being lost by wrong settings. This is a processor crawling instead of running. There's got to be a reason for it.
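For anyone who wants the arithmetic spelled out, here's the same comparison as a quick Python sketch (the timestamps come from the log lines above, and George's figure is the 12hrs 23mins over 60 checkpoints mentioned earlier):

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
restart = datetime.strptime("2020-12-31 18:57:04.8502", fmt)
finish = datetime.strptime("2020-12-31 20:31:15.1325", fmt)

alex_last_leg = (finish - restart).total_seconds() / 60    # restart at checkpoint 53 to finish
george_per_checkpoint = (12 * 60 + 23) / 60                # 12hrs 23mins spread over 60 checkpoints

print(f"Alex, final leg:        {alex_last_leg:.1f} min")          # ~94 min
print(f"George, per checkpoint: {george_per_checkpoint:.1f} min")  # ~12.4 min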
I gave a link to the full Alex result from which those two timestamps were extracted. If you go right to the bottom of the complete log and look at what came after the finish line, you'll see mention of "Recalculating statistics for the final toplist..." This is something extra that is done at the very end to assemble a top ten list of candidate signals. It should take maybe a minute or so. Here it shows over 18 mins. No checkpoints to worry about during this time.
The 'details' pages for the Alex and George machines show exactly the same processor and RAM, but different OSes. The BOINC benchmark values are rather different - the Alex machine has higher numbers, which seems weird. I wonder if something is wrong with Alex's Fedora install.
Cheers,
Gary.
Maybe we should retitle this thread or start a new one. :) The conversation has certainly morphed into something quite different, and I'd be thrilled if jobs ran seven times as fast as they do now. (OTOH, starting new threads tends to cause fragmentation of conversations, and that hurts everyone.)
I've watched my CPU churning via the System Monitor, and it's working its tail off all the time, which I expect is normal for BOINC. My 32 cores all stay in the 80% to 100% range consistently, and no, I usually leave the machine idle with only BOINC (and the underlying terminal and OS) running - except during the day, when I waste time doing other things and drive the box directly. (That might explain some of the stops and starts you see.)
Fedora 33 was released in October 2020, and I did a straight upgrade from Fedora 32, so I can't blame the company that built the machine for that.
The good news, Gary, is that I had this machine built for me to compile Mozilla Firefox code, and it does do that pretty quickly. I'm a software engineer, so if there's a performance problem I can at least try to compile it and see what's going on. My profiling skill set is weak, though.
The bad news is that I'm not all that great with hardware. Also, my vacation time is ending.
Bottom line, I'm willing to put in some evening (Pacific time zone) and weekend time to help get to the bottom of this. But I'd also need to be put in touch with BOINC developers who know their stuff as well.
Eugene Stemple wrote: Where are the controls for suspending tasks by BOINC? Options -> Computing Preferences -> Computing, where you will see a whole section on "When to suspend". I forget what the default settings are, but possibly you have chosen (intentionally or not) "Suspend when computer is in use", and depending on the other parameters in that section your E@H CPU tasks could end up suspended for something as innocuous as moving the mouse.
That was marked checked, so I have just unchecked it. Fingers crossed.
To answer another question on this thread, I have a 2TB SSD drive installed.
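For what it's worth, here's a rough way to confirm what the client actually stored after that change. It's a sketch only - the data directory path is the Debian/Ubuntu default and the tag names are my assumptions, so check them against your own global_prefs_override.xml:

import xml.etree.ElementTree as ET

PREFS = "/var/lib/boinc-client/global_prefs_override.xml"  # assumed path; Fedora's data dir may differ

root = ET.parse(PREFS).getroot()
for tag in ("run_if_user_active", "idle_time_to_run", "suspend_cpu_usage", "max_ncpus_pct"):
    node = root.find(tag)
    print(f"{tag}: {node.text if node is not None else '(not set)'}")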
Alex,
How many BOINC tasks are you running at the same time? I don't know what the memory footprint of a GW CPU task is (as I don't run them) and I don't know how much of a cache-hog it might be, but it is possible that you've got so much going on on the machine that the CPUs look busy but are actually stalling out waiting for RAM access! (I had that happen on a 3700X (32GB) with certain workload mixes [WCG and CPDN CPU jobs, not Einstein...])
I'm not saying that is what's happening, but it's relatively easy to test - if you currently allow BOINC to have (say) 87.5% of the CPUs, reduce that to (say) 50% and see if jobs seem to process considerably quicker!
I seem to remember a thread or two over at SETI@home where people were trying to work out the optimum work load on Ryzens (Threadrippers &c) with 32 "CPUs", and I think they settled on not using many more than 24 at a time; the machines actually had better total throughput when apparently being under-loaded! I can't remember how much memory some of those had, but the bottleneck would probably be access rather than quantity.
Just a suggestion, which I hope might help...
Cheers - Al.
Uh, the default, which turned out to be 1 per CPU, so 32.
Alex Vincent wrote: Uh, the default, which turned out to be 1 per CPU, so 32.
I'd definitely turn that down a bit; I suspect that performance-monitoring tools would indicate that your CPUs are running at reasonable clock rates but not processing anywhere near as many instructions per cycle as they could do if less heavily loaded - there's memory to fight over, and the operating system needs a share of the machine as well!!!
I've found that my 3700X is happiest if I leave two or three notional CPUs free for the various external tasks on the machine, then I need one for my GPU and the remaining 12 get to run CPU stuff for WCG and CPDN. So I've got mine set to 81.75% of CPUs (13 out of 16).
Perhaps try 75% for starters? If that's a dramatic improvement, let it run a few more, but probably no more than 28.
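In case it helps, the mapping between "threads in use" and BOINC's "use at most X% of the CPUs" setting is just simple division, e.g. for a 32-thread box:

def pct_for(threads_used, threads_total=32):
    # BOINC's "use at most X% of the CPUs" value for a given number of busy threads.
    return 100.0 * threads_used / threads_total

for n in (16, 24, 28):
    print(f"{n} of 32 threads -> set {pct_for(n):.1f}%")
# 16 -> 50.0%, 24 -> 75.0%, 28 -> 87.5%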
I don't know whether Einstein CPU jobs update the Progress and time figures properly in BOINC Manager (the GPU ones seem to...) -- if they do, you should be able to get an idea of whether there's an improvement even before any tasks finish!
On a separate note, if you're planning on letting BOINC run in the background whilst using the machine as a development platform, check how much of your total memory usage is due to BOINC. You may think you've got a lot of RAM, but how much do your work tools need???
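A rough way to get that number, if you'd rather not add it up by hand - a sketch that simply matches process names (the 'boinc'/'einstein' name test is an assumption; adjust it to whatever the science apps are actually called on your host):

import psutil

total = 0
for p in psutil.process_iter(["name", "memory_info"]):
    name = (p.info["name"] or "").lower()
    mem = p.info["memory_info"]
    if mem and ("boinc" in name or "einstein" in name):   # assumed naming; adjust as needed
        total += mem.rss
print(f"BOINC-related resident memory: {total / 2**30:.2f} GiB")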
Happy crunching, and find a Gravity Wave!
Good luck - Al.
Alex Vincent wrote: Uh, the default, which turned out to be 1 per CPU, so 32.
You actually only have 16 physical cores (32 threads - i.e. 'virtual' cores) so allowing 32 full CPU tasks may well be the whole problem. Try setting BOINC to use 50% of the processors as a quick check to see if those 16 will run at the proper speed.
I don't run CPU tasks at all and I don't use high core count processors, so I have no experience with these beasts. I'm using a quad core Athlon 3000G with a $120 GPU (RX 570 8GB) which can complete a full GW GPU task (same work content and credit as the CPU tasks) in 9 mins each one (4 tasks run concurrently on the GPU in less than 36 min for the lot). I get 32 tasks every 4.8 hours. These budget machines only have 8GB RAM and run cool and fast.
Cheers,
Gary.
For my Ryzen 3900X, which is a CPU-only cruncher, I limit Einstein to half: I use an app_config with a project max concurrent of 12. Some projects work well on all cores but the Einstein and CPDN ones don't. With 32 threads going you are going to have a bottleneck on memory access; there are only 2 channels on the Ryzen 3000 and 5000 chips.
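For reference, a minimal sketch of the kind of app_config.xml being described - the project directory path is an assumption (it's derived from the project URL), and 12 is just the max concurrent figure mentioned above:

import pathlib

project_dir = pathlib.Path("/var/lib/boinc-client/projects/einstein.phys.uwm.edu")  # assumed path
(project_dir / "app_config.xml").write_text(
    "<app_config>\n"
    "   <project_max_concurrent>12</project_max_concurrent>\n"
    "</app_config>\n"
)
# Then use Options -> Read config files in BOINC Manager so it takes effect without a restart.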
BOINC blog
I've been running for a couple of days at 50% the number of threads. How's it look now?
Alex Vincent wrote: How's it look now?
Whilst the latest results are much faster, there still must be something slowing things down.
The latest GW crunch times have dropped from ~260-280K to around 120Ksecs. By using just half the threads, you are now getting more output than you were previously.
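The back-of-the-envelope throughput comparison, taking the midpoint of those earlier times:

day = 24 * 3600.0
before = 32 * day / 270_000   # 32 concurrent GW tasks at ~270 Ksec each
after = 16 * day / 120_000    # 16 concurrent GW tasks at ~120 Ksec each
print(f"before: {before:.1f} tasks/day, after: {after:.1f} tasks/day")   # ~10.2 vs ~11.5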
You still have to work out why George's identical CPU takes around 50K (when I last looked) but yours is still more than twice that. Do you have other multi-thread, high compute load jobs running a lot of the time?
You have a few GRP CPU tasks as well and the latest result there has a much better improvement - from 77K for a result on Jan 2 to just 27K for the latest one. Based on that value (and the fact that GW tasks should be less than twice the time for GRP tasks), you really should be able to get times of around 50K rather than 120K for GW tasks.
Cheers,
Gary.