Tasks errored out on Android phone

nekomi_ch
Joined: 21 Apr 24
Posts: 3
Credit: 976700
RAC: 1393
Topic 231605

As the title says: my Android phone is unable to complete any tasks; they all error out.

Scrooge McDuck
Joined: 2 May 07
Posts: 1042
Credit: 17764620
RAC: 12880

Hello Nekomi_ch,

Welcome to einstein@home forums!

It's easier for others here in the forum to give you advice on problems if you 'unhide' your computers. Other e@h users can then see basic information: type of hardware, BOINC and science app versions, finished tasks and tasks in progress, the latest log file (which may contain error messages), etc.

To do so, open the e@h website preferences:

Account --> Preferences --> Privacy:

  • Should Einstein@Home show your computers on its website?: YES / NO
nekomi_ch
Joined: 21 Apr 24
Posts: 3
Credit: 976700
RAC: 1393

Thanks for your reply. I have just fixed the above issue; I hope that helps.

Scrooge McDuck
Joined: 2 May 07
Posts: 1042
Credit: 17764620
RAC: 12880

Hmmm... your tasks run until they reach 72,000 seconds of runtime (20 hours); by then they have accumulated ~35,000 seconds of CPU time but still have not finished. Checkpoints are written every 60 to 70 seconds, so there definitely is progress. The tasks are not stuck in some science app deadlock. As soon as 72,000 seconds of runtime are reached, the BOINC client terminates these long-running tasks with:

compute error: 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED

So the runtime limit that is configured in each e@h task when the workunit is generated was exceeded.

The question now is: how much runtime/CPU time should your type of ARM client normally need to finish such a task? It is possibly an old, not so powerful ARM. I don't know.

You can also see it the other way around: the runtime limits are too tight (this happened with upper memory bounds in the past for other science apps). BOINC runtime and memory bounds are intended to enforce termination of faulty (misconfigured) tasks or faulty science apps. These bounds are set when workunits are generated. Perhaps the project admins have to increase them because BRP4 now requires more computational effort than it did years ago?

I have no experience with BRP4 on ARM. Hopefully some ARM crunchers can chime in here and give more informed advice.

Link
Joined: 15 Mar 20
Posts: 121
Credit: 9388434
RAC: 42946

Same issue as here: basically, the CPU benchmarks are completely wrong and claim that your Android is nearly as fast as my Ryzen 5700G. The solution is the same as in the other thread.

Keith Myers
Joined: 11 Feb 11
Posts: 4959
Credit: 18638928756
RAC: 5355359

Happy that someone else has called out the very broken BOINC client CPU benchmark algorithm, which is incapable of distinguishing true CPU performance levels among different CPU architectures.

This unfortunately skews the CPU FLOPS performance calculations that are incorporated into the estimated task times in the client, and leads to 'exceeded time limit' errors aplenty.

 

Scrooge McDuck
Joined: 2 May 07
Posts: 1042
Credit: 17764620
RAC: 12880

Hmmm, I would like to point out that the bounds (fpops, memory, disk) for the BOINC client, which are defined in the workunit header, are independent of any CPU FLOPS or benchmarks of a specific CPU or host. These are set when the workunit is generated and are intended to ensure that the task is terminated in the event of unexpected errors, faulty workunits, faulty science apps, or whatever. The bounds are therefore set to a multiple of the estimated fpops, memory and disk requirements (example below).

BRP4 example:

<workunit>
    <name>p2030.1729685405.G46.27-03.52.N.b2s0g0.00000_3265</name>
    <app_name>einsteinbinary_BRP4</app_name>
    <version_num>170</version_num>
    <rsc_fpops_est>17500000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>350000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>260000000.000000</rsc_memory_bound>
    <rsc_disk_bound>20000000.000000</rsc_disk_bound>

  • Fpops estimate is 17,500 billion. Fpops limit is 350,000 billion, which is TWENTY times the Fpops estimate.

Twenty times more should be sufficient if the FPOPS estimate is anywhere on the same scale as the FPOPS actually required.

The question is... how does the BOINC client derive ~72,000 seconds from the FPOPS bounds given in the workunit header? That is roughly the time span after which each task on Nekomi_ch's ARM was terminated by BOINC. There is no time limit given in workunit headers, only FPOPS limits. How can there be an EXIT_TIME_LIMIT_EXCEEDED? Where or when is a time limit set? By workunit generators? Where is such a time limit defined? I can't find it in the BOINC client's XML files...

  • Memory limit is 260 MB. Such a BRP4 task requires ohhhh... 390 MB running on my Intel iGPU.

So, the memory bound doesn't fit this BRP4 workunit but is only slightly off. There will be no problems with memory as long as the host still provides free memory / doesn't operate near memory exhaustion.

The BOINC client is patient and does not terminate tasks that exceed the memory bound by only a fraction of that bound. Only when memory consumption reaches double the bound is the task terminated with a memory limits error. (I observed and documented this behaviour already in the forum for the former O2 gravitational wave CPU tasks, which had way too small memory bounds in the beginning.)

  • Disk limit is 20 MB, which is roughly TEN times the required disk space. A BRP4 task and its temporary files need less than 2 MB.
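
One possible answer to the question of where the ~72,000 seconds come from, as a rough sketch only: assuming the client turns the FPOPS bound into a runtime limit by dividing rsc_fpops_bound by the speed (in FLOPS) it assumes for the app on this host, the observed limit implies an assumed speed of roughly 4.9 GFLOPS. The division rule is an assumption here, not something stated in the workunit header, and the function names are made up for illustration.

```python
# Values copied from the BRP4 workunit header quoted above.
RSC_FPOPS_EST = 17_500_000_000_000.0
RSC_FPOPS_BOUND = 350_000_000_000_000.0

# Runtime at which the tasks were terminated (from the task logs discussed above).
OBSERVED_LIMIT_S = 72_000.0


def elapsed_limit_seconds(fpops_bound: float, assumed_flops: float) -> float:
    """ASSUMED rule: runtime limit = FPOPS bound / speed the client assumes."""
    return fpops_bound / assumed_flops


def implied_flops(fpops_bound: float, observed_limit_s: float) -> float:
    """Invert the assumed rule: which speed would produce the observed limit?"""
    return fpops_bound / observed_limit_s


bound_ratio = RSC_FPOPS_BOUND / RSC_FPOPS_EST            # bound is 20x the estimate
speed = implied_flops(RSC_FPOPS_BOUND, OBSERVED_LIMIT_S)  # roughly 4.9e9 FLOPS

print(f"bound / estimate       : {bound_ratio:.0f}x")
print(f"implied assumed speed  : {speed / 1e9:.2f} GFLOPS")
print(f"limit at that speed    : {elapsed_limit_seconds(RSC_FPOPS_BOUND, speed):.0f} s")
```

If that rule holds, the time limit is not stored anywhere as a number of seconds; it only appears once the FPOPS bound is combined with a host-specific speed value.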
Scrooge McDuck
Joined: 2 May 07
Posts: 1042
Credit: 17764620
RAC: 12880

The old quote from 2022-12-21 that user 'Link' referred to:

Link wrote:

EXIT_TIME_LIMIT_EXCEEDED means the task has run a lot slower than BOINC expected it to run. When looking at the details of that host, I see insane measured speed values; that's the cause. Rerun the benchmarks. If that doesn't change them to something more realistic, change them manually in client_state.xml to 1/100th of the current values and disable benchmarks in cc_config.xml; DCF will do the fine tuning.

I always thought that the integer and floating point benchmarks, which are determined by the BOINC client, transmitted with requests to the BOINC server and stored there for each host (updated regularly), are used by the server-side scheduler to calculate how many tasks to assign to a client that requests NNNN seconds of work.

This should be independent of any EXIT_TIME_LIMIT. Managing the amount of work assigned to different hosts on the server side and forcing faulty tasks or apps to terminate on the client side are two different things... based on my gut feeling.

I thought it is the project admins or project scientists who, for each science app they operate, make an educated guess: how many FPOPS will a task of this type require? (As Keith Myers points out, this requires that the client benchmarking is not broken.) These educated guesses are then configured in the workunit generator and lead to the fpops estimates and bounds in workunit headers. (It is surely more complex in reality, as different task parameters lead to different fpops/memory requirements.)

Keith Myers
Joined: 11 Feb 11
Posts: 4959
Credit: 18638928756
RAC: 5355359

The issue is with the work generator template values that the scientist/admin uses for the app tasks. That is the rsc_fpops_est value, or the GFLOPS estimated to be required to crunch the task.

But apparently there can also be an additional value in the template that I was unaware of until a discussion with the scientist/admin of Gaia@home, who has been dealing with severe exceeded time limit errors on 90% of their recent work units by 90% of the hosts attempting to crunch the work. There is also an app_speed value that affects whether a task is past the expected time limit to crunch.

Normally, the GFLOPS of a task is simply divided by the APR rate of the app on the host. That provides the estimated time remaining function and the progress rate in the client. The APR rate is developed from returned and validated tasks, based on the p_fpops and p_iops values that are calculated by the host benchmarks and stored in the client_state.xml file.

When those values are way out of line with the actual task crunching speeds, that is when the exceeded time limits cause errors. But this app_speed variable is another factor, it appears. From what I can figure out, it is a profile of the host output over a 24 hour day, since the value of 86400 keeps coming up in the stderr.txt outputs of the errored tasks.

So when benchmarks are way out of line, the chances of hitting the time exceeded errors are greatly increased.
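
A minimal sketch of that relationship, under stated assumptions: the estimated runtime is rsc_fpops_est divided by the speed the client assumes, the time limit is rsc_fpops_bound divided by the same speed, and the FPOPS estimate itself is roughly right. The fpops values are the ones from the BRP4 header quoted earlier in the thread; the two speed values are hypothetical and only chosen to illustrate the effect.

```python
RSC_FPOPS_EST = 17.5e12    # from the BRP4 workunit header quoted in this thread
RSC_FPOPS_BOUND = 350e12   # from the same header


def estimated_runtime_s(fpops_est: float, assumed_flops: float) -> float:
    """Runtime the client expects, given the speed it assumes for the app."""
    return fpops_est / assumed_flops


def exceeds_limit(fpops_est: float, fpops_bound: float,
                  assumed_flops: float, actual_flops: float) -> bool:
    """True if the real runtime is longer than the limit derived from the bound.

    Assumes limit = fpops_bound / assumed_flops and that fpops_est is accurate.
    """
    limit_s = fpops_bound / assumed_flops
    actual_runtime_s = fpops_est / actual_flops
    return actual_runtime_s > limit_s


# HYPOTHETICAL speeds, for illustration only.
inflated_speed = 4.86e9   # what a broken benchmark might make the client assume
true_speed = 2.0e8        # what the phone might really deliver on this app

print(estimated_runtime_s(RSC_FPOPS_EST, inflated_speed))   # about one hour
print(exceeds_limit(RSC_FPOPS_EST, RSC_FPOPS_BOUND, inflated_speed, true_speed))
# After dividing the assumed speed by 10, the same task fits within the limit.
print(exceeds_limit(RSC_FPOPS_EST, RSC_FPOPS_BOUND, inflated_speed / 10, true_speed))
```

With a bound of twenty times the estimate, the task only errors out once the assumed speed is more than twenty times the real one, which is the kind of mismatch wildly wrong benchmarks can produce.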

 

Scrooge McDuck
Joined: 2 May 07
Posts: 1042
Credit: 17764620
RAC: 12880

So, the takeaway from Link's and Keith's detailed posts (thanks!):

If tasks are aborted by the BOINC client because the time limit was reached, rerun the benchmarks and check whether the results are reasonable. If not, correct them manually.

Keith Myers
Joined: 11 Feb 11
Posts: 4959
Credit: 18638928756
RAC: 5355359

The scientist/admin needs to input rational values into the work unit generator template to allow 99% of all the hosts to successfully complete the task before deadline. They need to set rational deadline limits too for the majority of hosts running on their project.

Also, without 11 validated tasks being returned by the host, the host can never develop the endpoint APR rate of the host on the app. Without a proper APR rate, the host will never get the progress rate and estimated time remaining values correct. With nothing but exceeded time limit errors, the APR will never be developed or correct.

The solution to this problem is to manually edit the p_iops and p_fpops values in the client_state.xml file and shift the decimal point of the values there by one or two positions to make the host processing speed smaller. That will extend the extrapolated time limit permitted to complete the task.
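
As a rough illustration of that edit, here is a small sketch that divides the p_fpops and p_iops values in a client_state.xml-style fragment by 100. The sample fragment and its numbers are made up, the function name is hypothetical, and the regular expression assumes each value sits on one line between its opening and closing tag. It works on a string only; it does not touch any real file.

```python
import re

# Matches <p_fpops>NUMBER</p_fpops> or <p_iops>NUMBER</p_iops>.
BENCHMARK_TAG = re.compile(r"<(p_fpops|p_iops)>([0-9.eE+-]+)</\1>")


def scale_benchmarks(xml_text: str, divisor: float) -> str:
    """Return xml_text with the p_fpops and p_iops values divided by divisor."""
    def _scale(match: re.Match) -> str:
        tag = match.group(1)
        value = float(match.group(2))
        return f"<{tag}>{value / divisor:.6f}</{tag}>"

    return BENCHMARK_TAG.sub(_scale, xml_text)


# HYPOTHETICAL fragment with implausibly high benchmark values.
sample = """<host_info>
    <p_fpops>50000000000.000000</p_fpops>
    <p_iops>80000000000.000000</p_iops>
</host_info>"""

patched = scale_benchmarks(sample, 100)
print(patched)
```

Shifting by two decimal positions corresponds to a divisor of 100; a divisor of 10 would be the one-position variant.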

If you have never run the benchmarks on the client, then the client uses default values for both floating point and integer performance. Both are set at 1B ops/sec. Those values are pretty safe for all older and current hardware.

From https://github.com/BOINC/boinc/blob/26ce93e0b1f358b501b389b7e790b781945562b4/html/inc/host.inc#L152-L158

[Edit]

The only other ways to solve the problem are to convince the scientist to change his template input values OR to manually edit each task's input or property values in the client_state.xml file. Manually changing the task property values is impossibly tedious for every task, and one incorrect edit will corrupt the file and dump all your cache, so that option is not recommended.

It can, however, be done for a single task in offline mode, for testing or experiments to determine what the correct rsc_fpops_est value should be. This involves copying a task to a test directory along with the client, editing the input file to new values, and then manually running the client in standalone mode in a terminal to see whether the task completes normally or not.

 
