RE: Is any progress on this problem being made (or attempted)?
We have been working on this on and off for about a month now.
Under certain conditions (64-bit OS, recent NVidia drivers) the Client writes something into init_data.xml that causes a floating-point exception when the App tries to read it. This actually points to two problems, both in BOINC code:
Steffen just tracked down what causes the BOINC Client to write some "trash" into the file. I hope that this will be fixed in the next BOINC Core Client that is released.
The second problem is that the BOINC API that gets linked into the App isn't robust enough to deal with that "trash". I think this is a serious problem and would like to get it fixed, too, e.g. so that this App could run with older Clients. But currently I can't work on this myself (and it's not code I'm responsible for anyway).
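To illustrate what "robust enough" could mean here, the following is a minimal Python sketch of defensive parsing for a numeric field read from a file such as init_data.xml. It is not the BOINC API code (which is C++), and the thread does not say what the "trash" looks like, so the helper name `parse_finite_float` and the choice to fall back to a default value are assumptions made purely for illustration.

```python
import math


def parse_finite_float(text, default):
    """Return float(text) if it parses to a finite number, else `default`.

    Hypothetical helper: the real "trash" written by the Client is not
    described in this thread, so this only guards against the generic
    cases (unparsable text, NaN, infinity, missing value).
    """
    try:
        value = float(text.strip())
    except (ValueError, AttributeError):
        # Unparsable text, or `text` is None because the field was missing.
        return default
    # NaN and +/-inf parse successfully but are unsafe in later arithmetic.
    return value if math.isfinite(value) else default
```

The point of the sketch is that the reader validates every value before using it in arithmetic, so a malformed file degrades to defaults instead of crashing the App.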
Until these issues are fixed properly, any of the following workarounds should help:
- Use an older BOINC Core Client (I think 6.10.58 should work)
- If possible, run the 32-bit version of your current BOINC Client
- Try an older NVidia driver (would be nice if anyone could report here which does work)
BM
This problem should be fixed with the S6LV1 App 1.11 released today.
Note that the S6BucketA tasks are being phased out anyway, so I won't build new Apps for these old tasks.
BM
Sounds good, thanks.
I suspended 1.10 work on two hosts just to push a 1.11 task through early, not really expecting to see anything. On one host, that task ran a normal amount of time to complete, but generated error 14.
I'll quote what seems the most interesting bit of the stderr_txt:
[pre]2012-04-12 10:14:50.0065 (6112) [debug]: Successfully read checkpoint:676
% --- Cpt:676, total:676, sky:14/13, f1dot:1/52
2012-04-12 10:14:50.0065 (6112) [normal]: Finished main analysis.
2012-04-12 10:14:50.0065 (6112) [normal]: Recalculating statistics for the final toplist... 2012-04-12 10:14:50.0065 (6112) [CRITICAL]: Required frequency-bins [563193, 563208] not covered by SFT-interval [563441, 563859]
[Parameters: alpha:0, Dphi_alpha:5.632001e+005, Tsft:1.800000e+003, *Tdot_al:1.000023e+000]
XLAL Error - LocalXLALComputeFaFb (/home/bema/EinsteinAtHome/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/FDS_isolated/OptimizedCFS/LocalComputeFstat.c:553): Input domain error
LocalXALComputeFaFb() failed
Error[1] 5: function LocalComputeFStat, file /home/bema/EinsteinAtHome/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/FDS_isolated/OptimizedCFS/LocalComputeFstat.c, line 337, $Id$
ABORT: XLAL function call failed
XLALComputeExtraStatsSemiCoherent, line 363 : Failed call to LAL function ComputeFStat(). statusCode=5
XLAL Error - XLALComputeExtraStatsSemiCoherent (/home/bema/EinsteinAtHome/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/LineVeto.c:364): Internal function call failed: Input domain error
Error in function XLALComputeExtraStatsForToplist, line 223 : Failed call to XLALComputeLineVetoSemiCoherent().
XLAL Error - XLALComputeExtraStatsForToplist (/home/bema/EinsteinAtHome/EinsteinAtHome/source/lalsuite/lalapps/src/pulsar/GCT/LineVeto.c:224): Internal function call failed: Input domain error
MAIN line 1790 : XLALComputeLineVetoForToplist() failed with xlalErrno = 1057.
2012-04-12 10:14:50.0377 (6112) [CRITICAL]: ERROR: MAIN() returned with error '14'
FPU status flags: PRECISION
2012-04-12 10:14:50.0377 (6112) [normal]: done. calling boinc_finish(14).10:14:50 (6112): called boinc_finish[/pre]
This host does not usually generate errors. But perhaps this was just random chance. Or possibly the host somehow does not like application 1.11, or maybe I've got something fouled up on the system. Or, just maybe, application 1.11 has an undesired sensitivity of some kind. I'll post here in case someone else sees such a thing.
My quorum partner had no difficulty, though that one is a 64-bit Win7 host, while mine is a 32-bit Win XP Pro--and doubtless differs in a hundred other ways as well.
I have more 1.11 work in stock on this host, and will push some more through now instead of letting it wait its turn.
We get a couple of these "Input domain error"s each day. So far I haven't been able to reproduce any of them. As you can see from the successful completions by your wingmen, this doesn't seem to be a problem with the data or the application by itself. I also haven't yet found a pattern in where this problem occurs (OS, CPU vendor etc.). Do you get these errors repeatedly on your computer, or just occasionally?
In any case it looks like the original problem (floating-point exception in app_ipc.cpp) that the 1.11 Apps are meant to solve is indeed fixed.
BM
RE: Do you get these errors repeatedly on your computer, or just occasionally?
So far as I am aware, this is the first error of this particular type I've seen on this computer, or any other. At the time I wrote the original post, I placed 1.10 work on suspend on the offending host. Since then it appears to have completed a dozen 1.11 tasks without further error, ten of which have already validated against a wingman. So I don't appear to have a usefully repeating issue on this host.
FWIW this "Input domain error" appears to be a checkpoint problem. Whenever a task is interrupted during "Recalculating statistics for the final toplist..." and has written a checkpoint right before this, the next restart will end with such an error. This is under investigation right now; we'll release a new app version as soon as this issue has been fixed.
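For anyone who wants to check their own stderr output for this situation, here is a small Python sketch that looks for a checkpoint line like the one in the log quoted earlier in this thread (`Cpt:676, total:676`). It is only a diagnostic aid, not part of the App. The function name `restarted_at_final_stage` is made up for this example, and reading "Cpt equals total" as "the task resumed from a checkpoint written right before the final toplist stage" is an assumption based on the description above.

```python
import re

# Matches the progress line printed after a checkpoint is read, e.g.
# "% --- Cpt:676, total:676, sky:14/13, f1dot:1/52" (format taken from the
# log quoted in this thread).
CHECKPOINT_RE = re.compile(r"Cpt:(\d+),\s*total:(\d+)")


def restarted_at_final_stage(stderr_text):
    """Return True if any checkpoint line reports Cpt equal to total.

    Assumption: Cpt == total means the main analysis was already complete
    when the checkpoint was written, so the restart went straight into the
    final toplist recalculation.
    """
    for match in CHECKPOINT_RE.finditer(stderr_text):
        done, total = int(match.group(1)), int(match.group(2))
        if done == total:
            return True
    return False
```

Pasting the stderr text of a failed task into this function gives a quick yes/no on whether it fits the pattern described above.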
BM