Nightly checks for RecExCommission

13.2.0.Y branch

  • Check nightly compiled at NICOS page here
  • Check ATN tests ran OK (both data and Sim) from here
  • If a test failed, look at the logfile and figure out why
  • Check the RTT tests (both data and sim). I look on AFS at:
/afs/cern.ch/atlas/project/RTT/Results/rel_3/13.2.0.Y/build/i686-slc4-gcc34-opt/point1/RecExCommission/AthenaTestingRecExCommission/RecExCommission_ATN/1/RecExCommission_ATN1_log
/afs/cern.ch/atlas/project/RTT/Results/rel_3/13.2.0.Y/build/i686-slc4-gcc34-opt/point1/RecExCommission/AthenaTestingRecExCommissionSim/RecExCommission_ATN_sim/2/RecExCommission_ATN_sim2_log
  • The ROOT files and the perfmon PDF file are written to the same directories. They are also accessible on the web at:
here (data) here (sim)
  • We are currently trying to get the RTT monitoring.root histograms displayed on the web in the same way as the M6 histograms were. This isn't quite working yet, but when it is the histograms will be displayed here
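
The RTT log paths listed on this page appear to follow a common pattern, so they can be built rather than typed by hand. The following is a minimal Python sketch under that assumption. The function rtt_log_path and its parameter names (job_group, job_name, job_index and so on) are made up for illustration and are not part of RTT. The job index differs between branches (1 and 2 in 13.2.0.Y, 0 and 1 in 14.0.10.Y), so it has to be passed in.

```python
import posixpath

# Base directory taken from the paths quoted on this page.
RTT_RESULTS = "/afs/cern.ch/atlas/project/RTT/Results"


def rtt_log_path(branch, job_group, job_name, job_index,
                 release="rel_3",
                 platform="i686-slc4-gcc34-opt",
                 project="point1",
                 package="RecExCommission"):
    """Build the AFS path of an RTT job log.

    The layout is inferred from the example paths on this page;
    it is an assumption, not a documented RTT convention.
    """
    log_name = "{0}{1}_log".format(job_name, job_index)
    return posixpath.join(RTT_RESULTS, release, branch, "build", platform,
                          project, package, job_group, job_name,
                          str(job_index), log_name)


if __name__ == "__main__":
    # Data and sim jobs for the 13.2.0.Y branch.
    print(rtt_log_path("13.2.0.Y", "AthenaTestingRecExCommission",
                       "RecExCommission_ATN", 1))
    print(rtt_log_path("13.2.0.Y", "AthenaTestingRecExCommissionSim",
                       "RecExCommission_ATN_sim", 2))
```

Changing the release argument (rel_3 in the paths above) should give the log for another nightly, assuming the directory layout is the same.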

14.0.10.Y branch

  • Check nightly compiled at NICOS page here
  • Check ATN tests ran OK (both data and Sim) from here
  • If a test failed, look at the logfile and figure out why
  • Check the RTT tests (both data and sim). I look on AFS at:
/afs/cern.ch/atlas/project/RTT/Results/rel_3/14.0.10.Y/build/i686-slc4-gcc34-opt/point1/RecExCommission/AthenaTestingRecExCommission/RecExCommission_ATN/0/RecExCommission_ATN0_log
/afs/cern.ch/atlas/project/RTT/Results/rel_3/14.0.10.Y/build/i686-slc4-gcc34-opt/point1/RecExCommission/AthenaTestingRecExCommissionSim/RecExCommission_ATN_sim/1/RecExCommission_ATN_sim1_log
  • If the RTT failed I usually run it by hand in the gencomm w0/jboyd directory in the atlasgeneral queue, turning off the system that caused the crash, to see if everything else works OK.
  • I also send an email to the person responsible for the crash asking them to fix it.
  • The ROOT files and the perfmon PDF file are written to the same directories. They are also accessible on the web at:
here (data) here (sim)
  • The web display for the 14.0.10.Y histograms won't work until after the 13.2.0.Y one is working.

14.1.0 branch

  • We also have the ATN and RTT tests running in dev and devval.
  • Recently these have always failed. I normally send an email to the person I think is responsible asking when it will be fixed (the problem is moving around, so maybe we are converging???).

Current status of nightlies

13.2.0.Y

  • Data reco runs fine
  • Sim reco crashes due to L1Calo abort in first event
retrieve(const): No valid proxy for object JetElements  of type DataVector<LVL1::JetElement>(CLID 6203)
JetTrigger                                                  WARNING Failed to load any JetElements - abort processing
AthenaEventLoopMgr                                             INFO Execution of algorithm JetTrigger failed with StatusCode::FAILURE
  • If you turn L1Calo off it crashes due to MuonMon when the run number changes (Nectorias contacted)
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms being booked
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms booked
FATAL 2008-Apr-07 17:25:47 [static void ers::ErrorHandler::SignalHandler::action(int, siginfo_t*, void*) at ers/src/ErrorHandler.cxx:88] Got signal 11 Segmentation fault (invalid memory reference)

14.0.10.Y

  • Data reco crashes from HLT Mon (experts informed)
ToolSvc.BSMon                                               WARNING Found BCID offset of 4 between ROD Header (890) and data (886)
ToolSvc.BSMon                                               WARNING CTP error status word #0: 0x1
FATAL 2008-Apr-10 09:50:36 [static void ers::ErrorHandler::SignalHandler::action(...) at ers/src/ErrorHandler.cxx:88] Got signal 11 Segmentation fault (invalid memory reference)
  • With HLT turned off it runs OK.
  • Sim reco crashes due to L1Calo
retrieve(const): No valid proxy for object JetElements  of type DataVector<LVL1::JetElement>(CLID 6203)
JetTrigger                                                  WARNING Failed to load any JetElements - abort processing
AthenaEventLoopMgr                                             INFO Execution of algorithm JetTrigger failed with StatusCode::FAILURE
  • If you turn L1Calo off it crashes due to MuonMon when the run number changes (Nectorias contacted)
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms being booked
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms booked
FATAL 2008-Apr-07 17:25:47 [static void ers::ErrorHandler::SignalHandler::action(int, siginfo_t*, void*) at ers/src/ErrorHandler.cxx:88] Got signal 11 Segmentation fault (invalid memory reference)
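
The crash signatures quoted in this status section (missing proxy, failed algorithm, signal 11) could be picked out of a logfile automatically. Below is a rough Python sketch, assuming the messages look exactly like the excerpts above. The pattern names and the function find_crash_signatures are hypothetical, and the list of patterns is surely incomplete.

```python
import re

# Patterns copied from the log excerpts on this page. The dictionary keys
# are made-up labels, not Athena terminology.
CRASH_PATTERNS = {
    "missing_proxy": re.compile(r"No valid proxy for object (\S+)"),
    "failed_algorithm": re.compile(
        r"Execution of algorithm (\S+) failed with StatusCode::FAILURE"),
    "signal": re.compile(r"Got signal (\d+) Segmentation fault"),
}


def find_crash_signatures(log_text):
    """Return {label: [captured values]} for every pattern that matches.

    Whitespace (including line breaks) is collapsed first, so messages
    that were wrapped over several lines are still found.
    """
    flat = " ".join(log_text.split())
    found = {}
    for label, pattern in CRASH_PATTERNS.items():
        matches = pattern.findall(flat)
        if matches:
            found[label] = matches
    return found


if __name__ == "__main__":
    import sys
    # Usage: python find_crashes.py <logfile>
    with open(sys.argv[1]) as log_file:
        for label, values in find_crash_signatures(log_file.read()).items():
            print(label, values)
```

Run on the sim reco excerpt above, this should report JetElements as the missing object and JetTrigger as the failed algorithm.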

dev

  • ???

RecExCommission developments

Short timescale things to do

  • Add these changes to the 14 version of RecExCommission (I already added them to the 13 version)
        * turn off HLT if reading from ESD or filteredESD
        * change ReadBSusingTag to use COMCOND-003-00
        * add Thijs to RTT/ATN mails
        * add DQWebDisplay to RTT test
  • Make RecExCommission_Tier0.py setup for muon week.
  • Fix the RTT DQWebDisplay runner so we get the monitoring histograms from RTT displayed on web
  • Warnings when compiling SCTMon (yellow on the NICOS page). I emailed Martin about this.
  • Add new TAG for TileCosmicMuon->GetFitQuality() > 0 (according to Jose this is a better TileMuon fitter candidate)
  • Revert Hong's changes to RecExCommission_topOptions (in 13 branch only), which change what tags are written depending on which detectors are enabled (this removes some warnings but changes the shape of the TAG file)
  • Get RecExCommission job transform running for M6 reprocessing (Luis)
  • Test running using TAG from ESD or RAW data. See David Malon's wiki on this here. I tested running with the tag on RAW data and it seemed to work OK.
  • Is the Muon ESD problem fixed??
  • Check that all JO and settings in the 14 branch of RecExCommission are the same as in the 13 version. Some JO are missing (e.g. RecExCommission_ReadBSusingTag.py is not in the 14 version) and some have different settings. This is important to sort out **
  • On this note, the tag in 14.0.10 doesn't contain TileMuonFitter or L1 bits. This looks like a JO issue in RecExCommission_topOptions.py
  • CTP crashes if CTPFlags.doMuRIO = True (this is set to False in RecExCommission_topOptions.py to stop a memory problem. Daniel Sherman is hopefully following up on this)
  • ERROR messages / WARNINGs
In 14, many, many WARNINGs from L1Calo on every event:
TriggerTowerMaker                                           WARNING Index for sinThetaHas is invalid. Ieta: 51
TriggerTowerMaker                                           WARNING calib: calclate index for  eta value: 1.9125wrong: 4

In sim, many, many WARNINGs from muons on every event:
DataProxy                                                   WARNING accessData:  IConversionSvc ptr not set

In everything (sim/data, 13/14):
ToolSvc.RPCRawDataValAlg                                      ERROR  Cannot retrieve the RPC cluster container

  • Crash in simulation (when the run number changes) from muon monitoring (Nectorios informed), like:
accessData:  IConversionSvc ptr not set
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms being booked
ToolSvc.TGCRawDataValAlg                                       INFO TGC RawData Monitoring Histograms booked
FATAL 2008-Apr-09 08:23:14 [static void ers::ErrorHandler::SignalHandler::action(...) at ers/src/ErrorHandler.cxx:88] Got signal 11 Segmentation fault (invalid memory reference)
  • Unchecked Status Codes:
 Num | Function                       | Source Library
-----+--------------------------------+------------------------------------------
   2 | CscRdoByteStreamTool<ROBData_T<eformat::ROBFragment<unsigned int const*>, unsigned int const*>, CscRawDataCollection>::interfaceID(CscRawDataCollection*) | libMuonByteStream.so
9041 | RpcClusterBuilderPRD::push_back(Muon::RpcPrepData*) | libRpcClusterization.so
  • Understand the timing of when RTT tests finish. It's important they finish early enough that they can be used to determine if a release is OK to go to P1 / Tier0 that morning (not always the case).
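
The unchecked status code table in the list above is easy to rank by count, which shows quickly which library to chase first. This is a small Python sketch, assuming the table keeps the "Num | Function | Source Library" layout shown above. The function name parse_unchecked_status_codes is made up for illustration.

```python
def parse_unchecked_status_codes(table_text):
    """Parse the 'Num | Function | Source Library' table from a job log.

    Returns a list of (count, function, library) tuples, largest count
    first. Header and separator lines are skipped. The column layout is
    assumed to match the excerpt on this page.
    """
    rows = []
    for line in table_text.splitlines():
        if line.count("|") < 2:
            continue  # separator line or unrelated text
        # Split on the first and last '|' only, since the function
        # signature in the middle column can be long and messy.
        count, rest = line.split("|", 1)
        function, library = rest.rsplit("|", 1)
        if not count.strip().isdigit():
            continue  # header line
        rows.append((int(count), function.strip(), library.strip()))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


if __name__ == "__main__":
    example = """Num | Function                       | Source Library
----+--------------------------------+------------------------------------------
   2 | SomeTool::interfaceID(SomeCollection*) |libMuonByteStream.so
9041 | RpcClusterBuilderPRD::push_back(Muon::RpcPrepData*) |libRpcClusterization.so"""
    for count, function, library in parse_unchecked_status_codes(example):
        print(count, library, function)
```

In the example the first row uses a shortened, made-up function name to keep the line readable. The second row is copied from the table above.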

Longer timescale things to do

I will write a wiki on this and put a link here
  • Run valgrind over the job to look for memory corruption
  • Add new tags, e.g. SP multiplicity, Muon occupancy, total Calo energy, ???

Things to look out for

  • ERRORs / FATALs / WARNINGs
  • Unchecked SCs
  • Memory leaks (run perfmon over the same events more than once to avoid caching effects)
  • Large CPU usage on some events...
  • Does it crash if it runs on 0 events? (SCTMon used to??)
  • Does it crash if it runs on different runs in the same job? This is needed for tag running and simulation.
  • Can we run with all systems on, on data with just one system in??
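
For the first item in the list above, a per-component count of WARNING/ERROR/FATAL messages makes the "many, many warnings on every event" offenders stand out. Below is a minimal Python sketch, assuming each message is a single line of the form "Source LEVEL text" as in the excerpts on this page. The function name count_messages is hypothetical. The "FATAL <date> [...]" lines written by the ers signal handler do not follow this layout and are not counted here.

```python
import re
from collections import Counter

# One Athena message per line: source name, whitespace, level, message.
MESSAGE_RE = re.compile(r"^(\S+)\s+(WARNING|ERROR|FATAL)\b")


def count_messages(log_lines):
    """Count WARNING/ERROR/FATAL messages per (source, level) pair."""
    counts = Counter()
    for line in log_lines:
        match = MESSAGE_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python count_messages.py <logfile>
    with open(sys.argv[1]) as log_file:
        for (source, level), n in count_messages(log_file).most_common(20):
            print("%8d  %-8s %s" % (n, level, source))
```

Comparing the counts from two nightlies is a quick way to see whether a new tag added or removed a source of noisy messages.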

List of experts for different bits of code

-- JamieBoyd - 09 Apr 2008

Topic revision: r4 - 2008-04-10 - JamieBoyd
 