Debug Stream

Data Streaming

Four streams of data are generated based on the trigger decision:

* The physics stream, which contains all the data that will be used for physics analyses; the data is divided into luminosity blocks;
* The calibration stream, which contains partially built events delivering the minimum amount of information needed for detector calibrations; this allows higher rates than the physics stream;
* The express stream, which contains full events for fast reconstruction for monitoring and data quality purposes; and
* The debug stream, which contains events for which the trigger was not able to make a decision, because those events caused failures in some part of the online system.

Debug sub-streams

* L2ForcedAccept: contains L2 crashes or time-outs; the event does not contain an L2 result, and reprocessing it usually leads to recovery;
* EFD: contains EF crashes or time-outs; the event does not contain an EF result and may or may not contain an L2 result; reprocessing usually leads to recovery;
* HLT_ERROR: contains algorithm errors at L2 or EF; the event may or may not contain L2 or EF results.

Debug Stream Handling

The main purpose of dealing with the debug stream is to identify problems with the trigger system as soon as possible and to reduce the turn-around time for fixing these problems. To achieve that, a procedure is applied to all debug stream events of each run, once the run is finished or more frequently. The goal is to achieve quasi real-time handling of the debug stream, by treating it on a file-by-file basis.

Note: This is work-in-progress so what is described in this wiki page is not finalized!

Procedure

* Summarize the types of errors and, according to them, write files with event samples (Error Analysis on debug stream events)
* Reprocess all runs with L2ForcedAccept or EFD events (Error Recovery)
* Summarize the trigger information of the recovered events (Error Analysis on recovered events)
* Report on algorithm and dataflow problems
* Notify the experts

A framework has been developed to automate this procedure.
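The sketch below illustrates how such an automated per-run handler could be organized. It is only an illustration: the framework object and all of its method names are hypothetical placeholders, not the actual implementation described on this page.

def handle_run(framework, run_number):
    """Hypothetical per-run driver following the procedure above."""
    debug_files = framework.input_files(run_number)   # castor paths from the SFO-Tier0 DB

    # 1. Error analysis on the debug stream events, with per-error event samples
    error_summary = framework.analyze_errors(debug_files)
    framework.write_event_samples(error_summary)

    # 2. Error recovery: reprocess the L2ForcedAccept and EFD events
    recovery_files = [f for f in debug_files if f.substream in ('L2ForcedAccept', 'EFD')]
    recovered = framework.reprocess(recovery_files)

    # 3. Error analysis on the recovered events
    recovered_summary = framework.summarize_trigger_info(recovered)

    # 4./5. Report algorithm and dataflow problems and notify the experts
    framework.report(error_summary, recovered_summary)
    framework.notify_experts(error_summary, recovered_summary)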

The framework

The Debug Stream (DS) handling has two main functionalities: the analysis of the DS events and the recovery of the L2ForcedAccept and EFD events. These functionalities share common sources of information, and both feed reports on algorithm and dataflow problems. All the handling is done within a framework that treats the analysis and the recovery as two different use cases. The principle of the framework structure is pictured below.


framework.png

The input configuration provides the manager with the input file castor paths (by performing queries in the SFO-Tier0 DB) as well as the trigger configuration (HLT/TDAQ release, SM Key and L1/HLT prescale keys, information that comes from the Trigger DB). It also generates the output paths where all information is summarized. The output castor paths, where the recovered files and error event samples are stored, are defined there as well. The recovered files are saved in the recovered stream, in /castor/cern.ch/grid/atlas/DAQ/recovery/$year/$run_number. The error event samples are stored temporarily in /afs/cern.ch/t/trigcomm/$year/$run_number/, where they can be accessed by the experts for further studies.
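For illustration, the two output locations quoted above can be derived from the year and run number alone; the small helper below is a sketch, not part of the actual input configuration code.

def output_paths(year, run_number):
    """Build the output locations used for one run (layout as quoted above)."""
    castor_recovery = '/castor/cern.ch/grid/atlas/DAQ/recovery/%s/%s' % (year, run_number)
    afs_samples = '/afs/cern.ch/t/trigcomm/%s/%s/' % (year, run_number)
    return castor_recovery, afs_samples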

The analysis and recovery jobs are submitted to the CAF, in the atlastrig queue, which is dedicated to HLT operations. The resulting information is presented on a web server. Summary plots from the analysis of the debug stream errors and from the recovered events are displayed using root2html. We also provide summary log files and further information that is explained later.

A summary of the input and output flow of data and information is given below:


infoFlow.png

Error Analysis

The Error Analysis is done using the information in the event header, which is always present no matter what happened in the event. For the cases where the event data (L2 and/or EF data) is saved in the event, the initBits of the data are analyzed together with the L2/EF results.

The event data is stored in the so-called e-format, described in this EDMS document.

With the information from the event header and/or data, plots summarizing the errors and other characteristics of the debug stream events are made. We also write out event samples according to the error that occurred in the event. The event samples contain a handful of events for the experts to study.
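A minimal sketch of how such per-error event samples can be collected is given below; the classify callable stands for any rule mapping an event to an error label (for instance, derived from the status words described later) and is not the actual implementation.

from collections import defaultdict

def collect_error_samples(events, classify, max_per_error=10):
    """Group debug stream events by error type and keep a handful of each
    as an expert sample. 'classify' is any callable mapping an event to an
    error label."""
    samples = defaultdict(list)
    for ev in events:
        err = classify(ev)
        if len(samples[err]) < max_per_error:
            samples[err].append(ev)
    return samples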

Event header

The header of the event contains the run number, the stream tag, the lvl1_id, the global_id, the lumiblock number, etc.

The global_id is a counter assigned by the Data Flow Manager to every L2 selected event. It is a monotonically increasing number within the run (for the events for which the L2 selection was successful).

The lvl1_id corresponds to the "extended lvl1 ID". It is a 32-bit word composed of two parts: a 24-bit L1 ID, which is formed in the TTCs, and an 8-bit ECR ID (Event Counter Reset ID), which is implemented in the ROD.
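The two parts can be unpacked from the 32-bit word as sketched below, assuming the ECR ID occupies the 8 most significant bits and the L1 ID the 24 least significant ones:

def decode_extended_lvl1_id(lvl1_id):
    """Split the extended lvl1 ID into its 24-bit L1 ID and 8-bit ECR ID."""
    l1_id = lvl1_id & 0xFFFFFF         # lower 24 bits, formed in the TTCs
    ecr_id = (lvl1_id >> 24) & 0xFF    # upper 8 bits, the Event Counter Reset ID
    return l1_id, ecr_id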

initBits

They are the first 15 bits of the L2 or EF data and contain the HLT result and other information.

The initBits (defined here) are listed below:

enum InitBits { IndHLTResultClassVersion = 0,
                IndEventNumber,          //!< event number (from EventInfo::EventID::event_number())
                IndHLTDecision,          //!< HLT decision (== 0 if event has been rejected at HLT)
                IndPassThrough,          //!< has the event been forced (passed through)
                IndHLTStatus,            //!< HLT status corresponding to ErrorCode enums
                IndLvlConverterStatus,   //!< LvlConverter status corresponding to ErrorCode enums
                IndHLTLevelInfo,         //!< the HLT level
                IndNumOfSatisfiedSigs,   //!< number of satisfied signatures
                IndErrorInChain,         //!< chain ID in which the error occurred, in normal conditions should be 0
                IndErrorInStep,          //!< step number in which the error occurred, in normal conditions should be 0
                IndCreatedOutsideHLT,    //!< also an error identifier
                IndHLTResultTruncated,   //!< the serialize function could not fit everything into the given max data_size
                IndConfigSuperMasterKey, //!< configuration key for the menu
                IndConfigPrescalesKey,   //!< configuration key for prescales
                IndNumOfFixedBit         //!< total number of fixed bits
};
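For illustration, the sketch below reads a few of these fixed words from a serialized HLT result, treated simply as a list of 32-bit words; using the enum values directly as list indices is an assumption made here.

# Indices copied from the InitBits enum above
IndHLTDecision          = 2
IndHLTStatus            = 4
IndConfigSuperMasterKey = 12
IndConfigPrescalesKey   = 13

def read_fixed_words(hlt_result_words):
    """Pick a few of the fixed words out of a serialized HLT result."""
    return {'hlt_decision': hlt_result_words[IndHLTDecision],
            'hlt_status':   hlt_result_words[IndHLTStatus],
            'smk':          hlt_result_words[IndConfigSuperMasterKey],
            'prescale_key': hlt_result_words[IndConfigPrescalesKey]}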

L2 Result (3 status words)

* 1st Word: OK or Error
* 2nd Word: Type of HLT result produced

HLT_L2_Status_Names = ['Normal Lvl2',
                       'Dummy Lvl2',
                       'Normal Truncated',
                       'Dummy Truncated',
                       'New Status1',
                       'New Status2']

* 3rd Word: Type of PSC result produced

PSC_L2_Status_Names = ['Error_Unclassified',
                       'NO_L1_Result',
                       'SG_Clear_Failed',
                       'No_Event_Info',
                       'No_L2_Found',
                       'No_L2_Retrieved',
                       'Invalid_CPT_Result']
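A possible decoding of the three L2 status words into readable labels, reusing the HLT_L2_Status_Names and PSC_L2_Status_Names lists above and assuming the 2nd and 3rd words are plain indices into them:

def decode_l2_status(words):
    """Translate the three L2 status words into readable labels.
    The 1st word is passed through, since its exact OK/Error encoding
    is not spelled out on this page."""
    def lookup(names, value):
        return names[value] if 0 <= value < len(names) else 'Unknown(%d)' % value
    return (words[0],
            lookup(HLT_L2_Status_Names, words[1]),
            lookup(PSC_L2_Status_Names, words[2]))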

EF Result (3 status words)

* 1st Word: Indicates whether the fragment is valid
* 2nd Word: Always set to 0
* 3rd Word: Type of PSC result produced

EF_Status_Word = ['INVALID_FULL_EVENT_FRAGMENT',
                  'MIN_TRIG_EVENT_LOOP_MGR',
                  'NO_EVENT_INFO',
                  'NO_HLT_RESULT',
                  'NO_EF_TRIG_INFO',
                  'NO_STREAM_TAG',
                  'NO_EF_ROB_FRAG',
                  'DUMMY_EVENT_INFO',
                  'DUMMY_HLT_RESULT',
                  'DUMMY_STREAM_TAG'] 

HLT Result (3 status words)

* 1st Word: Action Taken

actionNames=['CONTINUE','ABORT_CHAIN','ABORT_EVENT','ABORT_JOB']

* 2nd Word: Reason

reasonNames=['UNKNOWN=0','MISSING_FEATURE','GAUDI_EXCEPTION',
             'EFORMAT_EXCEPTION','STD_EXCEPTION',
             'UNKNOWN_EXCEPTION','NAV_ERROR',
             'MISSING_ROD','CORRUPTED_ROD',
             'TIMEOUT','BAD_JOB_SETUP',
             'USERDEF_1','USERDEF_2',
             'USERDEF_3','USERDEF_4','USERDEF_5']

* 3rd Word: Internal

internalNames=['UNKNOWN=0','NO_LVL1_ITEMS',
               'NO_LVL2_CHAINS','NO_LVL1_RESULT',
               'WRONG_HLT_RESULT','NO_HLT_RESULT',
               'ALGO_ERROR','TIMEOUT','BAD_JOB_SETUP']
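A similar sketch for the HLT result, reusing the actionNames, reasonNames and internalNames lists above and assuming each word is a simple index into its list:

def decode_hlt_status(words):
    """Map the three HLT result status words onto their names."""
    def lookup(names, value):
        return names[value] if 0 <= value < len(names) else 'UNKNOWN(%d)' % value
    return (lookup(actionNames, words[0]),      # action taken
            lookup(reasonNames, words[1]),      # reason
            lookup(internalNames, words[2]))    # internal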


Event Recovery

athenaMT and/or athenaPT are run on the debug stream files with L2ForcedAccept or EFD events. We run the trigger from the TriggerDB, using the online configuration of the corresponding run.

The trigger configuration (SM Key, L1/HLT prescale keys and HLT release used in the run) is stored in COOL and is retrieved using the AtlCoolTrigger.py tool.

The recovery is done by running the trigger on the files stored on CASTOR. Each file contains a large number of events, and its reprocessing may fail if isolated problematic events inside the file cause new crashes and timeouts. For the time being, this problem is handled by isolating these events: when a problematic event is found, the file is split in two and the problematic event is stored for further investigation. The following diagram shows the principle of the file splitting.


splitting.png
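A sketch of the splitting logic is given below; run_trigger and set_aside are hypothetical stand-ins for the athenaMT/athenaPT reprocessing step and for the book-keeping of problematic events.

class BadEvent(Exception):
    """Signals that reprocessing failed on the event at position 'index'."""
    def __init__(self, index):
        Exception.__init__(self)
        self.index = index

def recover_events(events, run_trigger, set_aside):
    """Reprocess 'events'; when a problematic event is hit, set it aside
    and recurse on the two halves around it.
    run_trigger(events) returns the recovered events or raises BadEvent;
    set_aside(event) stores the bad event for further investigation."""
    if not events:
        return []
    try:
        return run_trigger(events)
    except BadEvent as err:
        set_aside(events[err.index])
        left = recover_events(events[:err.index], run_trigger, set_aside)
        right = recover_events(events[err.index + 1:], run_trigger, set_aside)
        return left + right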

Important notice: The events that are recovered are saved in the recovery stream. They will probably be processed on the Tier0 as a dataset separate from the physics dataset. There is still ongoing discussion on that matter.

Information produced by the debug stream processing

* Plots summarizing the debug stream errors, e.g.:


egPlots.png

* Plots summarizing trigger information of the recovered debug stream events (athena expert-monitoring.root), e.g.:


egPlots2.png

* Run summary and event history book-keeping for the reprocessed events.
* Event samples with specific errors, temporarily stored at /afs/cern.ch/t/trigcomm/$year/$run_number/. The events that failed reprocessing are stored in the same place.
* Recovered events, stored at /castor/cern.ch/grid/atlas/DAQ/recovery/$year/$run_number/.

Recovered Stream integration issues

Useful links

* Streaming at HLT
* How to run from the triggerDB

Contacts

* Hegoi Garitoanandia, Nikhef
* Anna Sfyrla, UIUC
* Sander Klous, Nikhef
* Brian Petersen, CERN





-- AnnaSfyrla - 24 Aug 2008