ATLAS DQ2 Get / Put Monitoring

This page documents the development of DQ2 get/put transfer monitoring within the ATLAS DDM Dashboard.

This feature has now been fully implemented. See https://savannah.cern.ch/bugs/index.php?101933

Overview

The proposal is to include dq2-get/put transfer statistics into the ATLAS DDM Dashboard 2.0. http://dashb-atlas-data.cern.ch/ddm2/

The idea is that dq2-get/put statistics should be stored and presented in the same way as subscription-based transfer statistics currently are in the DDM Dashboard (see the following section). However, initial discussions with DDM developers indicate that it would be neither practical nor desirable to store and process dq2-get/put transfer events directly in the DDM Dashboard; rather, the statistics will be retrieved from existing DDM systems where these events are already stored.

Context: subscription-based transfers

For subscription-based transfers, site services post HTTP requests to the DDM Dashboard in real time, notifying it of transfer events for each individual file and dataset. The DDM Dashboard stores these events in an Oracle database, and every 10 minutes Oracle stored procedures process the new events. These procedures identify which 10-minute time bins are touched by the new events and calculate statistics for those bins. Further procedures aggregate these statistics into daily time bins.
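The grouping logic performed by those stored procedures can be sketched in Python. This is only an illustration of the binning and aggregation, not the actual Oracle PL/SQL implementation; the field names are taken from the t_stats_file schema shown in this page, but the event dictionary layout is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BIN_MINUTES = 10

def period_end_time(t: datetime) -> datetime:
    """Return the end of the 10-minute bin [start, end) containing t."""
    start = t.replace(minute=(t.minute // BIN_MINUTES) * BIN_MINUTES,
                      second=0, microsecond=0)
    return start + timedelta(minutes=BIN_MINUTES)

def aggregate(events):
    """Group transfer events into per-bin statistics.

    Each event is assumed to be a dict with src_site, dst_site, state,
    activity, code, time and bytes.  The result maps the t_stats_file
    primary key to [files, bytes] totals.
    """
    stats = defaultdict(lambda: [0, 0])
    for e in events:
        key = (e["src_site"], e["dst_site"], e["state"],
               e["activity"], e["code"], period_end_time(e["time"]))
        stats[key][0] += 1            # FILES: one event per file
        stats[key][1] += e["bytes"]   # BYTES
    return stats
```

Note that, following the binning convention used on this page, an event at exactly 00:00 falls in the bin with period_end_time 00:10, not 00:00.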

Statistics are stored in the following schema:

CREATE TABLE t_stats_file (
  src_site        VARCHAR2(100),
  dst_site        VARCHAR2(100),
  state           VARCHAR2(32),
  activity        INTEGER,
  code            INTEGER,
  period_end_time TIMESTAMP,
  files           INTEGER,
  bytes           INTEGER,
  update_time     TIMESTAMP,
  text            VARCHAR2(4000),
  CONSTRAINT pk_sf PRIMARY KEY (src_site, dst_site, state, code, activity, period_end_time)
)

Data sample:

SRC_SITE DST_SITE STATE ACTIVITY CODE PERIOD_END_TIME FILES BYTES UPDATE_TIME TEXT
CERN-PROD_TZERO AGLT2_DATADISK COPIED 6 0 2013-03-07 00:10 17 77378271890 2013-03-07 00:40  
PIC_SCRATCHDISK INFN-MILANO-ATLASC_DATADISK COPIED 6 0 2013-03-07 00:10 3 12798967409 2013-03-07 00:40  
RAL-LCG2_MCTAPE UKI-SCOTGRID-GLASGOW_PRODDISK FAILED_TRANSFER 0 231 2013-03-07 00:10 2 2013-03-07 00:40 [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] No status updates received since more than [360] seconds. Probably the process serving the transfer is stuck] Duration [0]
RAL-LCG2_MCTAPE UKI-NORTHGRID-MAN-HEP_PRODDISK COPIED 0 0 2013-03-07 00:10 1 1884530057 2013-03-07 00:40  
RRC-KI_DATADISK GRIF-IRFU_DATADISK COPIED 2 0 2013-03-07 00:10 5 100000000 2013-03-07 00:40  
DESY-HH_DATADISK RO-16-UAIC_DATADISK COPIED 2 0 2013-03-07 00:10 5 1000000000 2013-03-07 00:40  
DESY-ZN_DATADISK INFN-ROMA1_DATADISK COPIED 2 0 2013-03-07 00:10 5 10000000000 2013-03-07 00:40  
GOEGRID_PRODDISK FZK-LCG2_MCTAPE FAILED_TRANSFER 0 138 2013-03-07 00:10 1 2013-03-07 00:40 [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait] Duration [0]
GOEGRID_PRODDISK NDGF-T1_DATADISK COPIED 0 0 2013-03-07 00:10 5 22672115 2013-03-07 00:40  

Implementation: dq2-get/put transfers

As mentioned in the Overview section, it is anticipated that dq2-get/put transfer events will not be stored and processed in the DDM Dashboard; rather, the statistics will be retrieved from existing DDM systems.

There are a number of ways in which the statistics could be transferred from the existing DDM systems to the DDM Dashboard, e.g. via an API or via file export. Some DDM statistics are already produced for Dashboard applications (DDM Accounting) as tsv (tab-separated values) files accessible on a web server. I believe these tsv files are the output of Hadoop map-reduce jobs. A similar approach may work for dq2-get/put statistics, although this raises a number of questions.
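If the file-export route is taken, consuming such a tsv file on the Dashboard side would be straightforward. A minimal sketch, assuming the column names of the draft file structure proposed in question 2 and a header row as the first line:

```python
import csv
import io

def read_stats(tsv_text):
    """Parse a tab-separated statistics export into a list of dicts.

    Integer columns are converted; BYTES may be empty for failed
    transfers, in which case it is stored as None.
    """
    rows = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for col in ("ACTIVITY", "CODE", "FILES"):
            row[col] = int(row[col])
        row["BYTES"] = int(row["BYTES"]) if row["BYTES"] else None
        rows.append(row)
    return rows
```

This is only a sketch of the consumer side; the actual import mechanism (and whether a header row is present) would need to be agreed with the producers of the files.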

Questions:

  1. Is file export the best approach?
  2. What file format should be used?
  3. How frequently should the files be produced?
  4. How up-to-date can the files be?
  5. Do time bins need to be recalculated due to late arrival of transfer events?
  6. Can error samples also be produced?
  7. ...

Draft Answers:

1. Is file export the best approach?

See the comment from Mario below.

For the sake of having a starting point for discussion, it is assumed that this is the best approach for now.

2. What file format should be used?

Draft file structure (tsv) for statistics, mirroring the database table:

SRC_SITE DST_SITE STATE ACTIVITY CODE PERIOD_END_TIME FILES BYTES TEXT
ROAMING AGLT2_DATADISK COPIED 8 0 2013-03-07 00:10 17 77378271890  
PIC_SCRATCHDISK ROAMING COPIED 8 0 2013-03-07 00:10 3 12798967409  
ROAMING UKI-SCOTGRID-GLASGOW_PRODDISK FAILED_TRANSFER 8 231 2013-03-07 00:10 2 [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] No status updates received since more than [360] seconds. Probably the process serving the transfer is stuck] Duration [0]
ROAMING UKI-NORTHGRID-MAN-HEP_PRODDISK COPIED 8 0 2013-03-07 00:10 1 1884530057  
ROAMING GRIF-IRFU_DATADISK COPIED 8 0 2013-03-07 00:10 5 100000000  
ROAMING RO-16-UAIC_DATADISK COPIED 8 0 2013-03-07 00:10 5 1000000000  
ROAMING INFN-ROMA1_DATADISK COPIED 8 0 2013-03-07 00:10 5 10000000000  
GOEGRID_PRODDISK ROAMING FAILED_TRANSFER 8 138 2013-03-07 00:10 1 [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait] Duration [0]
ROAMING NDGF-T1_DATADISK COPIED 8 0 2013-03-07 00:10 5 22672115  

Notes:

  • SRC_SITE, DST_SITE identify the DDM endpoints. ROAMING is the name used for dq2-get/put clients.
  • STATE will always be COPIED or FAILED_TRANSFER in this case.
  • ACTIVITY is always 8 (StageIn/Out; see TiersOfATLAS) for dq2-get/put operations.
  • CODE is the error code, or 0 if not applicable. Ideally it would properly categorise errors. For now it is pragmatically generated as the error message length.
  • PERIOD_END_TIME groups statistics in 10 minute time bins, e.g. for period_end_time = 2012/06/02 00:00 the bin is [2012/06/01 23:50, 2012/06/02 00:00).
  • FILES is the number of files.
  • BYTES is the number of bytes.
  • UPDATE_TIME is internal to the DDM Dashboard so is omitted.
  • TEXT is a sample error message for the corresponding error code.
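The notes above on CODE and the fixed ACTIVITY value can be made concrete. The sketch below shows the pragmatic error-code generation (message length as a stand-in for a real error category) and the construction of one draft tsv line; the function names are illustrative, not part of any existing system.

```python
def error_code(state, text):
    """Pragmatic stand-in for a real error category: 0 for successful
    transfers or when no message is available, otherwise the length of
    the error message (as described in the notes above)."""
    if state == "COPIED" or not text:
        return 0
    return len(text)

def stats_line(src, dst, state, period_end, files, nbytes, text=""):
    """Build one line of the draft tsv structure.

    ACTIVITY is fixed to 8 (StageIn/Out) for dq2-get/put; BYTES is
    left empty when unknown (e.g. for failed transfers)."""
    fields = (src, dst, state, 8, error_code(state, text), period_end,
              files, nbytes if nbytes is not None else "", text)
    return "\t".join(str(v) for v in fields)
```

A proper categorisation of errors would replace error_code without changing the file structure.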

3. How frequently should the files be produced?

Ideally every 10 minutes but this may not be realistic.

4. How up-to-date can the files be?

Ideally all events up to the time the file is produced should be included but this may not be realistic. See comment for question 5.

5. Do time bins need to be recalculated due to late arrival of transfer events?

This depends on the potential delay of transfer events and how up-to-date we want the statistics files. For example, if the transfer events are always stored within minutes and the statistics files exclude events in the last hour, say, then we can assume that the statistics files include all transfer events up to the last hour.
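This reasoning can be expressed directly: if events are guaranteed to arrive within some maximum delay, then a file produced at time T can safely include every bin ending at or before T minus that delay, and those bins never need recalculating. A sketch, where the one-hour delay bound is just the assumption from the example above:

```python
from datetime import datetime, timedelta

MAX_EVENT_DELAY = timedelta(hours=1)  # assumed upper bound on event arrival delay
BIN_MINUTES = 10

def latest_complete_bin_end(production_time: datetime) -> datetime:
    """Latest PERIOD_END_TIME whose bin is guaranteed complete when a
    statistics file is produced at production_time: all events for that
    bin must already have arrived, so it never needs recalculating."""
    cutoff = production_time - MAX_EVENT_DELAY
    # round the cutoff down to a 10-minute bin boundary
    return cutoff.replace(minute=(cutoff.minute // BIN_MINUTES) * BIN_MINUTES,
                          second=0, microsecond=0)
```

If the delay bound cannot be guaranteed, the alternative is to re-export (and re-import) recently touched bins, which is more complex on both sides.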

6. Can error samples also be produced?

If the existing DDM systems do not store error messages or it is not easy to process them for statistics, then we could proceed with an implementation that does not include error messages or categorisation by error code in the first instance.

7. ...

This is just the start of the discussion. There are no doubt many more issues to be addressed.

Comments

On 07/03/13 15:00, Mario Lassnig wrote:

Hi David,

All the data is in Oracle already, so you might just connect with a reader account to the Active Data Guard, and do a periodical insert into dashboard_table select columns,count(),sum() from t_traces group by columns?

Cheers, Mario


On 03/07/2013 15:55, David Tuckett wrote:

Hi Mario,

Yes that would be much simpler for both of us. I wasn't aware that this data was in Oracle.

How would I go about getting a reader account to the Active Data Guard?

Cheers, David


Topic revision: r4 - 2013-10-17 - DavidTuckett