ATLAS DQ2 Get/Put Monitoring
This page documents the development of DQ2 get/put transfer monitoring within the ATLAS DDM Dashboard.
This feature has now been fully implemented. See
https://savannah.cern.ch/bugs/index.php?101933
Overview
The proposal is to include dq2-get/put transfer statistics in the ATLAS DDM Dashboard 2.0:
http://dashb-atlas-data.cern.ch/ddm2/
The idea is that dq2-get/put statistics should be stored and presented in the same way as subscription-based transfer statistics currently are in the DDM Dashboard; see the following section. However, initial discussions with DDM developers indicate that it would not be practical or desirable to store and process dq2-get/put transfer events directly in the DDM Dashboard; instead, the statistics will be retrieved from existing DDM systems where these events are already stored.
Context: subscription-based transfers
For subscription-based transfers, site services post HTTP requests in real time to the DDM Dashboard, notifying it of transfer events for each individual file and dataset. The DDM Dashboard stores these events in an Oracle database, and every 10 minutes Oracle stored procedures process the new events: they identify which 10-minute time bins are touched by the new events and calculate statistics for those bins. Further procedures aggregate these statistics into daily time bins. A sketch of this kind of aggregation is given after the schema below.
Statistics are stored in the following schema:
CREATE TABLE t_stats_file (
    src_site        VARCHAR2(100),
    dst_site        VARCHAR2(100),
    state           VARCHAR2(32),
    activity        INTEGER,
    code            INTEGER,
    period_end_time TIMESTAMP,
    files           INTEGER,
    bytes           INTEGER,
    update_time     TIMESTAMP,
    text            VARCHAR2(4000),
    CONSTRAINT pk_sf PRIMARY KEY (src_site, dst_site, state, code, activity, period_end_time)
);
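The stored procedures themselves are internal to the DDM Dashboard, so the following is only a minimal sketch of the kind of 10-minute aggregation they perform; the raw event table t_transfer_events, its columns, and the bind variable :last_processed_time are assumed names for illustration:

-- Sketch only: aggregate new raw transfer events into 10-minute bins.
-- A real procedure would merge into existing bins rather than insert blindly.
INSERT INTO t_stats_file
    (src_site, dst_site, state, activity, code, period_end_time,
     files, bytes, update_time, text)
SELECT
    e.src_site,
    e.dst_site,
    e.state,
    e.activity,
    e.code,
    -- round the event time up to the end of its 10-minute bin
    TRUNC(e.event_time, 'HH24')
        + (FLOOR(TO_NUMBER(TO_CHAR(e.event_time, 'MI')) / 10) + 1) * (10 / 1440),
    COUNT(*),
    SUM(e.bytes),
    SYSTIMESTAMP,
    MAX(e.error_message)    -- keep one sample error message per group
FROM t_transfer_events e
WHERE e.event_time >= :last_processed_time
GROUP BY
    e.src_site, e.dst_site, e.state, e.activity, e.code,
    TRUNC(e.event_time, 'HH24')
        + (FLOOR(TO_NUMBER(TO_CHAR(e.event_time, 'MI')) / 10) + 1) * (10 / 1440);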
Data sample:
| SRC_SITE | DST_SITE | STATE | ACTIVITY | CODE | PERIOD_END_TIME | FILES | BYTES | UPDATE_TIME | TEXT |
| CERN-PROD_TZERO | AGLT2_DATADISK | COPIED | 6 | 0 | 2013-03-07 00:10 | 17 | 77378271890 | 2013-03-07 00:40 | |
| PIC_SCRATCHDISK | INFN-MILANO-ATLASC_DATADISK | COPIED | 6 | 0 | 2013-03-07 00:10 | 3 | 12798967409 | 2013-03-07 00:40 | |
| RAL-LCG2_MCTAPE | UKI-SCOTGRID-GLASGOW_PRODDISK | FAILED_TRANSFER | 0 | 231 | 2013-03-07 00:10 | 2 | | 2013-03-07 00:40 | [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] No status updates received since more than [360] seconds. Probably the process serving the transfer is stuck] Duration [0] |
| RAL-LCG2_MCTAPE | UKI-NORTHGRID-MAN-HEP_PRODDISK | COPIED | 0 | 0 | 2013-03-07 00:10 | 1 | 1884530057 | 2013-03-07 00:40 | |
| RRC-KI_DATADISK | GRIF-IRFU_DATADISK | COPIED | 2 | 0 | 2013-03-07 00:10 | 5 | 100000000 | 2013-03-07 00:40 | |
| DESY-HH_DATADISK | RO-16-UAIC_DATADISK | COPIED | 2 | 0 | 2013-03-07 00:10 | 5 | 1000000000 | 2013-03-07 00:40 | |
| DESY-ZN_DATADISK | INFN-ROMA1_DATADISK | COPIED | 2 | 0 | 2013-03-07 00:10 | 5 | 10000000000 | 2013-03-07 00:40 | |
| GOEGRID_PRODDISK | FZK-LCG2_MCTAPE | FAILED_TRANSFER | 0 | 138 | 2013-03-07 00:10 | 1 | | 2013-03-07 00:40 | [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait] Duration [0] |
| GOEGRID_PRODDISK | NDGF-T1_DATADISK | COPIED | 0 | 0 | 2013-03-07 00:10 | 5 | 22672115 | 2013-03-07 00:40 | |
Implementation: dq2-get/put transfers
As mentioned in the Overview section, it is anticipated that dq2-get/put transfer events will not be stored and processed in the DDM Dashboard; instead, the statistics will be retrieved from existing DDM systems.
There are a number of ways in which the statistics could be transferred from the existing DDM systems to the DDM Dashboard, e.g. via an API or via file export. Some DDM statistics are already produced for Dashboard applications (DDM Accounting) as tsv (tab-separated values) files accessible on a web server; I believe these tsv files are the output of Hadoop map-reduce jobs. A similar approach may work for dq2-get/put statistics, although it raises a number of questions.
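On the Dashboard side, one conceivable way to ingest such tsv files, once downloaded, would be an Oracle external table; this is only a sketch, and the table name, the directory object stats_dir and the file name are assumptions:

-- Sketch only: expose a downloaded tsv statistics file as a queryable table.
-- stats_dir must be an Oracle directory object pointing at the download area.
CREATE TABLE t_getput_stats_ext (
    src_site        VARCHAR2(100),
    dst_site        VARCHAR2(100),
    state           VARCHAR2(32),
    activity        NUMBER,
    code            NUMBER,
    period_end_time VARCHAR2(16),   -- parsed with TO_TIMESTAMP on load
    files           NUMBER,
    bytes           NUMBER,
    text            VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY stats_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY X'09'
        MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('dq2_getput_stats.tsv')
);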
Questions:
- Is file export the best approach?
- What file format should be used?
- How frequently should the files be produced?
- How up-to-date can the files be?
- Do time bins need to be recalculated due to late arrival of transfer events?
- Can error samples also be produced?
- ...
Draft Answers:
1. Is file export the best approach?
See the comment from Mario in the Comments section below.
For the sake of having a starting point for discussion, it is assumed for now that this is the best approach.
2. What file format should be used?
Draft file structure (tsv) for statistics, mirroring the database table:
| SRC_SITE | DST_SITE | STATE | ACTIVITY | CODE | PERIOD_END_TIME | FILES | BYTES | TEXT |
| ROAMING | AGLT2_DATADISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 17 | 77378271890 | |
| PIC_SCRATCHDISK | ROAMING | COPIED | 8 | 0 | 2013-03-07 00:10 | 3 | 12798967409 | |
| ROAMING | UKI-SCOTGRID-GLASGOW_PRODDISK | FAILED_TRANSFER | 8 | 231 | 2013-03-07 00:10 | 2 | | [FTS] FTS State [Failed] FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] No status updates received since more than [360] seconds. Probably the process serving the transfer is stuck] Duration [0] |
| ROAMING | UKI-NORTHGRID-MAN-HEP_PRODDISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 1 | 1884530057 | |
| ROAMING | GRIF-IRFU_DATADISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 5 | 100000000 | |
| ROAMING | RO-16-UAIC_DATADISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 5 | 1000000000 | |
| ROAMING | INFN-ROMA1_DATADISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 5 | 10000000000 | |
| GOEGRID_PRODDISK | ROAMING | FAILED_TRANSFER | 8 | 138 | 2013-03-07 00:10 | 1 | | [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait] Duration [0] |
| ROAMING | NDGF-T1_DATADISK | COPIED | 8 | 0 | 2013-03-07 00:10 | 5 | 22672115 | |
Notes:
- SRC_SITE and DST_SITE identify the DDM endpoints; ROAMING is the name used for dq2-get/put clients.
- STATE will always be COPIED or FAILED_TRANSFER in this case.
- ACTIVITY is always 8 (StageIn/Out; see TiersOfATLAS) for dq2-get/put operations.
- CODE is the error code, or 0 if not applicable. Ideally it would properly categorise errors; for now it is pragmatically generated as the length of the error message (see the sketch after this list).
- PERIOD_END_TIME groups statistics into 10-minute time bins, e.g. for period_end_time = 2012/06/02 00:00 the bin is [2012/06/01 23:50, 2012/06/02 00:00). The bin arithmetic is illustrated in the sketch after this list.
- FILES is the number of files.
- BYTES is the number of bytes.
- UPDATE_TIME is internal to the DDM Dashboard so is omitted.
- TEXT is a sample error message for the corresponding error code.
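To make the CODE and PERIOD_END_TIME conventions above concrete, here is a small worked sketch of both (illustration only, in Oracle SQL as elsewhere on this page):

-- PERIOD_END_TIME: an event at 2012-06-01 23:55 falls in the bin
-- [2012-06-01 23:50, 2012-06-02 00:00), so its period_end_time is 00:00.
SELECT TRUNC(ts, 'HH24')
           + (FLOOR(TO_NUMBER(TO_CHAR(ts, 'MI')) / 10) + 1) * (10 / 1440)
           AS period_end_time    -- returns 2012-06-02 00:00
FROM (SELECT TO_DATE('2012-06-01 23:55', 'YYYY-MM-DD HH24:MI') AS ts FROM dual);

-- CODE: the pragmatic error code is the error message length, or 0 if there
-- is no error message.
SELECT NVL(LENGTH('[FTS] FTS State [Failed] ...'), 0) AS code FROM dual;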
3. How frequently should the files be produced?
Ideally every 10 minutes, but this may not be realistic.
4. How up-to-date can the files be?
Ideally, all events up to the time the file is produced should be included, but this may not be realistic. See the comment for question 5.
5. Do time bins need to be recalculated due to late arrival of transfer events?
This depends on the potential delay of transfer events and on how up-to-date we want the statistics files to be. For example, if transfer events are always stored within minutes and the statistics files exclude events from, say, the last hour, then we can assume that the statistics files include all transfer events up to one hour ago.
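If bins do have to be recalculated, the Dashboard-side load could upsert rather than insert, so that a regenerated statistics row simply replaces the earlier row for the same bin. A minimal sketch, assuming the regenerated rows are first loaded into a staging table t_stats_file_staging (an assumed name):

-- Sketch only: replace or insert 10-minute bins affected by late events.
MERGE INTO t_stats_file s
USING t_stats_file_staging n
ON (s.src_site = n.src_site AND s.dst_site = n.dst_site
    AND s.state = n.state AND s.code = n.code AND s.activity = n.activity
    AND s.period_end_time = n.period_end_time)
WHEN MATCHED THEN UPDATE SET
    s.files = n.files,
    s.bytes = n.bytes,
    s.text = n.text,
    s.update_time = SYSTIMESTAMP
WHEN NOT MATCHED THEN INSERT
    (src_site, dst_site, state, activity, code, period_end_time,
     files, bytes, update_time, text)
VALUES
    (n.src_site, n.dst_site, n.state, n.activity, n.code, n.period_end_time,
     n.files, n.bytes, SYSTIMESTAMP, n.text);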
6. Can error samples also be produced?
If the existing DDM systems do not store error messages, or if it is not easy to process them for statistics, then we could proceed with a first implementation that does not include error messages or categorisation by error code.
7. ...
This is just the start of the discussion. There are no doubt many more issues to be addressed.
Comments
On 07/03/13 15:00, Mario Lassnig wrote:
Hi David,
All the data is in Oracle already, so you might just connect with a
reader account to the Active Data Guard, and do a periodical
insert into dashboard_table select columns,count(),sum() from t_traces
group by columns?
Cheers,
Mario
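For concreteness, a hedged sketch of what Mario's suggestion might look like follows. The t_traces column names (src_site, dst_site, state, event_time, error_message, bytes) are assumptions and would need to be checked against the real trace schema:

-- Sketch only: periodic aggregation of dq2-get/put traces into the Dashboard
-- statistics table, along the lines of Mario's suggestion above.
INSERT INTO t_stats_file
    (src_site, dst_site, state, activity, code, period_end_time,
     files, bytes, update_time, text)
SELECT src_site, dst_site, state, 8, code, period_end_time,
       COUNT(*), SUM(bytes), SYSTIMESTAMP, MAX(error_message)
FROM (
    SELECT
        NVL(t.src_site, 'ROAMING') AS src_site,   -- dq2-get/put client side
        NVL(t.dst_site, 'ROAMING') AS dst_site,
        t.state,
        NVL(LENGTH(t.error_message), 0) AS code,  -- pragmatic error code
        -- same 10-minute bin arithmetic as in the earlier sketches
        TRUNC(t.event_time, 'HH24')
            + (FLOOR(TO_NUMBER(TO_CHAR(t.event_time, 'MI')) / 10) + 1)
              * (10 / 1440) AS period_end_time,
        t.bytes,
        t.error_message
    FROM t_traces t
    WHERE t.event_time >= :last_processed_time
)
GROUP BY src_site, dst_site, state, code, period_end_time;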
On 03/07/2013 15:55, David Tuckett wrote:
Hi Mario,
Yes that would be much simpler for both of us. I wasn't aware that this
data was in Oracle.
How would I go about getting a reader account to the Active Data Guard?
Cheers,
David