FTS Evaluations:

CMS monitors the network link performance between transfer endpoints of sites. The monitoring is based on the transfer logs of the CERN File Transfer Service, FTS, which is used by both the Run 1/2 data transfer system, PhEDEx, and the Run 3/4 data management system, Rucio. The logs of CMS data transfers are analyzed and a metric is derived for endpoint-to-endpoint links, source endpoints, destination endpoints, and sites. The metric is used to determine the link status in the CMS Site Readiness evaluations.

FTS Metrics:

The evaluation of transfer logs from FTS yields four metrics: for a 15 minute, a 1 hour, a 6 hour, and a full day interval. An evaluation every 15 minutes can alert quickly to complete failures, while longer intervals are required to detect degraded links or links between endpoints with very low file transfer activity. Evaluations proceed in steps: a log analysis that classifies failed transfers into likely source, transfer, or destination issues; an endpoint-to-endpoint (link) evaluation; the identification and exclusion of dysfunctional endpoints; a source and destination host evaluation (that excludes identified destination/source issues); and finally a site evaluation based on the source and destination evaluations of the endpoints belonging to the site. The metrics derive a status of ok, warning, error, or unknown and a quality value between 0 and 1.0. The metrics/evaluation results are sent to the CERN MonIT system and stored in HDFS at /project/monitoring/archive/cmssst/raw/ssbmetric/<metric-name>/ with metric-name being fts15min, fts1hour, fts6hour, and fts1day.

Evaluation Details:

FTS logs with topic fts_raw_complete that are VO-tagged with cms are fetched from CERN MonIT for a given time interval, checked for consistency, and then classified, based on tr_error_scope, t_error_code, t__error_message, and f_size, into:
  • trn_ok  : for all successful transfers
  • trn_usr : in case the file size was excessive (>=20GB), the user certificate expired, or the transfer was cancelled by the user
  • trn_tout: in case of connection or transfer timeouts
  • trn_err : in case of other network/transfer related errors
  • src_perm: in case of file/directory read/access permissions at the source endpoint
  • src_miss: in case the requested file/path did not exist at the source endpoint
  • src_err : in case of other source endpoint associated errors
  • dst_perm: in case of write/over-write permission/authorization errors at the destination endpoint
  • dst_path: in case of directory creation permission/authorization errors at the destination endpoint
  • dst_spce: in case of quota/insufficient free space being available at the destination endpoint
  • dst_err : in case of other destination endpoint associated errors
The error analysis is limited by the error codes and messages the various transfer protocols, tools, and service implementations provide. It needs to be maintained and updated as new protocols and tools are being used.
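The classification step above can be sketched in Python. The matching patterns below are purely illustrative assumptions; the actual cmssst script matches the specific error codes and messages of the protocols and tools in use.

```python
# Minimal sketch of the transfer-log classification step.
# The string patterns are hypothetical; the real script matches
# the actual FTS error codes/messages.

MAX_FILE_SIZE = 20 * 1024**3  # 20 GB user-error threshold from the text

def classify_transfer(error_scope, error_message, file_size, success):
    """Classify one FTS transfer log record into an evaluation class."""
    if success:
        return "trn_ok"
    msg = (error_message or "").lower()
    if file_size is not None and file_size >= MAX_FILE_SIZE:
        return "trn_usr"
    if "certificate expired" in msg or "canceled by the user" in msg:
        return "trn_usr"
    if "timeout" in msg or "timed out" in msg:
        return "trn_tout"
    if error_scope == "source":
        if "permission" in msg or "denied" in msg:
            return "src_perm"
        if "no such file" in msg:
            return "src_miss"
        return "src_err"
    if error_scope == "destination":
        if "permission" in msg or "denied" in msg:
            return "dst_perm"
        if "mkdir" in msg:
            return "dst_path"
        if "quota" in msg or "no space" in msg:
            return "dst_spce"
        return "dst_err"
    return "trn_err"
```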

Link Evaluation:

Each transfer log is associated with a source--destination endpoint link. The numbers of files and Bytes transferred in each of the above classes are summed up and the link quality and status are derived:
        quality = max( Files(trn_ok) / Files(total) , Bytes(trn_ok) / Bytes(total) )
i.e. the higher of the successful file and Byte transfer fractions. This is done to avoid small unsuccessful file transfers skewing the metric. Quality is rounded to three digits after the decimal point. If a source--destination endpoint link has no transfers within the interval, quality is set to 0.0 and status to unknown. If all transfers are successful, or the number of successful transfers minus one divided by the total number of transfers is larger than 0.5, status is set to ok. If the number of successful transfers plus one divided by the total number of transfers is smaller than 0.5, status is set to error. Otherwise, status is set to warning. The plus/minus one calculation introduces a kind of significance into the status evaluation, since many links, especially in the quarter-hour/one-hour intervals, have only a small number of transfers. For small transfer counts the status evaluation is shown in the table below:
                          trn_ok
  total    0   1   2   3   4   5   6   7   8
    1      w   ok
    2      w   w   ok
    3      e   w   w   ok
    4      e   w   w   w   ok
    5      e   e   w   w   ok  ok
    6      e   e   w   w   w   ok  ok
    7      e   e   e   w   w   ok  ok  ok
    8      e   e   e   w   w   w   ok  ok  ok
    9      ...
The detail field in the FTS metrics lists the numbers of files and Bytes in the above classes and provides a link to the FTS log of the first transfer encountered in each error class.
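The link quality and status rules above can be written compactly in Python. This is a sketch reproducing the formulas and the plus/minus one rule from the text; the function names are illustrative, not those of the actual script.

```python
# Sketch of the link quality/status derivation described above.

def link_quality(files_ok, files_total, bytes_ok, bytes_total):
    """Higher of the successful file and Byte fractions, 3 digits."""
    if files_total == 0:
        return 0.0  # no transfers in the interval
    byte_frac = bytes_ok / bytes_total if bytes_total else 0.0
    return round(max(files_ok / files_total, byte_frac), 3)

def link_status(ok, total):
    """Derive ok/warning/error/unknown with the +/-1 significance rule."""
    if total == 0:
        return "unknown"
    if ok == total or (ok - 1) / total > 0.5:
        return "ok"
    if (ok + 1) / total < 0.5:
        return "error"
    return "warning"
```

Running link_status over small totals reproduces the status table above, e.g. a single failed transfer yields warning rather than error because of the plus-one rule.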

Dysfunctional Endpoint Identification:

Next, dysfunctional source and destination endpoints are identified. For this, a low quality threshold of 0.25 is used. Files/Bytes of links with an unknown status evaluation are combined, their combined quality calculated, and checked against the 0.25 threshold. If the number of links with quality below the threshold minus one, divided by the number of links (the combined unknown ones counted as one link), is above 0.75, the source or destination endpoint is marked as dysfunctional and links to or from it are excluded from the other endpoint evaluations.
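As a minimal sketch, assuming the combination of unknown-status links into a single quality entry is done upstream, the dysfunctional-endpoint check reduces to:

```python
# Hypothetical helper: the list is expected to contain one quality
# value per link with known status, plus one entry for the combined
# unknown-status links (as described in the text).

def is_dysfunctional(link_qualities, threshold=0.25):
    """Mark an endpoint dysfunctional if (bad - 1) / n > 0.75."""
    n = len(link_qualities)
    if n == 0:
        return False
    bad = sum(1 for q in link_qualities if q < threshold)
    return (bad - 1) / n > 0.75
```

Note that a single bad link can never mark an endpoint dysfunctional, since (1 - 1) / 1 = 0; this is the same significance idea as the plus/minus one rule in the link status.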

Source Endpoint Evaluation:

Source endpoints are evaluated by re-calculating the link-quality as above but excluding dysfunctional destination endpoints and all counts in the dst_* classes:
        quality_link = max( Files(trn_ok) / Files(trn_*, src_*) , Bytes(trn_ok) / Bytes(trn_*, src_*) )
i.e. excluding identified destination errors. Similarly to the link evaluation above, a link-status is derived based on the minus/plus one rule and the 0.5 threshold. The source endpoint quality is obtained by summing up the link-quality of all links with an ok or error status and combining files/Bytes of links with an unknown status (and adding the combined quality calculation). The number of links with ok status (plus the status evaluation for the combined unknown links) is counted and the source endpoint quality derived as:
        quality = max( Links(ok) / Links(total) , Sum(quality) / Links(total) )
i.e. also here, the higher of the link ratio and the average quality is used. This is done to better handle extreme quality values in case of a small number of transfers. Quality is rounded to three digits after the decimal point. If a source endpoint has no transfers within the interval, quality is set to 0.0 and status to unknown. If the number of links with link-status ok minus one, divided by the number of links (the combined unknown ones counted as one link), is larger than 0.5, status is set to ok. If the number of links with link-status ok plus one, divided by the number of links, is smaller than 0.5, status is set to error. Otherwise, status is set to warning.
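The endpoint aggregation can be sketched as follows. The input lists are assumed to already contain one entry per link, with the combined unknown links counted as one; the text states no all-ok shortcut for endpoints (unlike for links), so the sketch follows the text literally.

```python
# Sketch of the source/destination endpoint evaluation.
# link_qualities / link_statuses: one entry per link, with all
# unknown-status links combined into a single entry upstream.

def endpoint_eval(link_qualities, link_statuses):
    """Return (quality, status) for an endpoint."""
    n = len(link_qualities)
    if n == 0:
        return 0.0, "unknown"  # endpoint had no transfers
    n_ok = sum(1 for s in link_statuses if s == "ok")
    # higher of link ratio and average link quality, 3 digits
    quality = round(max(n_ok / n, sum(link_qualities) / n), 3)
    if (n_ok - 1) / n > 0.5:
        status = "ok"
    elif (n_ok + 1) / n < 0.5:
        status = "error"
    else:
        status = "warning"
    return quality, status
```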

Destination Endpoint Evaluation:

Destination endpoints are evaluated in a similar manner but with src_* classes excluded in the link-quality/link-status re-calculation.

Site Evaluation:

Finally, the site status is evaluated based on the endpoints belonging to the site according to the VO-feed, considering each endpoint both as source and as destination. The quality of a site is the lowest quality of any source or destination endpoint. The status is error if any source or destination endpoint of the site has a status of error. Otherwise, if any source or destination endpoint is without transfers, the status is unknown. Otherwise, if any source or destination endpoint has a status of warning, the site status is set to warning; if all source and destination endpoints have a status of ok, the site status is set to ok. The detail field of the metric lists the total successful file/Byte transfers from (i.e. site as source) and to the endpoints at the site. It provides the error counts in the relevant classes for transfers from and to the site, a summary of link statuses from and to the site, and a list of the from (source) and to (destination) status of each endpoint of the site.
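The site roll-up rules above amount to a simple precedence of error over unknown over warning over ok, with the site quality as the minimum endpoint quality. A sketch (function names illustrative):

```python
# Sketch of the site evaluation: statuses/qualities of every source
# and destination endpoint belonging to the site (per the VO-feed).

def site_status(endpoint_statuses):
    """error > unknown > warning > ok precedence from the text."""
    if any(s == "error" for s in endpoint_statuses):
        return "error"
    if any(s == "unknown" for s in endpoint_statuses):
        return "unknown"
    if any(s == "warning" for s in endpoint_statuses):
        return "warning"
    return "ok"

def site_quality(endpoint_qualities):
    """Lowest quality of any source or destination endpoint."""
    return min(endpoint_qualities) if endpoint_qualities else 0.0
```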

Access:

The different FTS evaluations are all available in both Hadoop (path /project/monitoring/archive/cmssst/raw/ssbmetric/) and ElasticSearch (index monit_prod_cmssst_*) of CERN MonIT. The document names are "fts15min", "fts1hour", "fts6hour", and "fts1day". All have the same data elements, "name" (name of the site or endpoint), "type" (one of link, source, destination, or site), "status" (one of ok, warning, error, or unknown), "quality" (a floating point number between 0 and 1), and "detail" (evaluation details). Timestamps in MonIT are in milliseconds since the epoch, all times are in UTC.
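A document with the data elements listed above would look like the following; the values shown are made up for illustration, only the field names and allowed value ranges come from the text.

```python
# Illustrative shape of one evaluation document in MonIT
# (field names from the text; values here are hypothetical).
example_doc = {
    "name": "T1_DE_KIT",   # site or endpoint name (made-up example)
    "type": "site",        # one of: link, source, destination, site
    "status": "ok",        # one of: ok, warning, error, unknown
    "quality": 0.987,      # float between 0 and 1
    "detail": "...",       # evaluation details
}
```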

Support:

The script to evaluate the FTS metric runs every 15 minutes via crontab on vocms777. The four different evaluations are done based on the time of execution and assume most FTS logs are available after 17 minutes and all logs are in MonIT after an hour.

Useful Links:

Topic revision: r4 - 2020-02-05 - StephanLammel