Introduction

This documentation illustrates basic methods for unsticking rules using the dmwm client tools (https://github.com/nsmith-/dmwmclient) and the Rucio command line (https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsLocation). This page uses the new Rucio nomenclature for data elements: 'container' refers to what PhEDEx called a 'dataset', and 'dataset' refers to what PhEDEx called a 'block'. 'LFN' designates the logical name of a file.

We also show a quick trick for getting the FTS IDs of transfers from the ID of the rule that triggered them. This trick is extremely useful for unsticking rules and spotting specific problems in the grid.

Method 1: Get problematic lfns knowing in advance the ID of a stuck or replicating rule. Then use Kibana.

In many situations when transfer issues occur, the only information that data operators have at hand is the rule ID and the rule state. When the rule is stuck, or has been replicating for a long time, it is necessary to access the FTS error logs (https://monit-kibana.cern.ch/kibana/goto/8a9ca30193fa7583cdf96efc0a72b86e) to learn more about the files that are failing to transfer. However, getting from a rule ID to the FTS transfer logs is not straightforward. Here we explain how to go from the ID of a stuck or replicating rule to the FTS error logs, by first getting the list of LFNs to which that rule applies. This method uses the dmwm client along with basic Rucio commands on Linux.

  • Get information about the rule. Run the following command on an lxplus machine, where rule_ID is the ID of the rule:

$ rucio rule-info rule_ID

Example:

$ rucio rule-info 4b2b023d44c743cfa42f3b60fccdaf0d

Id:                         4b2b023d44c743cfa42f3b60fccdaf0d
Account:                    crab_tape_recall
Scope:                      user.crab_tape_recall
Name:                       /TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER
RSE Expression:             ddm_quota>0&(tier=1|tier=2)&rse_type=DISK
Copies:                     1
State:                      OK
Locks OK/REPLICATING/STUCK: 2/2/3
Grouping:                   ALL
Expires at:                 2021-02-04 15:43:54
Locked:                     False
Weight:                     ddm_quota
Created at:                 2021-01-21 15:43:54
Updated at:                 2021-01-29 02:34:55
Error:                      None
Subscription Id:            None
Source replica expression:  None
Activity:                   Analysis Input
Comment:                    Staged from tape for lbenato
Ignore Quota:               False
Ignore Availability:        False
Purge replicas:             False
Notification:               NO
End of life:                None
Child Rule Id:              None
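
If you need the DID and the lock counts in a script rather than by eye, the text printed by rucio rule-info can be parsed directly. The following is a minimal sketch, assuming the 'Key: value' layout shown above; the helper names parse_rule_info and summarize_rule are illustrative and are not part of Rucio or the dmwm client.

```python
def parse_rule_info(text):
    """Turn the 'Key: value' lines printed by `rucio rule-info` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def summarize_rule(info):
    """Return the DID (scope:name) and the lock counts of a parsed rule."""
    ok, replicating, stuck = (
        int(n) for n in info["Locks OK/REPLICATING/STUCK"].split("/")
    )
    return {
        "did": f"{info['Scope']}:{info['Name']}",
        "ok": ok,
        "replicating": replicating,
        "stuck": stuck,
    }


# Three lines copied from the example output above.
sample = """\
Scope:                      user.crab_tape_recall
Name:                       /TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER
Locks OK/REPLICATING/STUCK: 2/2/3
"""
summary = summarize_rule(parse_rule_info(sample))
print(summary["did"])
print(summary["replicating"] + summary["stuck"])  # locks that are not OK yet: 5
```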

  • Get the DID of the dataset or container on which the rule is placed. In this case, the DID is user.crab_tape_recall:/TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER. You can check this in the Scope and Name fields displayed by the previous command.

  • Verify whether the DID refers to a container, a dataset or a file. Remember that the DID of a data element is composed of its scope and its name. Generally, the scope of a DID is 'cms', but in this example the scope is 'user.crab_tape_recall'. It is also not clear from the name alone whether this DID refers to a dataset or a container. Therefore, run:

$ rucio list-content scope:name

where the scope and name of the DID were provided by the previous command (rucio rule-info rule_ID). For the example:

$ rucio list-content user.crab_tape_recall:/TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER

+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                                                                                              | [DID TYPE]   |
|---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------|
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#03a1e563-9761-4cb7-9266-fdf7358fd23d | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#5881bb44-2768-4766-b9fe-b558b8f24c1e | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#644dc1b5-9bc5-424a-b840-6cf69b9b71b0 | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#71f9c3b7-357b-45cf-8e21-5a38971b8f28 | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#a8acd388-4233-48fb-9ea2-9d693a0d8467 | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#abe7984d-793d-492e-aa7d-c9943b007c98 | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#ddf6b19d-a9f0-43f5-a505-523c6e7c6e88 | DATASET      |
| cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#e4258e39-0108-4035-91ce-ed773a157f80 | DATASET      |
+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+

The table shows that the data element user.crab_tape_recall:/TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER is a container, because its constituents are datasets. If the DID referred to a dataset instead, its constituents would be files.
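
The same check can be scripted. The following is a minimal sketch that reads the table printed by rucio list-content and infers the type of the parent DID from the types of its constituents. It assumes the two-column table layout shown above; the function names and the shortened DIDs in the sample are hypothetical, used only for illustration.

```python
def parse_content_table(text):
    """Extract (did, did_type) pairs from the table printed by `rucio list-content`."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Data rows start with '|'; border rows start with '+' or '|-'.
        if not line.startswith("|") or line.startswith("|-"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and anything that is not a two-column row.
        if len(cells) == 2 and cells[0] != "SCOPE:NAME":
            rows.append((cells[0], cells[1]))
    return rows


def parent_type(rows):
    """Infer the parent type: datasets hold files, containers hold datasets or containers."""
    child_types = {did_type for _, did_type in rows}
    if child_types == {"FILE"}:
        return "DATASET"
    if child_types and child_types <= {"DATASET", "CONTAINER"}:
        return "CONTAINER"
    return "UNKNOWN"


# Shortened, hypothetical DIDs in the same layout as the real output.
sample = """\
+----------------------------------+--------------+
| SCOPE:NAME                       | [DID TYPE]   |
|----------------------------------+--------------|
| cms:/Primary/Processed/TIER#aaaa | DATASET      |
| cms:/Primary/Processed/TIER#bbbb | DATASET      |
+----------------------------------+--------------+
"""
rows = parse_content_table(sample)
print(parent_type(rows))  # CONTAINER
```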

  • If the DID refers to a dataset, the previous command lists the names of all the files within that dataset. However, it is more useful to know only the names of the problematic files, as well as the site(s) to which those files are being transferred. To find the destination site and the number of missing files, run:

$ rucio list-dataset-replicas cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#a8acd388-4233-48fb-9ea2-9d693a0d8467

DATASET: cms:/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#a8acd388-4233-48fb-9ea2-9d693a0d8467
+-----------------+---------+---------+
| RSE             |   FOUND |   TOTAL |
|-----------------+---------+---------|
| T2_IT_Legnaro   |      15 |      20 |
| T1_US_FNAL_Tape |      20 |      20 |
+-----------------+---------+---------+
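
The number of files still missing at each RSE is simply TOTAL minus FOUND. As a minimal sketch, with the rows of the table above typed in by hand as (RSE, FOUND, TOTAL) tuples and a hypothetical helper name:

```python
def missing_per_rse(rows):
    """Map each incomplete RSE to the number of files it still lacks (TOTAL - FOUND)."""
    return {rse: total - found for rse, found, total in rows if found < total}


# Rows copied from the rucio list-dataset-replicas output above.
rows = [("T2_IT_Legnaro", 15, 20), ("T1_US_FNAL_Tape", 20, 20)]
incomplete = missing_per_rse(rows)
print(incomplete)  # {'T2_IT_Legnaro': 5}
```

Only RSEs with fewer files found than expected appear in the result, so the keys are the destinations to filter by when searching the FTS error logs.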


Note that the DID here is composed of the scope (cms) and the name of a dataset (/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#a8acd388-4233-48fb-9ea2-9d693a0d8467). The list-dataset-replicas output above shows that the dataset is missing 5 files at T2_IT_Legnaro (20 in total, 15 found). Those missing files are the reason why the rule is stuck or replicating. It also tells us that, when looking for FTS error logs, we should filter by the destination 'T2_IT_Legnaro'. However, we do not yet have the names of those files. To get their LFNs, we can use the dmwm client, whose installation instructions can be found here: https://github.com/nsmith-/dmwmclient. Once it is installed and running, run the following lines in the interactive mode of the client:

 

number_of_missing_files = 5  # TOTAL - FOUND at the incomplete RSE
stuck_files = []
dataset_name = '/gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8/RunIIAutumn18DRPremix-102X_upgrade2018_realistic_v15-v1/AODSIM#a8acd388-4233-48fb-9ea2-9d693a0d8467'

# RSEs listed for the dataset as a whole
replicas = await client.rucio.list_replicas(scope = 'cms', name = dataset_name)
replicas = replicas.drop_duplicates(subset = 'replica')
replicas = replicas['replica'].to_list()

# LFNs of all the files in the dataset
files = await client.rucio.list_content(scope = 'cms', name = dataset_name)
files = files['lfn'].to_list()

# A file is problematic if its set of RSEs differs from that of the dataset
for file in files:
    df = await client.rucio.list_replicas(scope = 'cms', name = file)
    rses = df.drop_duplicates(subset = 'replica')
    rses = rses['replica'].to_list()
    if set(replicas) != set(rses):
        stuck_files.append(file)
    # Stop early once all the missing files have been found
    if len(stuck_files) == number_of_missing_files:
        break

 

After running the previous code, the variable 'stuck_files' holds the list of LFNs of the problematic files.

  • On the other hand, if the DID refers to a container, you need to check each dataset of that container in turn to get the names of the problematic files. You can use the dmwm client to achieve that by running the following code:
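
The comparison at the heart of that loop can be tried out without access to Rucio. The sketch below is a stand-alone illustration with hand-written data: the function name and the LFNs are hypothetical. It also differs slightly from the loop above, which flags a file when its set of RSEs is not equal to that of the dataset; here a file is flagged only when it is absent from at least one RSE that holds the dataset.

```python
def find_stuck_files(dataset_rses, file_rses):
    """Return the LFNs whose replicas do not cover every RSE listed for the dataset.

    dataset_rses: iterable of RSE names listed for the dataset.
    file_rses: dict mapping each LFN to the RSE names that hold that file.
    """
    expected = set(dataset_rses)
    return sorted(
        lfn for lfn, rses in file_rses.items() if not expected <= set(rses)
    )


# Hypothetical LFNs; the RSE names come from the example above.
dataset_rses = ["T2_IT_Legnaro", "T1_US_FNAL_Tape"]
file_rses = {
    "/store/example/file_1.root": ["T2_IT_Legnaro", "T1_US_FNAL_Tape"],
    "/store/example/file_2.root": ["T1_US_FNAL_Tape"],
    "/store/example/file_3.root": ["T1_US_FNAL_Tape"],
}
stuck = find_stuck_files(dataset_rses, file_rses)
print(stuck)  # ['/store/example/file_2.root', '/store/example/file_3.root']
```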

 

# Number of REPLICATING + STUCK locks reported by rucio rule-info (2 + 3 in the example above)
number_of_missing_files = 5
stuck_files = []

# Datasets that make up the container
df = await client.rucio.list_content(scope = 'user.crab_tape_recall', name = '/TapeRecall/210121_154306.lbenato_crab_gluinoGMSB_M2400_ctau3p0_TuneCP2_13TeV_pythia8-v1/USER')
df = df.drop_duplicates(subset = 'lfn')
datasets = df['lfn'].to_list()

for dataset in datasets:
    # RSEs listed for this dataset as a whole
    replicas = await client.rucio.list_replicas(scope = 'cms', name = dataset)
    replicas = replicas.drop_duplicates(subset = 'replica')
    replicas = replicas['replica'].to_list()

    # LFNs of all the files in this dataset
    files = await client.rucio.list_content(scope = 'cms', name = dataset)
    files = files['lfn'].to_list()

    # A file is problematic if its set of RSEs differs from that of its dataset
    for file in files:
        file_replicas = await client.rucio.list_replicas(scope = 'cms', name = file)
        rses = file_replicas.drop_duplicates(subset = 'replica')
        rses = rses['replica'].to_list()
        if set(replicas) != set(rses):
            stuck_files.append(file)
        if len(stuck_files) == number_of_missing_files:
            break
    # Leave the outer loop as well once all the missing files have been found
    if len(stuck_files) == number_of_missing_files:
        break
 

Again, the variable 'stuck_files' holds the list of LFNs of the problematic files, this time for the whole container.

What to do after getting the LFNs of the problematic files

Go to https://monit-kibana.cern.ch/kibana/goto/8a9ca30193fa7583cdf96efc0a72b86e and look at the error logs for each, or some, of those files. You can filter by each of the LFNs obtained in the previous steps. There you will find concrete information about the transfer errors, which will allow you to pinpoint the problem.

Method 2: Get FTS IDs from rule IDs. Then use https://fts3.cern.ch:8449/fts3/ftsmon/#/ to get FTS logs


  • Search the transfer logs in Kibana for the rule ID:

data.rule_id:d52ae4222a1645f6be929a8b003f29aa

where d52ae4222a1645f6be929a8b003f29aa is the rule ID used as an example. You will get several logs of different transfer attempts. If you look closely at each of those logs, you will see a field called 'data.request_id', which identifies that transfer request. Copy one of those request IDs.


  • Search for that request ID:

data.request_id:0aed206fd4d041fab665b22dc4a0bdd3

where 0aed206fd4d041fab665b22dc4a0bdd3 is one of the request IDs copied in the previous step. As in the previous step, look in the results for a field called 'data.external_id'. That is an FTS transfer ID. Copy it.

  • Go to https://fts3.cern.ch:8449/fts3/ftsmon/#/. In the Job ID box, paste one of the data.external_id values from the previous step. You will then be shown the FTS logs for the transfers triggered by the rule ID you started from.

Keep in mind that the logs of older FTS transfers may no longer be available, so some FTS IDs may return no results.
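
When several rules have to be checked, the two search strings can be generated instead of typed. The following is a minimal sketch: the field names data.rule_id and data.request_id are the ones used above, while the helper name kibana_query and the check that an ID is 32 lowercase hexadecimal characters are assumptions based on the example IDs on this page.

```python
import re

# Assumption: IDs look like the examples on this page (32 lowercase hex characters).
HEX_ID = re.compile(r"[0-9a-f]{32}")
ALLOWED_FIELDS = ("data.rule_id", "data.request_id")


def kibana_query(field, identifier):
    """Build a 'field:value' search string for one of the two ID fields."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unexpected field: {field}")
    if not HEX_ID.fullmatch(identifier):
        raise ValueError(f"not a 32-character hex ID: {identifier}")
    return f"{field}:{identifier}"


print(kibana_query("data.rule_id", "d52ae4222a1645f6be929a8b003f29aa"))
print(kibana_query("data.request_id", "0aed206fd4d041fab665b22dc4a0bdd3"))
```

Rejecting malformed IDs early avoids pasting a truncated value into the search box and getting an empty result for the wrong reason.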

-- OscarFernandoGarzonMiguez - 2021-02-02

Topic revision: r2 - 2021-02-04 - OscarFernandoGarzonMiguez
 