Getting Productions to 100%

Assuming there are no external problems, once a production has started and the Data Quality shifters are following it, it is usually easy to get 95% of the files processed without much intervention. There are still cases, however, where human intervention is required. Throughout this page I'll give some examples.

WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING

  • Applying what is written here is potentially dangerous and should be done only by experts.
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING

Analysing a reconstruction request

The "workflow" of a reconstruction request could be summarised as follows:

  1. Before a RAW file is even reconstructed, a Data Quality flag ("OK") has to be set by the DQ shifter.
  2. There is a "DataReconstruction" production that takes RAW files as input and produces stream DSTs.
  3. The outputs of this first step have to get a Data Quality check. This flag is registered in the bookkeeping.
  4. If the DQ flag is "OK", the streams are merged. For the example that follows, there are 11 "Merging" productions.
  5. The merged files are distributed by one or more "Replication" productions. Note that this step is not always mandatory.

For the purposes of these examples, we'll have a look at request 1926:

[Figure: Req1926.png]

Shifters and GEOCs can get quite a lot of information by simply looking at the Production monitor page.

Looking at why Reconstruction production 8178 still has 2 files in status "Assigned"

Where do I get this information? Very simple: I just look at the "File Status" for that production. Two files are still in the "Assigned" state, and I want to check whether real jobs tried to process them. The fastest way is to connect directly to the database (which today is the ProductionDB on volhcb22), but you can also get this information from the production and job pages (the job name of every production job is composed of the production number and the "TaskID" number).

mysql> SELECT * FROM TransformationTasks WHERE TransformationID = 8178 AND TaskID IN (22103, 21849); 
+--------+------------------+----------------+------------+-----------+---------------------+---------------------+-----------+
| TaskID | TransformationID | ExternalStatus | ExternalID | TargetSE  | CreationTime        | LastUpdateTime      | RunNumber |
+--------+------------------+----------------+------------+-----------+---------------------+---------------------+-----------+
|  21849 |             8178 | Done           | 12614650   | IN2P3-RAW | 2010-11-08 14:12:43 | 2010-11-09 14:05:36 |     81356 |
|  22103 |             8178 | Done           | 12616481   | IN2P3-RAW | 2010-11-08 14:15:07 | 2010-11-09 06:27:43 |     81356 |
+--------+------------------+----------------+------------+-----------+---------------------+---------------------+-----------+
2 rows in set (0.00 sec)
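
As noted above, the job name of a production job is built from the production number and the TaskID. The Python sketch below shows one way to write that mapping. The zero-padded, underscore-separated layout is an assumption (it mirrors the 00008178_00017796 prefix of the DST files produced by this production), and task_job_name is a hypothetical helper, not a DIRAC function.

```python
def task_job_name(production_id: int, task_id: int, width: int = 8) -> str:
    """Build the assumed job name for one task of a production."""
    # Assumption: both numbers are zero-padded to 8 digits and joined by "_".
    return f"{production_id:0{width}d}_{task_id:0{width}d}"


# The two tasks investigated in this example.
for task_id in (21849, 22103):
    print(task_job_name(8178, task_id))
```

If the layout assumed here matches your installation, the resulting string can be pasted into the job monitoring page to find the job behind a given task.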

I got the TaskIDs from the web page, simply by looking at "File status" for that production. The first thing to notice is that these jobs ran at IN2P3, so this seems to be the usual case of a payload that is killed at IN2P3 but still exits with status = 0. This happens because IN2P3 has its own batch system (BQS), which will hopefully be replaced soon.

I have a look at the jobs in the Jobs monitoring, and they have no logs. This is normal: the applications could not finalize, so the logs were not uploaded.

Also, they have no pilot output, but this might just be a case where the LB lost track of the pilot:

[lxplus303] ~/Jobs $ dirac-admin-get-pilot-output 12614650
ERROR 12614650: Failed to determine owner for pilot 12614650
[lxplus303] ~/Jobs $ dirac-admin-get-pilot-output 12616481
ERROR 12616481: Failed to determine owner for pilot 12616481

Anyway, I could get the std.out:

[lxplus303] ~/Jobs $ dirac-wms-job-get-output 12616481
2010-11-11 10:57:08 UTC dirac-wms-job-get-output/DiracAPI  INFO: Files retrieved and extracted in /afs/cern.ch/user/f/fstagni/Jobs/12616481
Job output sandbox retrieved in 12616481/

The output ends abruptly: it really seems that the job was killed. The same is true for the other one.

As a last check, I look for descendants of the two input files in the bookkeeping:

[lxplus303] ~/Jobs $ dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81356/081356_0000000280.raw 9
Successful:
/lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81356/081356_0000000280.raw:
Faild: []
[lxplus303] ~/Jobs $ dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81356/081356_0000000023.raw 9
Successful:
/lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81356/081356_0000000023.raw:
Faild: []

I conclude that the files have not been processed, so I set their status to "Unused":

mysql> SELECT * FROM TransformationFiles WHERE TransformationID = 8178 AND TaskID IN (22103, 21849); 
+------------------+---------+----------+------------+--------+----------+-----------+---------------------+---------------------+-----------+
| TransformationID | FileID  | Status   | ErrorCount | TaskID | TargetSE | UsedSE    | LastUpdate          | InsertedTime        | RunNumber |
+------------------+---------+----------+------------+--------+----------+-----------+---------------------+---------------------+-----------+
|             8178 | 3865218 | Assigned |          0 | 21849  | Unknown  | IN2P3-RAW | 2010-11-08 14:12:43 | 2010-11-08 14:10:14 |     81356 |
|             8178 | 3864794 | Assigned |          0 | 22103  | Unknown  | IN2P3-RAW | 2010-11-08 14:15:07 | 2010-11-08 14:10:14 |     81356 |
+------------------+---------+----------+------------+--------+----------+-----------+---------------------+---------------------+-----------+
2 rows in set (0.06 sec)

BE VERY -VERY- CAREFUL WHEN DOING THIS OPERATION!!!

mysql> UPDATE TransformationFiles SET Status = 'Unused' WHERE TransformationID = 8178 AND TaskID IN (22103, 21849); 
Query OK, 2 rows affected (0.07 sec)
Rows matched: 2  Changed: 2  Warnings: 0

Looking at why Merging production 8187 still had 30 files in status "Assigned"

This is a different case from the previous one. One merging job tried to merge these files:

mysql> SELECT DISTINCT TaskID FROM TransformationFiles WHERE Status = 'Assigned' AND TransformationID = 8187;
+--------+
| TaskID |
+--------+
| 239    |
+--------+
1 row in set (0.02 sec)

Then I have a look at which DIRAC Job ID tried to merge these files:

mysql> SELECT * FROM TransformationTasks WHERE TaskID = 239 AND TransformationID = 8187;
+--------+------------------+----------------+------------+----------+---------------------+---------------------+-----------+
| TaskID | TransformationID | ExternalStatus | ExternalID | TargetSE | CreationTime        | LastUpdateTime      | RunNumber |
+--------+------------------+----------------+------------+----------+---------------------+---------------------+-----------+
|    239 |             8187 | Failed         | 12631051   | RAL-DST  | 2010-11-08 20:20:45 | 2010-11-12 04:26:59 |     81311 |
+--------+------------------+----------------+------------+----------+---------------------+---------------------+-----------+
1 row in set (0.00 sec)

I discover this job is stalled. The job application ran to completion in the RAL queue while the queue was already draining ahead of a downtime. The job then started to finalize but, I assume, because the SEs were banned, the output went to Failover, from where it was later picked up without problems. So the output is now replicated and registered in the bookkeeping; the merged file even has a DataQuality flag = 'OK'. However, for a reason I don't know, the job was still in the queue and stalled about 36 hours later, because the pilot was aborted.

So this time, instead of resetting the files to "Unused", I marked them as "Processed".

No new job was created in this case because the DataRecoveryAgent checks whether a file already has descendants registered in the bookkeeping before creating a new job. When it does, the agent just stops and prints a line in the log.
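
The two cases above follow the same rule of thumb: a file stuck in "Assigned" with no descendants in the bookkeeping goes back to "Unused", while a file whose descendants are already registered is marked "Processed". The Python sketch below captures only that rule. recommended_status is a hypothetical helper, the descendant lists are assumed to have been collected by hand from dirac-bookkeeping-get-file-descendants, and the second entry uses invented placeholder LFNs. It does not replace the manual checks described above (job output, replicas, DQ flag).

```python
def recommended_status(descendants):
    """Suggest a new status for an input file stuck in 'Assigned'."""
    # No registered descendants: the file was never really processed,
    # so it can be handed out to a new job.
    if not descendants:
        return "Unused"
    # Descendants exist: reprocessing would duplicate the output.
    return "Processed"


stuck_files = {
    # Real input file from the first example: no descendants were found.
    "/lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81356/081356_0000000280.raw": [],
    # Placeholder entry (invented LFNs) standing in for the merging case.
    "/hypothetical/input/file.dst": ["/hypothetical/merged/output.dst"],
}

for lfn, descendants in stuck_files.items():
    print(lfn, "->", recommended_status(descendants))
```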

Looking at why there is a significant mismatch between the number of files merged or to be merged and those produced by the DataReconstruction production

Probably some runs have not reached the merging stage. Is this because they have not been flagged yet? First of all, let's see which runs are left over:

mysql> SELECT * FROM TransformationRuns WHERE TransformationID = 8178 AND RunNumber NOT IN (SELECT DISTINCT RunNumber FROM TransformationRuns WHERE TransformationID = 8179);
+------------------+-----------+---------------+--------+---------------------+
| TransformationID | RunNumber | SelectedSite  | Status | LastUpdate          |
+------------------+-----------+---------------+--------+---------------------+
|             8178 |         0 | LCG.IN2P3.fr  | Active | 2010-10-28 08:50:16 |
|             8178 |     81597 | LCG.SARA.nl   | Active | 2010-10-29 15:12:27 |
|             8178 |     81605 | LCG.GRIDKA.de | Active | 2010-10-29 15:34:03 |
|             8178 |     81606 | LCG.GRIDKA.de | Active | 2010-10-29 15:34:04 |
|             8178 |     81610 | LCG.GRIDKA.de | Active | 2010-10-30 09:25:03 |
|             8178 |     81611 | LCG.GRIDKA.de | Active | 2010-10-30 09:25:04 |
|             8178 |     81608 | LCG.GRIDKA.de | Active | 2010-10-30 09:57:02 |
|             8178 |     81609 | LCG.GRIDKA.de | Active | 2010-10-30 09:57:03 |
|             8178 |     81356 | LCG.IN2P3.fr  | Active | 2010-11-08 14:12:31 |
+------------------+-----------+---------------+--------+---------------------+
9 rows in set (0.00 sec)

(8178 is the Reconstruction production, 8179 is a Merging production.)

Just to check, I also looked at the other Merging productions, and the list of runs is the same. The run with number 0 does not seem to be a problem (re-check!).
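
The "runs in the reconstruction production but not in the merging one" check can also be written as an anti-join instead of a NOT IN subquery. The Python sketch below runs it against an in-memory SQLite table that only imitates TransformationRuns with two columns and a handful of toy rows; the function name and the reduced schema are assumptions made for illustration, not the real ProductionDB.

```python
import sqlite3


def runs_missing_downstream(conn, upstream_id, downstream_id):
    """Runs known to the upstream production but absent from the downstream one."""
    query = """
        SELECT up.RunNumber
        FROM TransformationRuns AS up
        LEFT JOIN TransformationRuns AS down
               ON down.RunNumber = up.RunNumber
              AND down.TransformationID = ?
        WHERE up.TransformationID = ?
          AND down.RunNumber IS NULL
        ORDER BY up.RunNumber
    """
    return [row[0] for row in conn.execute(query, (downstream_id, upstream_id))]


# Toy stand-in for the ProductionDB table (reduced schema, illustrative rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransformationRuns (TransformationID INTEGER, RunNumber INTEGER)"
)
conn.executemany(
    "INSERT INTO TransformationRuns VALUES (?, ?)",
    [(8178, 81356), (8178, 81597), (8178, 81311), (8179, 81311)],
)

missing = runs_missing_downstream(conn, 8178, 8179)
print(missing)
```

Wrapping the query in a function makes it easy to repeat the comparison for each of the Merging productions in turn.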

Then, I look at the bookkeeping for the run files:

[lxplus303] ~/Jobs $ dirac-bookkeeping-run-files 81609

Get the descendants of one of these:

[lxplus303] ~/Jobs $ dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81609/081609_0000000035.raw 9

And check if they have been flagged:

[lxplus303] ~/Jobs $ dirac-bookkeeping-file-metadata /lhcb/data/2010/DST/00008178/0001/00008178_00017796_2.Dimuon.dst

I discovered they were "Unchecked", so I informed the DQ manager (Marco Adinolfi) that the runs had not yet been flagged.

A couple of useful queries

This last section is meant to give some practical suggestions to experts. First of all, we strongly recommend never running an SQL UPDATE statement without first running the equivalent SELECT statement: you have to be 100% sure that you act on the desired rows. Furthermore, we strongly suggest not modifying more than one field at a time in a single SQL UPDATE statement. It is better to split the changes into two UPDATE statements than to try to do everything at once (this is a known SQL bug and you might screw up your table).

It often happens that the number of files in "Assigned" is so large that it is not convenient to jump from one table (TransformationTasks) to the other (TransformationFiles), selecting TaskIDs from one and copying them into a SELECT query against the other. The files in status "Assigned" isolated in one table must be the input for querying the ExternalStatus of the tasks in the other. Nesting SELECT queries is expensive: it has happened that we killed the DB for several hours by leaving one running (without any result in the end). This is why the following query (just change the TransformationID) is useful for investigating a large number of files that are not "Processed".
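
The "SELECT first, then UPDATE one field" advice can be sketched as a small guarded helper. The Python example below uses an in-memory SQLite table as a stand-in for TransformationFiles (reduced schema; the third row is invented), and guarded_status_update is a hypothetical function, not part of DIRAC. It runs the SELECT with exactly the same WHERE clause, refuses to continue unless the row count is the one you expect, and then changes a single column inside a transaction.

```python
import sqlite3


def guarded_status_update(conn, transformation_id, task_ids, new_status, expected_rows):
    """Run the SELECT first, then update only the Status column."""
    placeholders = ",".join("?" for _ in task_ids)
    where = f"TransformationID = ? AND TaskID IN ({placeholders})"
    params = (transformation_id, *task_ids)

    # Step 1: the equivalent SELECT, with exactly the same WHERE clause.
    rows = conn.execute(
        f"SELECT FileID, Status FROM TransformationFiles WHERE {where} ORDER BY FileID",
        params,
    ).fetchall()
    if len(rows) != expected_rows:
        raise RuntimeError(f"SELECT matched {len(rows)} rows, expected {expected_rows}")

    # Step 2: update one single field; an exception here rolls the change back.
    with conn:
        cursor = conn.execute(
            f"UPDATE TransformationFiles SET Status = ? WHERE {where}",
            (new_status, *params),
        )
        if cursor.rowcount != expected_rows:
            raise RuntimeError("UPDATE touched an unexpected number of rows")
    return [file_id for file_id, _status in rows]


# Toy stand-in for the ProductionDB table (reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransformationFiles "
    "(TransformationID INTEGER, FileID INTEGER, Status TEXT, TaskID INTEGER)"
)
conn.executemany(
    "INSERT INTO TransformationFiles VALUES (?, ?, ?, ?)",
    [
        (8178, 3865218, "Assigned", 21849),
        (8178, 3864794, "Assigned", 22103),
        (8178, 1, "Processed", 5),  # invented row that must stay untouched
    ],
)
conn.commit()

updated = guarded_status_update(conn, 8178, [22103, 21849], "Unused", expected_rows=2)
print(updated)
```

Passing the expected row count explicitly forces you to look at the SELECT output before anything is written.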

select TransformationFiles.FileID, TransformationTasks.TaskID, TransformationFiles.Status, TransformationTasks.ExternalStatus,TransformationTasks.ExternalID from TransformationFiles JOIN TransformationTasks ON (TransformationTasks.TaskID=TransformationFiles.TaskID and TransformationFiles.TransformationID=TransformationTasks.TransformationID) where TransformationFiles.TransformationID=10886 and TransformationFiles.Status="Assigned" order by TransformationFiles.FileID; 

An even heavier and more complicated query, which saves a lot of time if one just has to check the descendants of the input files for the various "problematic" tasks, is something like the following. Please note that I also require ExternalStatus = Completed or Done, because in the large majority of cases where a file is in "Assigned" but the job has completed, the file status was simply not changed (DISET issue) although the file has produced descendants.

mysql> select DataFiles.LFN, TransformationFiles.FileID, TransformationTasks.TaskID, TransformationFiles.Status, TransformationTasks.ExternalStatus,TransformationTasks.ExternalID from TransformationFiles JOIN TransformationTasks ON (TransformationTasks.TaskID=TransformationFiles.TaskID and TransformationFiles.TransformationID=TransformationTasks.TransformationID) JOIN DataFiles ON (DataFiles.FileID=TransformationFiles.FileID) where TransformationFiles.TransformationID=10886 and TransformationFiles.Status="Assigned" and (TransformationTasks.ExternalStatus="Completed" or TransformationTasks.ExternalStatus="Done") order by TransformationFiles.FileID ; 

+--------------------------------------------------------------------+----------+--------+----------+----------------+------------+
| LFN                                                                | FileID   | TaskID | Status   | ExternalStatus | ExternalID |
+--------------------------------------------------------------------+----------+--------+----------+----------------+------------+
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00000258_1.sdst | 12699074 |   4513 | Assigned | Completed      | 21922269   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005178_1.sdst | 13057230 |   3152 | Assigned | Completed      | 21888392   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005230_1.sdst | 13057257 |   3191 | Assigned | Completed      | 21888492   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005257_1.sdst | 13057268 |   3157 | Assigned | Completed      | 21888406   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005813_1.sdst | 13057279 |   3322 | Assigned | Completed      | 21888770   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005831_1.sdst | 13057291 |   3420 | Assigned | Completed      | 21890235   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005893_1.sdst | 13057335 |   3334 | Assigned | Completed      | 21889989   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005923_1.sdst | 13057349 |   3400 | Assigned | Completed      | 21890173   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005203_1.sdst | 13058177 |   3222 | Assigned | Completed      | 21888574   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005210_1.sdst | 13058182 |   3155 | Assigned | Completed      | 21888400   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005248_1.sdst | 13058197 |   3178 | Assigned | Completed      | 21888459   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005250_1.sdst | 13058199 |   3147 | Assigned | Completed      | 21888380   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005808_1.sdst | 13058206 |   3444 | Assigned | Completed      | 21890307   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005840_1.sdst | 13058219 |   3429 | Assigned | Completed      | 21890263   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005867_1.sdst | 13058224 |   3304 | Assigned | Completed      | 21888736   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005887_1.sdst | 13058228 |   3360 | Assigned | Completed      | 21890062   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005917_1.sdst | 13058242 |   3393 | Assigned | Completed      | 21890155   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005943_1.sdst | 13058252 |   3384 | Assigned | Completed      | 21890129   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005976_1.sdst | 13058263 |   3414 | Assigned | Completed      | 21890215   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00005991_1.sdst | 13058268 |   3339 | Assigned | Completed      | 21890002   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00004539_1.sdst | 13121700 |   9192 | Assigned | Completed      | 21970280   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00004100_1.sdst | 13123716 |   6213 | Assigned | Done           | 21938993   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00009523_1.sdst | 13123914 |   6374 | Assigned | Completed      | 21940493   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012664_1.sdst | 13129031 |   7069 | Assigned | Completed      | 21945808   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012670_1.sdst | 13129037 |   7071 | Assigned | Completed      | 21945814   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012686_1.sdst | 13129053 |   7059 | Assigned | Completed      | 21945781   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012694_1.sdst | 13129061 |   6947 | Assigned | Completed      | 21945565   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012701_1.sdst | 13129068 |   7051 | Assigned | Completed      | 21945761   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012722_1.sdst | 13129089 |   7070 | Assigned | Completed      | 21945811   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012730_1.sdst | 13129097 |   7066 | Assigned | Completed      | 21945800   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012731_1.sdst | 13129098 |   7054 | Assigned | Completed      | 21945766   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012732_1.sdst | 13129099 |   6897 | Assigned | Completed      | 21944237   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012740_1.sdst | 13129107 |   7068 | Assigned | Completed      | 21945806   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012760_1.sdst | 13129127 |   7090 | Assigned | Completed      | 21945865   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012786_1.sdst | 13129153 |   7057 | Assigned | Completed      | 21945774   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012794_1.sdst | 13129161 |   7062 | Assigned | Completed      | 21945790   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012797_1.sdst | 13129164 |   7061 | Assigned | Completed      | 21945787   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012802_1.sdst | 13129169 |   7037 | Assigned | Completed      | 21945731   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012805_1.sdst | 13129172 |   7049 | Assigned | Completed      | 21945757   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012808_1.sdst | 13129175 |   6910 | Assigned | Completed      | 21944267   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012815_1.sdst | 13129182 |   7076 | Assigned | Done           | 21945828   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012850_1.sdst | 13129217 |   6950 | Assigned | Completed      | 21945573   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00012871_1.sdst | 13129238 |   6966 | Assigned | Completed      | 21945608   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002142_1.sdst | 13129945 |   7351 | Assigned | Done           | 21949684   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002143_1.sdst | 13129946 |   7444 | Assigned | Completed      | 21949888   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002146_1.sdst | 13129949 |   7459 | Assigned | Completed      | 21949909   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002167_1.sdst | 13129970 |   7322 | Assigned | Completed      | 21948292   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002170_1.sdst | 13129973 |   7428 | Assigned | Completed      | 21949861   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002171_1.sdst | 13129974 |   7432 | Assigned | Completed      | 21949868   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002173_1.sdst | 13129976 |   7318 | Assigned | Completed      | 21948280   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002177_1.sdst | 13129980 |   7437 | Assigned | Done           | 21949877   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002178_1.sdst | 13129981 |   7330 | Assigned | Completed      | 21948315   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002179_1.sdst | 13129982 |   7347 | Assigned | Completed      | 21949672   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002185_1.sdst | 13129988 |   7449 | Assigned | Completed      | 21949894   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002186_1.sdst | 13129989 |   7287 | Assigned | Completed      | 21948196   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002187_1.sdst | 13129990 |   7324 | Assigned | Completed      | 21948298   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002193_1.sdst | 13129996 |   7420 | Assigned | Completed      | 21949848   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002194_1.sdst | 13129997 |   7294 | Assigned | Done           | 21948210   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002201_1.sdst | 13130004 |   7407 | Assigned | Completed      | 21949823   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002202_1.sdst | 13130005 |   7303 | Assigned | Completed      | 21948235   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002208_1.sdst | 13130011 |   7278 | Assigned | Completed      | 21948177   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002209_1.sdst | 13130012 |   7410 | Assigned | Done           | 21949829   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002218_1.sdst | 13130021 |   7368 | Assigned | Completed      | 21949735   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002221_1.sdst | 13130024 |   7441 | Assigned | Completed      | 21949884   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002223_1.sdst | 13130026 |   7451 | Assigned | Completed      | 21949897   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002225_1.sdst | 13130028 |   7336 | Assigned | Completed      | 21949643   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002229_1.sdst | 13130032 |   7399 | Assigned | Done           | 21949809   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002235_1.sdst | 13130038 |   7308 | Assigned | Completed      | 21948251   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002241_1.sdst | 13130044 |   7328 | Assigned | Completed      | 21948310   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002242_1.sdst | 13130045 |   7415 | Assigned | Completed      | 21949839   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002244_1.sdst | 13130047 |   7313 | Assigned | Completed      | 21948266   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002254_1.sdst | 13130057 |   7372 | Assigned | Completed      | 21949747   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002266_1.sdst | 13130069 |   7317 | Assigned | Completed      | 21948277   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002270_1.sdst | 13130073 |   7310 | Assigned | Completed      | 21948257   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002271_1.sdst | 13130074 |   7345 | Assigned | Completed      | 21949666   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002272_1.sdst | 13130075 |   7390 | Assigned | Completed      | 21949794   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002273_1.sdst | 13130076 |   7387 | Assigned | Done           | 21949787   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002276_1.sdst | 13130079 |   7376 | Assigned | Completed      | 21949757   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002278_1.sdst | 13130081 |   7398 | Assigned | Completed      | 21949808   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002281_1.sdst | 13130084 |   7285 | Assigned | Completed      | 21948191   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002285_1.sdst | 13130088 |   7264 | Assigned | Completed      | 21948151   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002288_1.sdst | 13130091 |   7418 | Assigned | Completed      | 21949845   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002292_1.sdst | 13130095 |   7455 | Assigned | Completed      | 21949903   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002297_1.sdst | 13130100 |   7288 | Assigned | Completed      | 21948198   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002298_1.sdst | 13130101 |   7435 | Assigned | Completed      | 21949873   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002305_1.sdst | 13130108 |   7379 | Assigned | Completed      | 21949765   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002307_1.sdst | 13130110 |   7265 | Assigned | Completed      | 21948153   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002308_1.sdst | 13130111 |   7386 | Assigned | Completed      | 21949785   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002313_1.sdst | 13130116 |   7446 | Assigned | Completed      | 21949891   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002317_1.sdst | 13130120 |   7430 | Assigned | Completed      | 21949865   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002332_1.sdst | 13130135 |   7348 | Assigned | Done           | 21949675   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002336_1.sdst | 13130139 |   7433 | Assigned | Completed      | 21949870   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002338_1.sdst | 13130141 |   7326 | Assigned | Completed      | 21948305   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002339_1.sdst | 13130142 |   7440 | Assigned | Completed      | 21949882   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002341_1.sdst | 13130144 |   7395 | Assigned | Completed      | 21949802   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002343_1.sdst | 13130146 |   7442 | Assigned | Done           | 21949885   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00002993_1.sdst | 13131254 |   7887 | Assigned | Completed      | 21954701   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003009_1.sdst | 13131270 |   7889 | Assigned | Completed      | 21954708   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003011_1.sdst | 13131272 |   7856 | Assigned | Done           | 21954609   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003047_1.sdst | 13131308 |   7854 | Assigned | Completed      | 21954605   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003048_1.sdst | 13131309 |   7862 | Assigned | Completed      | 21954627   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003070_1.sdst | 13131331 |   7851 | Assigned | Completed      | 21954595   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003076_1.sdst | 13131337 |   7857 | Assigned | Completed      | 21954612   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003099_1.sdst | 13131360 |   7885 | Assigned | Completed      | 21954694   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003102_1.sdst | 13131363 |   7868 | Assigned | Completed      | 21954645   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003115_1.sdst | 13131376 |   7847 | Assigned | Completed      | 21954582   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003126_1.sdst | 13131387 |   7884 | Assigned | Done           | 21954692   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00003135_1.sdst | 13131396 |   7870 | Assigned | Done           | 21954650   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00008506_1.sdst | 13135388 |   8084 | Assigned | Done           | 21956734   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013330_1.sdst | 13141109 |   8553 | Assigned | Done           | 21961525   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013335_1.sdst | 13141114 |   8573 | Assigned | Completed      | 21961545   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013337_1.sdst | 13141116 |   8646 | Assigned | Completed      | 21961650   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013467_1.sdst | 13141245 |   8513 | Assigned | Done           | 21960513   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013510_1.sdst | 13141288 |   8622 | Assigned | Completed      | 21961601   |
| /lhcb/LHCb/Collision11/SDST/00010883/0001/00010883_00013525_1.sdst | 13141303 |   8592 | Assigned | Completed      | 21961564   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00001405_1.sdst | 13143948 |   8887 | Assigned | Completed      | 21963134   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00001423_1.sdst | 13143966 |   8988 | Assigned | Completed      | 21964295   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00001523_1.sdst | 13144064 |   9059 | Assigned | Completed      | 21964438   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00001550_1.sdst | 13144090 |   9092 | Assigned | Done           | 21964513   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00001584_1.sdst | 13144122 |   9083 | Assigned | Completed      | 21964493   |
| /lhcb/LHCb/Collision11/SDST/00010883/0000/00010883_00009308_1.sdst | 13144185 |   8956 | Assigned | Done           | 21964250   |
+--------------------------------------------------------------------+----------+--------+----------+----------------+------------+
121 rows in set (53.28 sec)

Another useful query joins only the TransformationTasks and TransformationFiles tables (without DataFiles, unlike the previous query):

> select TransformationFiles.FileID,TransformationFiles.LastUpdate, TransformationTasks.TaskID, TransformationFiles.Status, TransformationTasks.ExternalStatus,TransformationTasks.ExternalID from TransformationFiles JOIN TransformationTasks ON (TransformationTasks.TaskID=TransformationFiles.TaskID and TransformationFiles.TransformationID=TransformationTasks.TransformationID)  where TransformationFiles.TransformationID=19146 and TransformationFiles.Status="Assigned" order by TransformationFiles.FileID ;
+----------+---------------------+--------+----------+----------------+------------+
| FileID   | LastUpdate          | TaskID | Status   | ExternalStatus | ExternalID |
+----------+---------------------+--------+----------+----------------+------------+
| 26706100 | 2012-08-01 09:16:48 |  24570 | Assigned | Rescheduled    | 35205201   |
| 26706101 | 2012-08-01 09:16:48 |  24570 | Assigned | Rescheduled    | 35205201   |
| 26706102 | 2012-08-01 09:16:48 |  24570 | Assigned | Rescheduled    | 35205201   |
| 26706103 | 2012-08-01 09:16:48 |  24570 | Assigned | Rescheduled    | 35205201   |
...
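
When such a query returns many rows, it helps to collapse them to one line per task before looking at the jobs. The Python sketch below groups the FileIDs by task; files_per_task is a hypothetical helper, and the rows are the four shown in the output above, typed in as tuples in the column order of that query.

```python
from collections import defaultdict


def files_per_task(rows):
    """Group FileIDs by (TaskID, ExternalStatus, ExternalID)."""
    grouped = defaultdict(list)
    # Column order: FileID, LastUpdate, TaskID, Status, ExternalStatus, ExternalID
    for file_id, _last_update, task_id, _status, external_status, external_id in rows:
        grouped[(task_id, external_status, external_id)].append(file_id)
    return dict(grouped)


rows = [
    (26706100, "2012-08-01 09:16:48", 24570, "Assigned", "Rescheduled", "35205201"),
    (26706101, "2012-08-01 09:16:48", 24570, "Assigned", "Rescheduled", "35205201"),
    (26706102, "2012-08-01 09:16:48", 24570, "Assigned", "Rescheduled", "35205201"),
    (26706103, "2012-08-01 09:16:48", 24570, "Assigned", "Rescheduled", "35205201"),
]

summary = files_per_task(rows)
for (task_id, external_status, external_id), file_ids in summary.items():
    print(task_id, external_status, external_id, len(file_ids), "files")
```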

Productions with some Unused files, which have run number = zero in production DB

It sometimes happens that some files are still 'Unused' even after flushing a production, because their run number is zero in the ProductionDB.TransformationFiles table (though in the Bookkeeping it is a valid number!). This is a race condition: the TransformationAgent picks up the files from the ProductionDB before the BookkeepingWatchAgent has inserted the run number.

As an example: this elog. What to do:

 $ dirac-transformation-debug 16771 --Status Unused

 ==============================
 Transformation 16771 : RADIATIVE.DST_Merging_Request7299_Stripping17b_90000000_1.xml of type Merge (plugin MergeByRunWithFlush, GroupSize: 5) in Stripping17b
 BKQuery: {'FileType': 'RADIATIVE.DST', 'ProductionID': 16764L, 'DataQualityFlag': 'OK'}
 9 files found with status Unused

 9 files have run number 0, use --FixIt to get this fixed

 $ dirac-transformation-debug 16771 --Status Unused --FixIt

 ==============================
 Transformation 16771 : RADIATIVE.DST_Merging_Request7299_Stripping17b_90000000_1.xml of type Merge (plugin MergeByRunWithFlush, GroupSize: 5) in Stripping17b
 BKQuery: {'FileType': 'RADIATIVE.DST', 'ProductionID': 16764L, 'DataQualityFlag': 'OK'}
 9 files found with status Unused

 Successfully fixed run number for 9 files
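
The logic behind this fix can be pictured with the sketch below. It is not the implementation of dirac-transformation-debug: the record layout, the function name and the LFNs are all invented for illustration, and the Bookkeeping is replaced by a plain dictionary.

```python
def files_needing_run_fix(transformation_files, bookkeeping_runs):
    """Map LFN -> Bookkeeping run number for Unused files whose run is 0."""
    fixes = {}
    for record in transformation_files:
        if record["Status"] == "Unused" and record["RunNumber"] == 0:
            run = bookkeeping_runs.get(record["LFN"], 0)
            if run:  # propose a fix only when the Bookkeeping has a valid run
                fixes[record["LFN"]] = run
    return fixes


# Hypothetical records, as they might look in TransformationFiles.
files = [
    {"LFN": "/lhcb/example/file_1.dst", "Status": "Unused", "RunNumber": 0},
    {"LFN": "/lhcb/example/file_2.dst", "Status": "Unused", "RunNumber": 98269},
    {"LFN": "/lhcb/example/file_3.dst", "Status": "Processed", "RunNumber": 0},
]
# Hypothetical run numbers as known to the Bookkeeping.
bookkeeping = {
    "/lhcb/example/file_1.dst": 98269,
    "/lhcb/example/file_3.dst": 98270,
}

to_fix = files_needing_run_fix(files, bookkeeping)
print(to_fix)
```

Only the Unused file with run number 0 is selected; files that already have a run number, or that are not Unused, are left alone.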

Productions with some 'Unused' files as they only have a failover replica in the LFC but no 'regular' replica

It can happen that a file fails to be replicated to a regular SE and only has a failover replica. In this case it will not be processed by the transformation, and the file will stay in Unused status. Example:

dirac-dms-lfn-replicas -a /lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst
Successful  : 
    /lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst  : 
        RAL-FAILOVER  :  (-) srm://srm-lhcb.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/lhcb/failover/lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst

To find the SE where the file should be replicated, get the run number for this file:

mysql> select RunNumber from TransformationFiles where FileID = (select FileID from DataFiles where LFN='/lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst');
+-----------+
| RunNumber |
+-----------+
|     98269 |
+-----------+

and look where the other files of this run are (a run is at one site only):

mysql> select distinct UsedSE from TransformationFiles where TransformationID=16986 and  RunNumber=98269 and Status='Processed';
+------------+
| UsedSE     |
+------------+
| GRIDKA-DST |
+------------+
1 row in set (0.09 sec)

How to fix it: check in the RequestDB whether there is a pending request for that transfer, and try to understand why it is not being executed. In this case there was no such request, so the solution is to replicate manually:

 > dirac-dms-replicate-lfn /lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst GRIDKA-DST
**** Set replica flag on Bookkeeping/BookkeepingManager
{'Failed': {},
 'Successful': {'/lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/00016983_00017390_1.CharmCompleteEvent.dst': {'register': 5.94293212890625,
                                                                                                                          'replicate': 15.704391002655029}}}

Finally, flush the production; a new job should then be created.
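
The two decisions in this procedure (is the file failover-only, and which SE should receive the replica) can be sketched as follows. The helper names are hypothetical, and the rule that failover SE names end in "-FAILOVER" is an assumption taken from the RAL-FAILOVER example above.

```python
def failover_only(replicas):
    """True if the file has replicas and all of them are on failover SEs.

    Assumption: failover SE names end in '-FAILOVER'.
    """
    return bool(replicas) and all(se.endswith("-FAILOVER") for se in replicas)


def target_se_for_run(processed_ses):
    """Return the single SE used by the run's processed files, else None."""
    unique = set(processed_ses)
    return unique.pop() if len(unique) == 1 else None


lfn = ("/lhcb/LHCb/Collision11/CHARMCOMPLETEEVENT.DST/00016983/0001/"
       "00016983_00017390_1.CharmCompleteEvent.dst")
replicas = {"RAL-FAILOVER": "srm://srm-lhcb.gridpp.rl.ac.uk/castor/ads.rl.ac.uk"
                            "/prod/lhcb/failover" + lfn}

if failover_only(replicas):
    # UsedSE values as returned by the 'select distinct UsedSE' query above.
    print("replicate to", target_se_for_run(["GRIDKA-DST"]))
```

Returning None when more than one SE is found is deliberate: the procedure relies on a run being at one site only, so an ambiguous answer should be looked at by hand.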

Productions with some jobs in 'Rescheduled' status for many days

While checking why some files were still in status 'Assigned', I found a job that had been reported in the ProductionDB as Rescheduled for more than one month:

mysql> SELECT * FROM TransformationTasks WHERE TransformationID = 16994 AND TaskID=615;
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+
| TaskID | TransformationID | ExternalStatus | ExternalID | TargetSE   | CreationTime        | LastUpdateTime      | RunNumber |
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+
|    615 |            16994 | Rescheduled    | 31411895   | GRIDKA-DST | 2012-04-12 14:36:55 | 2012-04-12 16:31:19 |    101855 |
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+
It is not clear why, but it might be a problem with the GridKa batch system, as another job was found in the same situation at the time:
mysql> SELECT * FROM TransformationTasks WHERE TransformationID = 16997 AND TaskID=632;
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+
| TaskID | TransformationID | ExternalStatus | ExternalID | TargetSE   | CreationTime        | LastUpdateTime      | RunNumber |
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+
|    632 |            16997 | Rescheduled    | 31405549   | GRIDKA-DST | 2012-04-12 08:42:43 | 2012-04-12 16:31:19 |    101776 |
+--------+------------------+----------------+------------+------------+---------------------+---------------------+-----------+

What to do: kill the jobs; the input files will then be automatically reset to Unused.

Problem: killing the job through the JobMonitor interface sets the status 'Killed' in the JobDB, but for some reason the status is not propagated to the ProductionDB, where it remains Rescheduled. The WorkflowTaskAgent is responsible for propagating the job status, which usually takes a few minutes.

Temporary solution: it was not possible to understand why these jobs were stuck in status 'Rescheduled'; a practical workaround is to update the ProductionDB and set the task's ExternalStatus to 'Killed' by hand:

mysql> update TransformationTasks set ExternalStatus='Killed' WHERE TransformationID = 16994 AND TaskID=615;
Shortly after, the files are reset to 'Unused', a new task is created, and a new job is submitted.

Fixed on 31/08/2012 (solution here).
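
Tasks stuck in Rescheduled, like the two shown in this section, can be spotted by comparing LastUpdateTime with the current time. The sketch below works on rows already fetched from TransformationTasks; the 7-day threshold, the function name and the row marked "made-up" are assumptions for illustration.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def stale_rescheduled(tasks, now, max_age=timedelta(days=7)):
    """Return (TransformationID, TaskID) of tasks Rescheduled for too long."""
    stale = []
    for task in tasks:
        age = now - datetime.strptime(task["LastUpdateTime"], TIME_FORMAT)
        if task["ExternalStatus"] == "Rescheduled" and age > max_age:
            stale.append((task["TransformationID"], task["TaskID"]))
    return stale


tasks = [
    {"TransformationID": 16994, "TaskID": 615,
     "ExternalStatus": "Rescheduled", "LastUpdateTime": "2012-04-12 16:31:19"},
    {"TransformationID": 16997, "TaskID": 632,
     "ExternalStatus": "Rescheduled", "LastUpdateTime": "2012-04-12 16:31:19"},
    {"TransformationID": 16997, "TaskID": 999,  # made-up, recently updated
     "ExternalStatus": "Rescheduled", "LastUpdateTime": "2012-05-19 12:00:00"},
]

stuck = stale_rescheduled(tasks, now=datetime(2012, 5, 20))
print(stuck)
```

A job that was rescheduled only recently is normal, which is why the age threshold matters: only tasks that have not moved for days deserve a manual intervention.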

Production with some Assigned files, though the Dirac job is reported as Done

In production 18310 (Aug 2012) and also in production 18117 (see http://lblogbook.cern.ch/Operations/12057) some files are in Assigned status while the corresponding job status is Done:

mysql> select ExternalStatus,ExternalID,TargetSE,CreationTime,LastUpdateTime from TransformationTasks where  TransformationID=18310 and TaskID in (15954,15957,10580,8055,73965) order by ExternalID;
+----------------+------------+----------+---------------------+---------------------+
| ExternalStatus | ExternalID | TargetSE | CreationTime        | LastUpdateTime      |
+----------------+------------+----------+---------------------+---------------------+
| Done           | 32608697   | CERN-RAW | 2012-05-20 06:13:41 | 2012-05-21 22:03:13 |
| Done           | 32612616   | CERN-RAW | 2012-05-20 10:30:27 | 2012-05-20 16:16:59 |
| Done           | 32675342   | CERN-RAW | 2012-05-21 21:55:34 | 2012-05-22 14:33:06 |
| Done           | 32675345   | CERN-RAW | 2012-05-21 21:55:35 | 2012-05-22 14:33:06 |
| Failed         | 33680504   | RAL-RAW  | 2012-06-17 17:19:43 | 2012-06-19 10:26:16 |
+----------------+------------+----------+---------------------+---------------------+
5 rows in set (0.00 sec)

Three of the jobs at CERN (32608697, 32675342, 32675345) are in Done status, but their application status is 'Brunel Exited With Status 137' (though no error message is reported in Brunel_00018310_00015957_1.log). The error status is detected in std.out:

2012-05-22 14:14:11 UTC dirac-jobexec/GaudiApplication  INFO: Status after the application execution is 137
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR: Brunel execution completed with errors
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR: ==================================
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR:  StdError:
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR:
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR: /bin/sh: line 3: 28327 Killed gaudirun.py $APPCONFIGOPTS/Brunel/DataType-2012.py prodConf_Brunel_00018310_00015957_1.py
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication ERROR: Brunel Exited With Status 137
2012-05-22 14:14:12 UTC dirac-jobexec/GaudiApplication  INFO: ===== Terminating $Id: GaudiApplication.py 50947 2012-04-17 15:54:09Z fstagni $ =====
2012-05-22 14:16:24 UTC dirac-jobexec/AnalyseLogFile  INFO: ===== Executing $Id: AnalyseLogFile.py 50940 2012-04-17 14:13:28Z fstagni $ =====

However, check whether there are descendants; if there are none, the jobs can be set to Killed in the production DB, which should reset the files to Unused.

Explanation: these statuses (Completed, Done) are set by the JobWrapper, not by the workflow. The workflow is indeed correct: it reports the errors. But in all these cases the process was killed, so this points to a bug in the Watchdog. The difficult part is that for debugging it would be VERY useful to have the pilot output, which unfortunately does not exist. A GitHub issue was opened (04/09/2012).
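
The status 137 in the log is consistent with the process having been killed: a shell reports 128 plus the signal number when a command is terminated by a signal, and 137 - 128 = 9 is SIGKILL. A small sketch of that decoding (the function name is illustrative; it assumes a POSIX system):

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status; values above 128 mean a signal."""
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:  # no signal with that number on this platform
            return "killed by signal %d" % signum
    return "exited with code %d" % status


print(describe_exit_status(137))  # killed by SIGKILL
```

This only says that something sent SIGKILL to the application; it does not say what sent it.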

Temporary solution: in such a case, the input files of the job should be reset to Unused using the command dirac-transformation-reset-files. This will of course create an inconsistent situation in the ProductionDB, because the corresponding job is still reported as Done.

Waiting for the fix.

Production with some Assigned files whose corresponding tasks have external status Failed

For example, in production 18979 we found two files in status Assigned whose corresponding tasks are reported as Failed:

mysql> select TransformationFiles.FileID, TransformationFiles.RunNUmber,TransformationFiles.LastUpdate,TransformationTasks.TaskID, TransformationFiles.Status, TransformationTasks.ExternalStatus,TransformationTasks.ExternalID from TransformationFiles JOIN TransformationTasks ON (TransformationTasks.TaskID=TransformationFiles.TaskID and TransformationFiles.TransformationID=TransformationTasks.TransformationID)  where TransformationFiles.TransformationID=18979 and TransformationFiles.Status='Assigned' and  TransformationTasks.ExternalStatus='Failed' order by TransformationTasks.ExternalID;
+----------+-----------+---------------------+--------+----------+----------------+------------+
| FileID   | RunNUmber | LastUpdate          | TaskID | Status   | ExternalStatus | ExternalID |
+----------+-----------+---------------------+--------+----------+----------------+------------+
| 25982109 |    120199 | 2012-07-10 15:30:26 |   5423 | Assigned | Failed         | 34596253   |
| 26192544 |    120539 | 2012-07-13 10:43:12 |  15506 | Assigned | Failed         | 34638025   |
+----------+-----------+---------------------+--------+----------+----------------+------------+

Why aren't the input files reset to Unused?

DataRecoveryAgent (DRA) clearly skips this job:

2012-08-28 10:27:42 UTC Transformation/DataRecoveryAgent  INFO: Removing jobID 34596253 from consideration until requests are completed
It removes the job from the list of jobs to be treated because it has pending requests that are not in status 'Done'. In the JobMonitor the job does in fact have 2 DISET requests, but both are in status Done, so it is not clear why the DRA considers them pending!

TO BE UNDERSTOOD

Moreover, the two files have descendants, so they were actually processed:

 > dirac-transformation-debug --Info=Files,Jobs --Status=Assigned 18979
==============================
Transformation 18979 : DataReconstruction_Request8240_Reco13c_90000000_1.xml of type DataReconstruction (plugin AtomicRun) in Reco13c
BKQuery: {'StartRun': 119956L, 'ConfigName': 'LHCb', 'EndRun': 123803L, 'EventType': 90000000L, 'FileType': 'RAW', 'ProcessingPass': 'Real Data', 'DataQualityFlag': ['UNCHECKED', 'OK'], 'ConfigVersion': 'Collision12', 'DataTakingConditions': 'Beam4000GeV-VeloClosed-MagUp'}
2 files found with status Assigned

LFN: /lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120199/120199_0000000139.raw - Run: 120199 - Status: Assigned - UsedSE: GRIDKA-RAW - ErrorCount: 0
LFN: /lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120539/120539_0000000168.raw - Run: 120539 - Status: Assigned - UsedSE: IN2P3-RAW - ErrorCount: 1
List of jobs found:
34638025 34596253

For one of the files, the DataRecoveryAgent (DRA) reports that it cannot be reset to Unused because it has descendants for that production:

2012-09-03 14:13:23 UTC Transformation/DataRecoveryAgent  INFO: !!!!!!!! Note that transformation 18979 has descendents with         BK replica flags for files that are not marked as processed !!!!!!!!
2012-09-03 14:13:23 UTC Transformation/DataRecoveryAgent  INFO: Job 34638025, Files ['/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120539/120539_0000000168.raw']
For the other file there is no mention in the DRA log; why is still to be understood.

Solution: currently there is no procedure to fix this case, other than updating the status of the files in the ProductionDB by hand (strongly not recommended). Possible solutions: make the DRA update the status of such files in the ProductionDB to Processed, or add a feature to a command line tool (like dirac-transformation-debug) that checks whether a file has descendants for the given production and, if so, updates its status in the ProductionDB to Processed.
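
The check proposed above could look roughly like the sketch below. No such tool exists according to this page: the function name, the input dictionaries and the descendant file names are all hypothetical, and the sketch only reports candidates without touching any database.

```python
def processed_candidates(file_status, descendants):
    """Return LFNs that are still Assigned although they have descendants."""
    return sorted(
        lfn for lfn, status in file_status.items()
        if status == "Assigned" and descendants.get(lfn)
    )


file_status = {
    "/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120199/120199_0000000139.raw": "Assigned",
    "/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120539/120539_0000000168.raw": "Assigned",
    "/lhcb/example/other.raw": "Assigned",  # hypothetical file without descendants
}
# Hypothetical descendants, as a bookkeeping lookup might return them.
descendants = {
    "/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120199/120199_0000000139.raw":
        ["/lhcb/example/descendant_a.dst"],
    "/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/120539/120539_0000000168.raw":
        ["/lhcb/example/descendant_b.dst"],
    "/lhcb/example/other.raw": [],
}

candidates = processed_candidates(file_status, descendants)
for lfn in candidates:
    print("has descendants, candidate for Processed:", lfn)
```

Files that are Assigned and have no descendants are left out, since for those a reset to Unused is the appropriate action.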

TO BE FIXED

Stripping productions with some Unused files after several days

It can happen that a stripping production still has some files in Unused status several days after the files were assigned to it. This can happen even if the production was created with the ByRunWithFlush plugin, which in principle should also schedule jobs for groups of files smaller than the group size (usually we set 2 files per job for stripping and 5 files per job for the femto stripping, but this can change).

In order to unblock the situation it is necessary to select the production and Flush it. Important: Selecting a single run and flushing it will have no effect.
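
Why files can be left over, and what a flush changes, can be seen in the simplified model below. It is a toy illustration of grouping by run with a group size, not the code of the ByRunWithFlush plugin; the function name and the file names are invented.

```python
from collections import defaultdict


def group_by_run(files, group_size, flush=False):
    """Group (lfn, run) pairs into tasks of group_size files per run.

    Returns (tasks, leftover). Without flush, incomplete groups stay
    in leftover (i.e. Unused); with flush they become smaller tasks.
    """
    by_run = defaultdict(list)
    for lfn, run in files:
        by_run[run].append(lfn)

    tasks, leftover = [], []
    for run in sorted(by_run):
        lfns = by_run[run]
        while len(lfns) >= group_size:
            tasks.append(lfns[:group_size])
            lfns = lfns[group_size:]
        if lfns:
            if flush:
                tasks.append(lfns)
            else:
                leftover.extend(lfns)
    return tasks, leftover


files = [("a.dst", 1), ("b.dst", 1), ("c.dst", 1), ("d.dst", 2)]
print(group_by_run(files, group_size=2))
print(group_by_run(files, group_size=2, flush=True))
```

With a group size of 2, one file of run 1 and the only file of run 2 stay behind until the flush turns them into tasks of their own.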

Corrupted or lost input files

It may happen that some jobs repeatedly fail because of a missing or corrupted input file. This happens often for merging jobs, whose input files have only one replica; as such a file represents only little statistics, it is preferable to set it problematic instead of reproducing it. See this example: the file is corrupted, so it should be set problematic in the LFC too, not only for this particular production. The file has only one replica:
> dirac-dms-lfn-replicas /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst
Successful  : 
    /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst  : 
        IN2P3-BUFFER  :  srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/buffer/lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst
so the command to set it problematic is:
 > dirac-dms-set-problematic-files /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst -S IN2P3-BUFFER
Now processing 1 files 
Getting replicas from FC (chunks of 1000): . 
Checking with BK 
Checking with Transformation system (chunks of 100): . 

Replicas set (P) in FC for 1 files 

Replica flag removed in BK for 1 files 

1 files were set Problematic in the transformation system 
        1 files set Problematic for transformation 19169 
Execution completed in 12.83 seconds 
as it's the only replica, the Bookkeeping replica is set to No:
> dirac-bookkeeping-file-metadata /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst
FileName                                                                                             Size       GUID                                     Replica    LHCbInternal.DataQuality RunNumber 
/lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst 512931637  2016D74A-82E2-E111-A3B4-00266CF9B884     No         OK         124919    

replica is set problematic in the LFC:

 > dirac-dms-lfn-replicas -a /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst
Successful  : 
    /lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst  : 
        IN2P3-BUFFER  :  (P) srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/buffer/lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/00019167_00024559_1.BhadronCompleteEvent.dst
and in ProductionDB too.
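
When many files are involved, the "(P)" marker can be picked out of the replica listing with a few lines of parsing. The sketch below assumes the text layout shown in the listing above; the regular expression and function name are illustrative and not part of any DIRAC tool.

```python
import re

# Matches replica lines such as "        IN2P3-BUFFER  :  (P) srm://...".
# The flag in parentheses is optional; LFN lines start with "/" and do not match.
REPLICA_LINE = re.compile(
    r"^\s+(?P<se>[\w-]+)\s+:\s+(?:\((?P<flag>\S)\)\s*)?(?P<url>\S+)?")


def problematic_ses(listing):
    """Return the SE names whose replica line carries the (P) marker."""
    found = []
    for line in listing.splitlines():
        match = REPLICA_LINE.match(line)
        if match and match.group("flag") == "P":
            found.append(match.group("se"))
    return found


LFN = ("/lhcb/LHCb/Collision12/BHADRONCOMPLETEEVENT.DST/00019167/0002/"
       "00019167_00024559_1.BhadronCompleteEvent.dst")
SAMPLE = (
    "Successful  : \n"
    "    " + LFN + "  : \n"
    "        IN2P3-BUFFER  :  (P) srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data"
    "/lhcb/buffer" + LFN + "\n"
)

print(problematic_ses(SAMPLE))
```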

File in status Assigned and relative job is Done: input data missing from JDL

In a stripping19c merging production we found a rare case of a file in status Assigned whose corresponding job was Done. The problem was that the InputData field in the JDL was empty, so the job did not actually process the input file and produced an output file with no relevant data in it. In the Bkk summary there was no relation with the input file, so the output was not registered as a descendant of the input. See elog.

Solution: remove the output file with dirac-dms-remove-file, and reset the input to Unused.

Some comments and warnings

dirac-transformation-cli only updates some statuses

This tool can set only some file statuses, not all, and it gives no warning when it fails. For example, setting a file to Unused WILL NOT work:

(Cmd) setFileStatus 19770 /lhcb/LHCb/Collision12/FULL.DST/00019785/0002/00019785_00027742_1.full.dst Unused
Updated file status to Unused
but in reality the file is still in MaxReset in the production DB:
mysql> select Status, LastUpdate from TransformationFiles where FileID=(select FileID from DataFiles where LFN='/lhcb/LHCb/Collision12/FULL.DST/00019785/0002/00019785_00027742_1.full.dst') and TransformationID=19770;
+----------+---------------------+
| Status   | LastUpdate          |
+----------+---------------------+
| MaxReset | 2012-09-18 12:55:16 |
+----------+---------------------+
TO BE FIXED or at least should not print a misleading message.

the file can be reset with:

> dirac-transformation-reset-files 19770 MaxReset  --LFNs=/lhcb/LHCb/Collision12/FULL.DST/00019785/0002/00019785_00027742_1.full.dst
1 files were reset Unused in transformation 19770

No tool available to set the files status to Processed

Sometimes files are processed but their status is not correctly updated in the ProductionDB (see this example). In that case you need to update the status of 150 files to Processed. dirac-transformation-cli only allows updating one file at a time, and dirac-transformation-reset-files only resets to Unused. Maybe this script could be adapted to also update the status to Processed.

TO BE DONE

No tool to retrieve all the failed jobs for a given LFN

When a production has some files in MaxReset it is important to check why each of the jobs failed. Currently no tool provides that: dirac-transformation-debug prints only the last failed job. To get all the jobs, the ProductionDB has to be queried directly.

It would be useful to have a tool able to report the full list of jobs for a given LFN. It would be even better if the tool could also print the minor status and application status of the jobs. This would avoid opening each job in the job monitor and would save a lot of time.

-- FedericoStagni - 15-Nov-2010


Topic revision: r28 - 2018-09-23 - MarcoCattaneo
 