Difference: ClosingProcedure (4 vs. 5)

Revision 5, 2016-10-25 - MarcoCorvo

Line: 1 to 1
 
META TOPICPARENT name="LHCbComputing"
Changed:
<
<
-- MarcoCorvo - 2016-10-07
>
>

Closing procedure for productions

 
Changed:
<
<
These steps should be done at the end of a production, that is when there are no more running jobs. The presence of Completed jobs must be treated separately
>
>
These steps should be done at the end of a production, that is, when there are no more running jobs. Some Completed jobs may still be present; these must be treated separately.
 
  • Check file status on Transformation Monitor tab in the web portal of Dirac
Changed:
<
<
>
>
  • Look for file statuses like MaxReset, Unused or Assigned (Removed or Problematic are set elsewhere)
  • Try to figure out if some of these files are recoverable
 
Changed:
<
<
Before closing an official production (Reco, Stripping, Turbo or TurCal) some checks are needed to be performed in advance. A production can show files in different statuses, the most important being:
>
>
These files can be checked using the command
 
Changed:
<
<
>
>
dirac-transformation-debug <prodID> --Status <MaxReset|Assigned|Unused> --Info <files|jobs|flush>  
 
Changed:
<
<
These files can be checked using the command
>
>

Manage files in MaxReset

A file can end up in MaxReset for several reasons, namely:

  • file is corrupted
  • applications finish with errors
  • applications are unable to download input

If a file is corrupted (e.g. when a code 16 is returned by DaVinci) there's not much we can do. Just keep the file in MaxReset to avoid picking it up in case of further reprocessing.

If the application finishes with errors (e.g. DaVinci exits with code 134), the problem could be just a glitch or a mismatch in, e.g., a conddb tag. In this case a reset can be tried with:

dirac-transformation-reset-files <prodID> --LFNs=<list of lfns> 
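The two cases above can be summarised as a small lookup. This is only a sketch: the function name `suggest_action` and the returned strings are hypothetical, and the mapping relies on the text's own examples (code 16 for a corrupted file, code 134 for a possibly transient error), not on any documented DaVinci convention.

```python
def suggest_action(exit_code):
    """Map an application exit code to a suggested action (illustrative only)."""
    if exit_code == 16:
        # Assumption taken from the text: code 16 is an example of a corrupted file.
        return "keep in MaxReset"
    if exit_code == 134:
        # Assumption taken from the text: code 134 may be a glitch or a conddb tag mismatch.
        return "try dirac-transformation-reset-files"
    # Anything else is outside what the text covers.
    return "inspect manually"


for code in (16, 134, 1):
    print(code, "->", suggest_action(code))
```

Any other exit code falls through to manual inspection, since the procedure does not say how to treat it.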

In case of multiple input files (mainly Merging or Turbo productions, but also Reconstruction ones), it can happen that only one file is corrupted or problematic. The only thing to do is to isolate the affected file and reset the others. dirac-transformation-debug should tell which file to isolate:

dirac-transformation-debug <prodID> --Status MaxReset --Info jobs 

and you can expect a result like:


[LHCbDirac v8r3p10] ~ $ dirac-transformation-debug 53934 --Status MaxReset --Info jobs
 Transformation 53934 (Idle) of type DataStripping (plugin ByRunWithFlush, GroupSize: 1) in Real Data/Reco16/Stripping26 
BKQuery: {'StartRun': 183890L, 'ConfigName': 'LHCb', 'EndRun': 184372L, 'EventType': 90000000L, 'FileType': 'RDST', 'ProcessingPass': 'Real Data/Reco16', 'Visible': 'Yes', 'DataQualityFlag': ['OK', 'UNCHECKED'], 'ConfigVersion': 'Collision16', 'DataTakingConditions': 'Beam6500GeV-VeloClosed-MagUp'}
 
3 files found with status ['MaxReset']

 1 LFNs: ['/lhcb/LHCb/Collision16/RDST/00053882/0004/00053882_00044998_1.rdst'] : Status of corresponding 6 jobs (ordered): 
142091039 142110901 142125753 142143210 142163591 142176729 
  6 jobs terminated with status: Failed; Requests done; DaVinci Exited With Status 134

 1 LFNs: ['/lhcb/LHCb/Collision16/RDST/00053882/0000/00053882_00000637_1.rdst'] : Status of corresponding 6 jobs (ordered): 
141358742 141410782 141446879 141478686 141530009 141567063 
  6 jobs terminated with status: Failed; Job stalled: pilot not running; DaVinci v41r2 step 1
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 24001. Record number within stream 1: 24001 
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 22001. Record number within stream 1: 22001 
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 20001. Record number within stream 1: 20001 
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 22001. Record number within stream 1: 22001 
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 25001. Record number within stream 1: 25001 
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 21001. Record number within stream 1: 21001 

 1 LFNs: ['/lhcb/LHCb/Collision16/RDST/00053882/0004/00053882_00041497_1.rdst'] : Status of corresponding 6 jobs (ordered): 
141924608 141937899 141963800 142001696 142022765 142069366 
  6 jobs terminated with status: Failed; Requests done; DaVinci Exited With Status 134

Summary of failures due to: Application Exited with non-zero status 
ERROR ==> /lhcb/LHCb/Collision16/RDST/00053882/0004/00053882_00044998_1.rdst was Partial (last event 15000) during processing from jobs 142091039,142110901,142125753,142143210,142163591,142176729 (sites LCG.CNAF.it,LCG.USC.es):  
ERROR ==> /lhcb/LHCb/Collision16/RDST/00053882/0004/00053882_00041497_1.rdst was Partial (last event 22000) during processing from jobs 141924608,141937899,141963800,142001696,142022765,142069366 (sites LCG.CERN.ch,LCG.RRCKI.ru):
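When many files are listed, it can help to group the LFNs by the failure message reported for their jobs. The following is a minimal sketch that assumes the output layout is exactly as in the listing above (an "N LFNs: [...]" line followed later by an "N jobs terminated with status: ..." line); the function name `group_lfns_by_failure` is hypothetical and the layout may differ between LHCbDirac versions.

```python
import ast
import re

# Patterns modelled on the sample output above; they are assumptions, not a stable format.
LFN_LINE = re.compile(r"^\s*\d+ LFNs: (\[.*?\]) : Status of corresponding")
STATUS_LINE = re.compile(r"^\s*\d+ jobs terminated with status: (.+?)\s*$")


def group_lfns_by_failure(output):
    """Return {failure message: [LFNs]} parsed from dirac-transformation-debug text."""
    groups = {}
    current = []
    for line in output.splitlines():
        match = LFN_LINE.match(line)
        if match:
            # The LFN list is printed as a Python list literal.
            current = ast.literal_eval(match.group(1))
            continue
        match = STATUS_LINE.match(line)
        if match and current:
            groups.setdefault(match.group(1), []).extend(current)
            current = []
    return groups


SAMPLE = """
 1 LFNs: ['/lhcb/LHCb/Collision16/RDST/00053882/0004/00053882_00044998_1.rdst'] : Status of corresponding 6 jobs (ordered):
142091039 142110901 142125753 142143210 142163591 142176729
  6 jobs terminated with status: Failed; Requests done; DaVinci Exited With Status 134

 1 LFNs: ['/lhcb/LHCb/Collision16/RDST/00053882/0000/00053882_00000637_1.rdst'] : Status of corresponding 6 jobs (ordered):
141358742 141410782 141446879 141478686 141530009 141567063
  6 jobs terminated with status: Failed; Job stalled: pilot not running; DaVinci v41r2 step 1
     1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 24001.
"""

for reason, lfns in group_lfns_by_failure(SAMPLE).items():
    print(reason)
    for lfn in lfns:
        print("   ", lfn)
```

Grouping by message makes it easier to separate files that failed with an application error from files whose jobs merely stalled, which may deserve different treatment.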
 
Deleted:
<
<
dirac-transformation-debug <prodID> --Status <status> --Info <files|jobs|flush>  
 
Changed:
<
<
Many are the reasons why a file can be in MaxReset or Assigned, but to check if this is not a pathological situation, one can verify with:
>
>

If the situation is not pathological, e.g. the files remain in a non-final status with no evident cause, check the following:

dirac-transformation-debug <prodID> --Status MaxReset --Info files  | dirac-production-check-descendants <prodID> 

This command line summarises the descendants, if any, of the files in MaxReset status and also suggests how to cure the situation. dirac-production-check-descendants also produces a list of affected files named CheckDescendantsResults_<prodID>.txt. This file can be used when fixing problems, e.g.:

 grep InFailover CheckDescendantsResults_53197.txt | dirac-dms-replicate-to-run-destination --SE Tier1-DST  

This command selects those files which are in Failover, instead of their final destination, and replicates them to the right SE.
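The selection step done by grep can also be sketched in Python, for instance when the list needs further filtering before being passed on. This is only an illustration: the function name `select_tagged` is hypothetical, and the example lines are made up, since the actual layout of the results file is not shown here.

```python
def select_tagged(lines, tag="InFailover"):
    """Keep only the lines that contain the given tag, like `grep <tag>`."""
    return [line.rstrip("\n") for line in lines if tag in line]


# Hypothetical lines, for illustration only; the real file layout is an assumption.
example = [
    "InFailover /lhcb/example/file_a.dst\n",
    "SomethingElse /lhcb/example/file_b.dst\n",
]
print(select_tagged(example))
```

With a real results file, the same function could be applied to the open file object, which yields one line at a time.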

Manage files in Assigned status

 
Changed:
<
<
dirac-transformation-debug <prodID> --Status <MaxReset|Assigned> --Info files  | dirac-production-check-descendants <prodID> 
>
>
A file can remain stuck in Assigned status due to internal glitches of the system. As for MaxReset files, the command to issue is:
 
Changed:
<
<
This command line provides a summary of files which are the descendants of the ones in MaxReset or Assigned status, if any, and gives also some suggestions on how to cure the situation.
>
>
dirac-transformation-debug <prodID> --Status Assigned --Info files  | dirac-production-check-descendants <prodID> 

Again the output of this command gives you some hints on how to fix problems.

Manage files in Unused status

  A file can also remain Unused basically for two reasons:
Line: 45 to 108
  where Prod2 is, for example, a Merging production and Prod1 is a Stripping one. In this case it is possible to determine whether the files are still Unused because their ancestor files were in some particular status, e.g. MaxReset.
Added:
>
>

Closing a production

If no problematic files remain that could still be fixed, one final check must be performed on the production:

dirac-production-check-descendants <prodID>
 
Deleted:
<
<
To perform the final checks on a given production the LHCbDirac command to be used is
dirac-production-check-descendants <prodID>
 For Reco, Turbo or TurCal productions, the command can complete in a few hours, but for Stripping it can take much longer. To solve this problem, these are the steps to perform when closing a production with final Merging steps (Stripping or Turbo), given that <prod1> is the Stripping/Turbo production and <prod2> the Merging one:
Line: 68 to 135
 

in case some files are not in the BK. \ No newline at end of file

Added:
>
>
If everything is OK, the final steps are to go to the Dirac web portal and mark the production "Complete", then set it to "Done" in the Production Requests tab. These actions leave the production in a "suspended" status for a week, giving the experts time to resume it or to fix other issues.

-- MarcoCorvo - 2016-10-07

 