Difference: ProductionJobFinalization (1 vs. 2)

Revision 2 - 2008-10-16 - AndreiTsaregorodtsev

Line: 1 to 1
 
META TOPICPARENT name="ProductionProcedures"
-- AndreiTsaregorodtsev - 16 Oct 2008
Line: 36 to 36
  It should be pointed out that this event is very unlikely, so cleanup of the Bookkeeping records should rarely, if ever, be necessary. However, the possible consequences still have to be considered.

Changed:
<
<
5. Request to upload OutputSandbox std.out/err files to the Log Storage Element is created
>
>
5. Request to upload OutputSandbox std.out/err files to the Log Storage Element is created. REMARK: this is a special request for uploading the full std.out/std.err files from the DIRAC OutputSandbox to the Log Storage Element. It does not replace the standard DIRAC Job Wrapper operation of the Output Sandbox upload.
  6. The combined request is written into a file to be picked up by the Job Wrapper

Changed:
<
<

Case 2, Application failed

>
>

Case 2. Application failed

 
  1. Log files are uploaded to the Log Storage Element. If the upload fails, the log files are tarred and put into the Failover system. This is the first operation in order to ensure logs availability even in case of a crash in the subsequent steps.
  2. Input files status is updated in the Production DB Service as "Unused" except for files marked as "ApplicationCrash" by the application log analysis module. If the update fails, the corresponding failover request is created.

Revision 1 - 2008-10-16 - AndreiTsaregorodtsev

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="ProductionProcedures"
-- AndreiTsaregorodtsev - 16 Oct 2008

Production Job Finalization procedure

This is a proposed procedure which is simpler than the current one coded in the JobFinalization module. It simplifies the treatment of any resulting job requests in the RequestManagement system, removing the need to define a specific order of request execution. Otherwise, the result of the procedure is the same.

There are 2 distinct cases:

  • application part finished successfully;
  • application part failed

Case 1. Application successful

1. Log files are uploaded to the Log Storage Element. If the upload fails, the log files are tarred and put into the Failover system. This is the first operation in order to ensure logs availability even in case of a crash in the subsequent steps.

2. Input files status is updated in the Production DB Service. If the update fails, the corresponding failover request is created.

3. Bookkeeping records are sent to the Bookkeeping service. Currently, the information is sent to both the New and Old Bookkeeping services. If sending any of the bookkeeping records fails, a corresponding failover request is created.

4. Output data upload. For each output file the destination Storage Elements are resolved according to the job workflow parameters. The upload to the specified destination is attempted with registration in all the DIRAC catalogs (LFC, BookkeepingDB (replica flag), ProductionDB). If the upload fails, the file is uploaded to one of the FAILOVER storages and the corresponding failover request is created.

If several destinations are specified for a given file, they are attempted in turn until the first successful upload. In this case replication requests are created to copy the file to other specified destinations. If all the specified destinations fail, the file is uploaded to one of the FAILOVER storages and the corresponding failover replication requests are created. REMARK: specifying several destinations for production job output data should be limited (if used at all) to some possible special cases. In general, the post-production centralized automated data distribution should be used.

The upload to the FAILOVER storage is only registered in the LFC catalog.
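The upload logic of step 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the JobFinalization code: the function name `plan_output_upload`, the `try_upload` callback and the request dictionaries are all hypothetical, and only the catalog names come from the text above.

```python
from typing import Callable, Dict, List

# Catalog names are taken from the procedure text; everything else is hypothetical.
FULL_CATALOGS = ["LFC", "BookkeepingDB", "ProductionDB"]


def plan_output_upload(
    lfn: str,
    destinations: List[str],
    failover_ses: List[str],
    try_upload: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Try the destinations in turn, then the failover storages."""
    for se in destinations:
        if try_upload(lfn, se):
            # First success wins; the other destinations get replication requests.
            requests = [
                {"type": "replicate", "lfn": lfn, "source": se, "target": other}
                for other in destinations
                if other != se
            ]
            return {"uploaded_to": se, "failover": False,
                    "catalogs": list(FULL_CATALOGS), "requests": requests}

    for se in failover_ses:
        if try_upload(lfn, se):
            # Failover copy: registered in the LFC only, replicated to every destination.
            requests = [
                {"type": "replicate", "lfn": lfn, "source": se, "target": dest}
                for dest in destinations
            ]
            return {"uploaded_to": se, "failover": True,
                    "catalogs": ["LFC"], "requests": requests}

    # Neither a destination nor a failover storage accepted the file.
    return {"uploaded_to": None, "failover": False, "catalogs": [], "requests": []}
```

In this sketch a result with `uploaded_to` set to `None` corresponds to the case where neither upload is successful.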

If for at least one output data file neither upload (destination or failover) is successful:

  • the job is declared Failed;
  • all previously defined replication requests are dropped, except for the log files failover request, if any;
  • data removal requests are created for already uploaded files;
  • the input data files are set to "Unused" in the ProductionDB;
  • the already sent bookkeeping records are invalidated (this procedure is still to be defined and requires a suitable Bookkeeping Service interface).

It should be pointed out that this event is very unlikely, so cleanup of the Bookkeeping records should rarely, if ever, be necessary. However, the possible consequences still have to be considered.
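The cleanup actions listed above for a completely failed upload could be collected as in the sketch below. The function `plan_failure_cleanup`, the request dictionaries and the `log_files` flag are hypothetical names introduced for illustration. The bookkeeping invalidation is represented only by a flag, since that procedure is still to be defined.

```python
from typing import Dict, List


def plan_failure_cleanup(
    uploaded_lfns: List[str],
    input_lfns: List[str],
    pending_requests: List[Dict[str, object]],
) -> Dict[str, object]:
    """Collect the cleanup actions when an output file could not be uploaded anywhere."""
    # Drop replication requests, but keep the log files failover request (if any)
    # and any request that is not a replication.
    kept = [
        req for req in pending_requests
        if req["type"] != "replicate" or req.get("log_files", False)
    ]
    # Files that were already uploaded must be removed again.
    removals = [{"type": "remove", "lfn": lfn} for lfn in uploaded_lfns]
    return {
        "job_status": "Failed",
        "requests": kept + removals,
        "input_status": {lfn: "Unused" for lfn in input_lfns},
        # Placeholder only: the invalidation interface is not defined yet.
        "invalidate_bookkeeping": True,
    }
```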

5. Request to upload OutputSandbox std.out/err files to the Log Storage Element is created

6. The combined request is written into a file to be picked up by the Job Wrapper
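As an illustration of step 6, the sketch below writes all accumulated requests into a single file. The JSON format, the function name `write_combined_request` and the file layout are assumptions made for this example; the procedure does not specify how the combined request is serialized.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_combined_request(requests: List[Dict[str, object]], path: str) -> None:
    """Write all pending requests into one file (hypothetical JSON layout)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename, so a reader never
    # sees a partially written request file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"requests": requests}, handle, indent=2)
    os.replace(tmp_path, path)
```

Writing to a temporary file and renaming it is a design choice of this sketch, so that whoever picks up the file only ever finds a complete request.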

Case 2, Application failed

  1. Log files are uploaded to the Log Storage Element. If the upload fails, the log files are tarred and put into the Failover system. This is the first operation in order to ensure logs availability even in case of a crash in the subsequent steps.
  2. Input files status is updated in the Production DB Service as "Unused" except for files marked as "ApplicationCrash" by the application log analysis module. If the update fails, the corresponding failover request is created.
  3. Request to upload OutputSandbox std.out/err files to the Log Storage Element is created
  4. The combined request is written into a file to be picked up by the Job Wrapper
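The input file handling of step 2 in the failed-application case might look like the sketch below. The function `update_inputs_after_failure`, the `send_update` callback and the request dictionary are hypothetical; only the "Unused" and "ApplicationCrash" status names come from the text.

```python
from typing import Callable, Dict, Iterable, List, Set


def update_inputs_after_failure(
    input_lfns: Iterable[str],
    crashed_lfns: Set[str],
    send_update: Callable[[Dict[str, str]], bool],
) -> List[Dict[str, object]]:
    """Reset inputs to "Unused" and return failover requests if the update fails."""
    # Files marked "ApplicationCrash" by the log analysis are left untouched.
    statuses = {lfn: "Unused" for lfn in input_lfns if lfn not in crashed_lfns}
    if not statuses or send_update(statuses):
        return []
    # The direct update failed: defer it through a failover request.
    return [{"type": "set_file_status", "statuses": statuses}]
```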
 