-- StuartPaterson - 15 Oct 2008

Production Procedures (Achieving 100% Processing Efficiency)

The motivation for this page comes from current experience of running data processing productions. A reasonably high percentage of processed files is achieved quickly and automatically, but pushing that percentage as high as possible requires considerable human intervention. Several tools exist in DIRAC to interrogate the relevant systems and services; we must ensure that all the necessary interfaces are available and combine them to arrive at 100% file processing efficiency.

For the MC case some of the steps below are not required, so for clarity this page concentrates on data processing productions. The broad theme is that a data processing production has several phases, e.g. production creation and commissioning, resolving the inputs into a final sane data sample, and finally integrity checking the production outputs.

This document is structured around the proposed production status fields, presented in the context of each step. To ease readability:

  • Production statuses are defined in bold
  • Production DB statuses are defined in italics.

Pre-Production and Starting a Production (status New and Testing)

Productions can proceed immediately from New -> Active if a test production with a reduced number of events has already been fully validated using the procedures described below.

Explicitly introducing the Testing phase adds a delay of at least the length of one job, but it does allow Grid shifters to participate in the commissioning phase.

The conditions for which the Production Manager may choose to set a production to the Testing status are:

  • Running a new production using an already proven workflow where no problems are expected
  • In order to share the load of commissioning new workflows with the Grid shifters.

In both cases it is necessary to formulate a check to validate individual production job outputs as a precursor to ramping up.

In summary, the transition Testing -> Active should mean that a validation of the outputs of given test jobs, or of a job from the final production, has been performed (as described below). The production job ID that was checked should be logged as part of the production history. The validation tool should be sufficiently robust to be used by the Grid shifters and exposed through the CLI / web interface in the future.
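
The status transitions described on this page could be tabulated as a simple state machine. The sketch below is purely illustrative Python: the transition table is assembled from the prose on this page, and the names are not an existing DIRAC interface (Failed and Deleted in particular are still proposals).

```python
# Allowed production status transitions, as described on this page.
# Illustrative only: the real Production Management system may model
# these differently.
ALLOWED_TRANSITIONS = {
    "New": {"Testing", "Active"},        # straight to Active only for proven workflows
    "Testing": {"Active"},               # after validating test job outputs
    "Active": {"VerifyingInputs", "ValidatingOutputs"},
    "VerifyingInputs": {"Active"},       # loop until the input sample is sane
    "ValidatingOutputs": {"Active", "Done", "Failed"},
    "Done": {"Deleted"},
}

def check_transition(current, proposed):
    """Return True if the proposed production status change is permitted."""
    return proposed in ALLOWED_TRANSITIONS.get(current, set())
```

Such a table would make it straightforward to reject bogus status changes in the CLI / web interface while the procedures below are still performed by hand.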

Rather than performing an integrity check on the input data of a given production, it seems preferable to enter the files into the Production DB and let the job creation (or lack thereof) reveal the extent of any problems with the sample. Subsequent updates from Data Management components will then ensure arrival at a final data sample (see VerifyingInputs below).

Checking Inputs During Execution (status Active <-> VerifyingInputs)

If everything proceeds without problems the Active <-> VerifyingInputs loop may not be required; normally, however, this will not be the case.

The production status VerifyingInputs is defined as a state where not all files are processed but nothing is happening: no new jobs are being created, and without some manual intervention the production is stuck. This means that some, or potentially all, of the following are true:

  • Some files have exceeded the maximum number of processing attempts
  • Some files in the Production DB are marked as Problematic
  • Some files could not be used to create jobs due to BK inconsistencies or ancestor file problems
  • Some files may not exist in the LFC
  • This list may not be exhaustive...

Procedures are required to deal with all of the above problems with production inputs, and all of them require human intervention (at least initially; see below for more information). In some cases, e.g. when files are not recoverable, this can result in marking files as Invalid and excluding them from the sample; this would require a full post-mortem to understand the causes. With sufficient automation these operations could be performed by the Grid shifters.
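
One way to make that intervention more systematic is to first triage the stuck files by the category blocking them, so the right recovery procedure can be applied to each group. The record fields ("status", "attempts", "inLFC") and status names below are assumptions for illustration, not the real Production DB schema:

```python
# Triage stuck input files into the problem categories listed above.
MAX_ATTEMPTS = 2  # suggested processing-attempt limit (see below)

def classify_stuck_files(files):
    """Group unprocessed file records by the problem blocking them."""
    stuck = {"MaxAttempts": [], "Problematic": [], "MissingFromLFC": [], "Other": []}
    for f in files:
        if f["status"] == "Problematic":
            stuck["Problematic"].append(f["lfn"])
        elif f.get("attempts", 0) >= MAX_ATTEMPTS:
            stuck["MaxAttempts"].append(f["lfn"])
        elif not f.get("inLFC", True):
            stuck["MissingFromLFC"].append(f["lfn"])
        else:
            stuck["Other"].append(f["lfn"])
    return stuck
```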

In order to proceed quickly to the VerifyingInputs state without wasting compute resources, the suggestion is to set the maximum number of processing attempts to 2. So as not to suffer from transient problems, an interface to reset the file status under certain conditions is required. One use-case is backend storage instability: it would be advantageous to reset the files of a given production that are in the maximum-attempts category at a given site.
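
A reset interface of the kind described could look like the sketch below. The file record layout and the status names are invented for illustration; only the behaviour (per-site reset of maximum-attempt files) comes from the text.

```python
MAX_ATTEMPTS = 2  # suggested maximum number of processing attempts

def reset_site_files(files, site):
    """Reset files stuck at the maximum attempt count at one site, e.g. after
    a transient backend storage problem there has been fixed, so that new
    jobs can be created for them.  The record fields are illustrative."""
    reset = []
    for f in files:
        if f["site"] == site and f["attempts"] >= MAX_ATTEMPTS:
            f["attempts"] = 0
            f["status"] = "Unused"   # eligible for job creation again
            reset.append(f["lfn"])
    return reset
```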

Checking Production Outputs (status ValidatingOutputs)

Making the ValidatingOutputs production state explicit facilitates the eventual automation of post-production integrity checking. This state also makes it possible to distinguish this phase from the possible looping between Active and VerifyingInputs described above. As the last act before declaring the success or failure of a production request, the ValidatingOutputs status signals the end of recovering any input files that were declared Invalid and moves the production into an output checking phase.

The ValidatingOutputs state must be arrived at only after the sample of processable files has been treated as in the VerifyingInputs step above. This means that 100% of the (possibly reduced) input data sample is Processed in the Production DB, and the production outputs will be examined as described below. This also implies that a subset of the original sample may be marked as Invalid for the production; such files are not counted in the processing efficiency. Any files that were not processed from the original sample should be retrievable for a given production and reported to the Integrity DB for a post-mortem investigation (with intervention from the Data Manager if required).

The ValidatingOutputs state is therefore reached when all production jobs have no pending requests and all of the (potentially reduced) sample of data has been processed. At this point the Production DB file statuses should be either Processed or Invalid. It should be possible for shifters to obtain any errors for pending requests from the Logger, or directly from the Request Management system, for a given production.

Entering the ValidatingOutputs state should be triggered by hand. It means that there is nothing more the Production Management system can do to advance the production, and it signals the start of post-production integrity checking for the successful jobs. A procedure should be defined to perform asynchronous integrity checking of the outputs of a given production; eventually this could be automated as an agent that polls for productions in the ValidatingOutputs state.
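
One polling cycle of such an agent could be sketched as follows. The three callables stand in for Production DB / Data Management interfaces that are not defined on this page; only the flow (check outputs, close the production or send it back into the loop) reflects the text.

```python
def run_validation_cycle(get_productions, check_outputs, set_status):
    """One polling cycle of a hypothetical output-checking agent: for each
    production in ValidatingOutputs, run the integrity checks and either
    close the production or return it to the Active <-> VerifyingInputs
    loop so that further jobs can be created."""
    outcome = {}
    for prod_id in get_productions("ValidatingOutputs"):
        ok, errors = check_outputs(prod_id)
        new_status = "Done" if ok else "Active"  # Corrupted outputs -> more jobs
        set_status(prod_id, new_status)
        outcome[prod_id] = new_status
    return outcome
```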

If a proportion of the produced files are declared Corrupted by the Data Management system (e.g. some data management checks fail after a file is produced), it may be necessary to re-enter the Active <-> VerifyingInputs loop at this stage and create further jobs. In order not to repeat checks on files, it may be advisable to update statuses in the Production DB during the ValidatingOutputs phase, e.g. Processed -> ProcessedAndChecked (or something shorter such as Successful) for the given production.

The production can be locked only at the point where no remaining Corrupted produced files are reported by the data management checks. Where corrupted files repeatedly arise from problematic input files, those inputs should be identified during the VerifyingInputs phase above. This highlights the need to report recovery operations to the data logging service.

Final state (status Done)

After ValidatingOutputs there can be problematic data that is irrecoverable. In this case it should be possible to clean up after a job and make a Processed -> Invalid transition in the Production DB. The final status transition of a production should be ValidatingOutputs -> Done, but in principle there could be a threshold for defining a Failed state, e.g. if more than 20% of the final produced files are problematic (is this a use-case?).
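
If such a threshold were adopted, the final-status decision would reduce to a one-line rule. The 20% figure below is the example value floated (not yet agreed) in the text:

```python
FAILURE_THRESHOLD = 0.20  # example 20% limit from the text; not yet agreed

def final_production_status(n_produced, n_problematic):
    """Decide the final production status from the fraction of problematic
    produced files.  A sketch of the proposed threshold rule only."""
    if n_produced == 0:
        return "Failed"
    fraction = n_problematic / n_produced
    return "Failed" if fraction > FAILURE_THRESHOLD else "Done"
```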

Deleting productions (status Deleted)

There is a use-case for deleting entire productions from the BK, LFC, Production DB and physical storage, but currently this is performed by hand. It would be advantageous to have an agent perform this task once the production enters the Deleted state, dropping the production from the Production Management system only when all output files and WMS jobs have been removed.
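
The ordering constraint (drop the production last, only after everything else succeeded) could be sketched like this. All the callables are placeholders for services not defined on this page:

```python
def delete_production(prod_id, remove_replicas, remove_lfc, remove_bk,
                      remove_wms_jobs, drop_production):
    """Sketch of a deletion agent's ordering: remove physical replicas and
    catalogue entries first, then WMS jobs, and drop the production from
    the Production Management system only once everything else succeeded."""
    for step in (remove_replicas, remove_lfc, remove_bk, remove_wms_jobs):
        if not step(prod_id):
            return False  # leave the production in Deleted for a later retry
    drop_production(prod_id)
    return True
```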

General and outstanding points regarding the above

  • With testing or other operations occurring during the production status transitions above, it may be necessary to enforce the recording of production logging information
  • For the proposed Testing phase it may be advisable to restrict the number of submittable jobs
  • Procedures must be defined for data that is deemed Invalid at the VerifyingInputs phase
  • If it is possible that produced data is problematic, it may be necessary to trigger a ValidatingOutputs -> Active transition or the complete removal of a production
  • If the above is agreed the production management system will have to keep track of the original data sample and the 'validated' data sample (also the Invalid files)
  • When files are marked as Invalid in the Production DB (e.g. not recoverable by data management procedures) they should not be considered in the processing efficiency but should still be retrievable
  • This list may not be exhaustive...

Notes on Verification Of Production Input Files, Individual Production Jobs and Production Outputs

This section is under construction and will be filled in by the data manager.

How to validate the outputs of a given production job

There should be an automatic procedure for validating a given production job based on the Production ID / Production Job ID or the equivalent WMS job ID. This tool should check a successful job of a production in the Testing status and return either OK, allowing the production to enter the Active status, or ERROR, describing the problems. The following should be confirmed by this tool:

  • Correct input file status in the Production DB (if applicable)
  • Correct Production ID / Production Job ID status in the Production DB
  • Presence of output files in the LFC
  • Presence of output files in the Bookkeeping
  • Integrity of output files on the storage
  • This list may not be exhaustive...
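
The checklist above maps naturally onto a table of named checks run against one job. The sketch below is an assumption about how such a tool might be shaped; the check functions are placeholders for the actual Production DB, LFC, Bookkeeping and storage queries:

```python
def validate_production_job(job, checks):
    """Run the checks listed above against one production job and return
    ('OK', []) or ('ERROR', [failed check names]).  Each entry in `checks`
    maps a check name to a predicate over the job record."""
    failed = [name for name, check in checks.items() if not check(job)]
    return ("ERROR", failed) if failed else ("OK", [])
```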

How to perform integrity checks on production inputs

This section should outline how to construct the procedures for remedying the problems described in the VerifyingInputs stage.

How to perform integrity checks on production outputs

An integrity suite is currently being prepared in the Data Management system. Other points to check include, for example, whether the transformation has done what it should have done for this production.

Topic revision: r6 - 2008-10-21 - StuartPaterson