-- AndreiTsaregorodtsev - 15 Mar 2009

Site Failure Recovery

This page describes the procedures to apply in case of site failures. A brief reminder of the RAW data reconstruction procedure follows.

All RAW data is stored at CERN. In addition, each Tier-1 site receives its share Di of the RAW data according to the LHCb Computing Model. RAW data reconstruction runs at all the Tier-1 sites and at CERN. Each reconstruction site receives a share Si of the reconstruction jobs that is proportional to Di, corrected for the share assigned to CERN: Si = Di * (1 - DCERN). The reconstructed rDST is stored at the site where the job ran. This rDST is then used in the Stripping step, which must run at the site where the input data is present.
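The share formula above can be illustrated with a minimal sketch. It assumes that the Tier-1 data shares Di sum to 1 and that DCERN is the fraction of reconstruction work assigned to CERN; neither assumption is stated explicitly on this page. The function name, the site labels, and the numbers are hypothetical and chosen for illustration only.

```python
def reconstruction_shares(data_shares, cern_share):
    """Compute Si = Di * (1 - DCERN) for every Tier-1 site.

    data_shares: dict mapping Tier-1 site name -> Di (assumed to sum to 1)
    cern_share:  DCERN, the fraction of reconstruction assigned to CERN (assumption)
    Returns a dict of reconstruction shares that includes a "CERN" entry.
    """
    if not 0.0 <= cern_share <= 1.0:
        raise ValueError("cern_share must be between 0 and 1")
    total = sum(data_shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("Tier-1 data shares are expected to sum to 1")

    # Scale every Tier-1 data share down by the fraction kept at CERN.
    shares = {site: d * (1.0 - cern_share) for site, d in data_shares.items()}
    shares["CERN"] = cern_share
    return shares


# Hypothetical site labels and illustrative numbers.
example = reconstruction_shares({"T1-A": 0.5, "T1-B": 0.3, "T1-C": 0.2}, 0.1)
```

Under these assumptions the resulting shares, CERN included, sum to 1, so every reconstruction job is assigned to exactly one site.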

Raw Data Reconstruction failures

If a site fails for whatever reason, it is temporarily excluded from production. The recovery then proceeds as follows:

  • If the site outage is expected to be short, for example less than one day, the site is simply removed from the Site Mask. Jobs submitted to this site stay in the Waiting state until the site is back in production. Once the site problems are corrected, production resumes in the usual way.

  • If the site is down for a longer period, its share of work should be redistributed over the other sites that have access to the RAW data. This is done by setting the site share Si to zero and redistributing it over all the other sites. Note that jobs from the failing site can only be executed at CERN, the only other site that holds a copy of its RAW data. However, some jobs that would have been executed at CERN can now be moved to other Tier-1 sites. The following steps are therefore taken:
    • The reconstruction site shares Si are updated in the Production Management system.
    • The Transformation Agent algorithm ensures that newly created jobs are distributed among the remaining sites as described above.
    • Jobs in the Waiting state for the site in question are deleted, and jobs already running there are killed. The corresponding input files are marked as Unused in the reconstruction Production.
    • As soon as the problems at the site in question are corrected, the reconstruction shares are reset to their normal-operation values and reconstruction resumes in the normal way.

  • An alternative solution is to always execute the failing site's quota of jobs at CERN. In this case the amount of reconstructed data at CERN can grow large, and an additional procedure should be envisaged to move the reconstructed data that was initially assigned to the failing Tier-1 site. This implies tagging certain rDST files for a given site so that they are moved to their final destination as soon as the site is back in production.
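The two ways of handling a long outage described in the list above can be compared with a small sketch. The page only says that the failing site's share is redistributed "over all the other sites"; the proportional rescaling used here is an assumption, not the documented Transformation Agent algorithm. The function names, site labels, and numbers are hypothetical.

```python
def redistribute_shares(shares, failed_site):
    """Set the failed site's share to zero and rescale the remaining shares.

    Assumption: the remaining sites (CERN included) keep their relative
    proportions and are rescaled so that the shares again sum to 1.
    """
    if failed_site not in shares:
        raise KeyError(failed_site)
    remaining = {s: v for s, v in shares.items() if s != failed_site}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no remaining site has a non-zero share")
    new_shares = {s: v / total for s, v in remaining.items()}
    new_shares[failed_site] = 0.0
    return new_shares


def assign_to_cern(shares, failed_site, cern="CERN"):
    """Alternative scenario: the failed site's whole quota is run at CERN."""
    if failed_site not in shares:
        raise KeyError(failed_site)
    new_shares = dict(shares)
    new_shares[cern] = new_shares.get(cern, 0.0) + new_shares[failed_site]
    new_shares[failed_site] = 0.0
    return new_shares


# Hypothetical shares during normal operation.
normal = {"CERN": 0.10, "T1-A": 0.45, "T1-B": 0.27, "T1-C": 0.18}
spread = redistribute_shares(normal, "T1-A")   # first scenario
at_cern = assign_to_cern(normal, "T1-A")       # alternative scenario
```

In the first variant the load on CERN grows only in proportion to its original share, whereas in the alternative CERN absorbs the entire quota of the failing site, which is why that scenario needs the extra step of moving rDST files back afterwards. Restoring normal operation amounts to reapplying the original share table.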

Both alternative scenarios above can be considered. Comments are welcome.

It is important to note that site failures affect not only the number of jobs executed but also the amount of data stored at the sites. If the site outages are relatively short (a matter of days), the data shares will not be drastically affected. However, if a site fails for a longer period of time, this should be taken into account in the Data Distribution procedure. In particular, some of the reconstructed data can be moved to the recovered site to allow more Stripping and Analysis jobs to run there and to fully exploit the site storage.

Topic revision: r4 - 2009-04-01 - AndreiTsaregorodtsev