How to Handle a Failing Storage Element

Storage Elements (SEs) are used by different DIRAC systems to accomplish their respective tasks. When a given SE is known to be faulty, each system needs to be made aware of the situation in order to take appropriate action:

  • The Failover system should not try to write/read data from a faulty SE.
  • The WorkloadManagement system should not allow jobs to start if the data is to be read from a faulty SE.
  • The Stager system should not attempt to stage files on a faulty SE.
  • The DataManagement system is a special case: basic operations on a faulty SE should still be allowed, since they will be the final test to confirm that the problem has been fixed, but bulk operations should not be attempted.
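
The four rules above can be summarised as a small policy table. The following is only an illustrative sketch: the table name, the operation labels ("basic", "bulk", and so on) and the is_allowed helper are hypothetical and do not correspond to any existing DIRAC API.

```python
# Hypothetical policy table: for each DIRAC system, the kinds of operation
# that may still be attempted on an SE that is flagged as faulty.
ALLOWED_ON_FAULTY_SE = {
    "Failover": set(),            # no reads or writes
    "WorkloadManagement": set(),  # jobs reading from the SE must not start
    "Stager": set(),              # no staging requests
    "DataManagement": {"basic"},  # basic operations only, never bulk
}


def is_allowed(system, operation_kind, se_is_faulty):
    """Return True if `system` may perform `operation_kind` on the SE."""
    if not se_is_faulty:
        # A healthy SE is not restricted by this policy.
        return True
    # Unknown systems are treated conservatively: nothing is allowed.
    return operation_kind in ALLOWED_ON_FAULTY_SE.get(system, set())
```

With this table, a single-file DataManagement operation can still be used to verify a repair, while every other access to the faulty SE is refused.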

This page should eventually present the procedure for handling this kind of situation. Currently it contains a working proposal.

The Proposal

The first point to decide is how to make the DIRAC systems aware of a problem with a given SE. Since a faulty SE affects all DIRAC instances and systems, the best option might be to use the Configuration System (CS):

  1. /Resource/StorageElements/FaultyRead|FaultyWrite=: these two options and their comments should make it possible to identify problematic SE's and the reason why they are failing. They should be manipulated using ad hoc DIRAC admin commands. The same functionality should be made available through the web portal.
  2. dirac-admin-set-faulty-SE: declares a given SE faulty, either for read or write operations, updating the comment with a timestamp, the identity of the administrator, and a reason.
  3. dirac-admin-clear-faulty-SE: declares a given SE functional, either for read or write operations. A corresponding comment line should be added.
  4. dirac-admin-get-faulty-SEs: retrieves from the CS the current list of faulty SE's, together with the corresponding comments.
  5. dirac-admin-get-faulty-SE-history: retrieves from the CS the current status and comment lines for a given SE.
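
To make the intended behaviour of the two options and the four commands more concrete, here is a minimal in-memory sketch. It is not DIRAC code and does not talk to the CS: the FaultySERegistry class, its method names and the record layout are assumptions made for illustration, and the SE name used in the usage note is a placeholder.

```python
from datetime import datetime, timezone


class FaultySERegistry:
    """In-memory stand-in for the FaultyRead/FaultyWrite options and comments."""

    ACCESS_TYPES = ("Read", "Write")

    def __init__(self):
        self.faulty = {access: set() for access in self.ACCESS_TYPES}
        self.comments = []  # one record per set/clear action, oldest first

    def _record(self, se, access, action, admin, reason):
        if access not in self.ACCESS_TYPES:
            raise ValueError("access must be 'Read' or 'Write'")
        self.comments.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "se": se, "access": access, "action": action,
            "admin": admin, "reason": reason,
        })

    def set_faulty(self, se, access, admin, reason):
        """Mirrors the proposed dirac-admin-set-faulty-SE."""
        self._record(se, access, "set", admin, reason)
        self.faulty[access].add(se)

    def clear_faulty(self, se, access, admin, reason=""):
        """Mirrors the proposed dirac-admin-clear-faulty-SE."""
        self._record(se, access, "clear", admin, reason)
        self.faulty[access].discard(se)

    def get_faulty(self):
        """Mirrors the proposed dirac-admin-get-faulty-SEs."""
        return {access: sorted(ses) for access, ses in self.faulty.items()}

    def history(self, se):
        """Mirrors the proposed dirac-admin-get-faulty-SE-history."""
        status = {access: se in self.faulty[access] for access in self.ACCESS_TYPES}
        lines = [c for c in self.comments if c["se"] == se]
        return status, lines
```

For example, calling set_faulty("EXAMPLE-SE", "Read", "some-admin", "disk server down") and later clear_faulty for the same SE leaves two comment records, so the history keeps the full trail even after the SE has been declared functional again.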

In a number of places DIRAC components should be made aware of this information and react accordingly. The following list might not be final, but it should be considered a first iteration:

  • InputData splitter, InputData optimizer, ...: we need to decide whether the list of read-faulty SE's is applied in the underlying FileCatalog getReplicas method or in the client itself. In the first case, no operation with those replicas will be possible, unless a local configuration is defined for test purposes. In the second case we gain finer-grained control, but at the cost of much more complicated logic; whether that extra control is necessary remains an open question.
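
As a rough illustration of the first option (filtering at the catalog level), the sketch below removes replicas on read-faulty SE's from a replica lookup result, with a local override for test purposes. The filter_replicas function, its parameters and the {lfn: {se_name: pfn}} layout are assumptions made for this example; they are not the actual getReplicas interface.

```python
def filter_replicas(replicas, faulty_read, local_override=frozenset()):
    """Drop replicas that live on read-faulty SE's.

    replicas:       {lfn: {se_name: pfn}} (assumed layout)
    faulty_read:    SE names flagged as faulty for reading
    local_override: SE names a local test configuration wants to keep
    Returns (usable, without_replica).
    """
    blocked = set(faulty_read) - set(local_override)
    usable = {}
    without_replica = []
    for lfn, per_se in replicas.items():
        kept = {se: pfn for se, pfn in per_se.items() if se not in blocked}
        if kept:
            usable[lfn] = kept
        else:
            # Every replica of this file sits on a faulty SE.
            without_replica.append(lfn)
    return usable, without_replica
```

Clients that call such a filtered lookup never see the faulty replicas, which is what makes this option simple; the price is that only the local override can bring them back.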

  • Stager, ...: these use the StorageElement class, so the same decision has to be taken: is the knowledge about faulty SE's applied in the base StorageElement class or in the client code? For the same reasons, I suggest the first approach.
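
A minimal sketch of the suggested approach, where the check lives in the base class so that clients such as the Stager need no logic of their own, might look like this. StorageElementSketch, SEUnavailable and the method names are hypothetical and are not the real DIRAC StorageElement interface; the faulty lists are passed in directly rather than read from the CS.

```python
class SEUnavailable(Exception):
    """Raised when an operation targets an SE flagged as faulty."""


class StorageElementSketch:
    """Toy base class that refuses operations on a faulty SE."""

    def __init__(self, name, faulty_read=(), faulty_write=()):
        self.name = name
        # In the proposal these lists would come from the CS options.
        self._faulty = {"Read": set(faulty_read), "Write": set(faulty_write)}

    def _check(self, access):
        if self.name in self._faulty[access]:
            raise SEUnavailable(f"{self.name} is flagged as Faulty{access}")

    def get_file(self, path):
        self._check("Read")
        return f"read {path} from {self.name}"  # placeholder for the real transfer

    def put_file(self, path):
        self._check("Write")
        return f"wrote {path} to {self.name}"  # placeholder for the real transfer
```

Because every access goes through _check, any client built on the class inherits the behaviour automatically, at the cost of making exceptions (such as the basic DataManagement tests) something the base class has to provide for explicitly.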


-- RicardoGraciani - 07 Apr 2009

Topic revision: r2 - 2009-04-07 - RicardoGraciani
 