Problem reported by CMS (Daniele Bonacorsi, Giuseppe Lore) while transferring files from Pisa with dCache to CNAF with CASTOR.

CASTOR (version 2.1.0 or 2.1.1) does not gracefully handle a failure that occurs while a file is being copied to CASTOR. The failure can be caused by a broken SRM 'put' cycle in which the setFileStatus("Done") call, or even the transfer itself, is missing. The setFileStatus("Done") call (equivalent to putDone) often fails because it is associated with an LSF job and the client times out before the job runs. Furthermore, there is no timeout on put requests. As a result, the system returns "Device Busy" for any put request on a file for which a successful "Done" has not been executed.

For files for which a successful "Done" has not been executed, the name server holds an entry with size 0. Because of the broken put cycle or the missing transfer, the stager queues further requests for that file, waiting for the first operation to finish, until the system becomes overloaded. Further requests on the same file fail.
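The failure mode described above can be illustrated with a toy in-memory model. This is only a sketch and not CASTOR code: the class ToyStager, its methods and the file path are hypothetical names chosen for illustration, and the model assumes that a put with no matching "Done" blocks the path indefinitely because put requests have no timeout.

```python
class ToyStager:
    """Toy in-memory model of the put cycle (hypothetical, not CASTOR code)."""

    def __init__(self):
        self.sizes = {}         # stands in for the name server: path -> size in bytes
        self.open_puts = set()  # paths with a put not yet closed by "Done"

    def put(self, path):
        if path in self.open_puts:
            # Put requests never time out, so the path stays blocked.
            raise OSError("Device Busy")
        self.open_puts.add(path)
        self.sizes[path] = 0    # entry is created (or truncated) with size 0

    def transfer(self, path, nbytes):
        self.sizes[path] = nbytes

    def put_done(self, path):
        self.open_puts.discard(path)


stager = ToyStager()
path = "/castor/example/file1"  # hypothetical path
stager.put(path)
# The transfer and put_done never happen: the cycle is broken.
try:
    stager.put(path)
    second_put = "accepted"
except OSError as exc:
    second_put = str(exc)
```

In this model the second put is rejected and the entry for the file keeps size 0, which matches the symptoms reported above.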

At CERN, other "odd" behaviours have been observed in which the first 'put' cycle is OK and the file is migrated to tape, but is soon followed by another 'put' cycle on the same file. The latter cycle is broken, usually before the transfer, but the file size is reset to zero because a 'put' on an existing file truncates it. Quite a few such cases were found for ATLAS: many transfers were taking place from different T2 sites all over Europe to CERN, and sometimes the same file was 'put' from different sites with a delay of up to 2-3 days in between.

The problem can be cured by removing the failed zero-size file entry from the name server and trying the operation again.
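The workaround can be sketched as follows. The function retry_after_cleanup and the three callables it receives (get_size, remove_entry, do_put) are hypothetical placeholders for whatever name server and transfer tools a site uses; they are not CASTOR APIs. The sketch assumes the caller has already established that no transfer is in progress for the file, since a put that is still running would presumably also show size 0.

```python
def retry_after_cleanup(path, get_size, remove_entry, do_put):
    """Remove a stale zero-size entry for path, then retry the put.

    get_size(path) returns the size in bytes, or None if there is no entry.
    remove_entry(path) and do_put(path) are hypothetical caller-supplied hooks.
    Returns True if a zero-size entry was removed before the retry.
    """
    removed = False
    if get_size(path) == 0:
        remove_entry(path)
        removed = True
    do_put(path)
    return removed


# Usage with in-memory stand-ins for the name server and the transfer tool.
entries = {"/castor/example/file1": 0}  # hypothetical path with a stale entry
attempted = []
removed = retry_after_cleanup(
    "/castor/example/file1",
    get_size=entries.get,
    remove_entry=lambda p: entries.pop(p),
    do_put=attempted.append,
)
```

Passing the operations in as callables keeps the sketch independent of any particular command-line tool or client library.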

Note: With SRM v2, requests have a lifetime, so the file is released/removed from the storage name space when the request expires. With the new release of CASTOR (2.1.3), PutDone operations are no longer scheduled.
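The effect of a request lifetime can be sketched with the same kind of toy model. The function expire_requests and its data structures are hypothetical and do not reflect how an SRM v2 implementation stores requests; the sketch only assumes that each open put carries an expiry time and that its name space entry is removed once that time has passed.

```python
def expire_requests(open_puts, sizes, now):
    """Drop put requests whose lifetime has ended and remove their entries.

    open_puts maps path -> expiry time in seconds; sizes maps path -> size in bytes.
    Returns the sorted list of paths that expired.
    """
    expired = sorted(p for p, deadline in open_puts.items() if deadline <= now)
    for path in expired:
        del open_puts[path]
        sizes.pop(path, None)  # release the entry from the name space
    return expired


# Two open puts with different expiry times (hypothetical values).
open_puts = {"/castor/example/a": 100.0, "/castor/example/b": 500.0}
sizes = {"/castor/example/a": 0, "/castor/example/b": 0}
expired = expire_requests(open_puts, sizes, now=200.0)
```

After the call, only the request that has not yet expired remains, so a new put on the expired path would no longer be blocked.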

21 Jun 2007

