Open Actions from last week: Costin and Miguel to try to understand cyclic nature of ALICE FTS transfer rates.

Chair: Jamie


Smod: Ignacio



The sudden drop in the ALICE transfers to/from CERN yesterday evening at around 8pm was due to a problem with the ALICE CASTOR2 instance: a new and unrecoverable error from the LSF API caused all requests to fail. The only possible remedy was a restart of the LSF master, which I did at ~09:30 this morning. We're looking into how to protect against this problem in the future. Olof

CMS report: Over the last few days we focused on improving Tier-1 to Tier-2 transfers and achieved some remarkable and very satisfactory results. We have seen rates from some Tier-1s (e.g. FZK to DESY) of up to 200 MB/s sustained over multiple hours, and were able to replicate datasets simultaneously from a particular Tier-1 (i.e. PIC) to 18 different Tier-2s almost error-free.

These transfer tests, which were carried out systematically (datasets were replicated from any Tier-1 to all Tier-2s participating in CSA06), have shown that the CMS data model is viable with regard to the strategy of allowing any CMS Tier-2 to request data from any CMS Tier-1.

Regarding job execution, we succeeded in getting more sites involved in the CSA06 analysis processing activities, and the total job volume is now approaching 30k jobs/day. The efficiency (grid job submission + application) is above the goal of 90%. Michael

ATLAS: many source SRM transfer failures - to be followed up.

New Actions:



Log: Same CASTOR/LSF problem occurred in the evening, killing ALICE transfers again. Being followed up by IT-FIO. FTS/DB problems to be followed up (GD/PSS). ATLAS transfers out of CERN failing - follow up with Miguel Branco/Zhongliang Ren.

New Actions: See above.




Massive failures of ATLAS transfers mostly resolved as user error (source file doesn't exist). ATLAS have been informed.

Non-scheduled intervention noted for GridView (it's marked on the page). This should be scheduled according to the agreed WLCG procedure.

ALICE: Good data rates overnight to all sites except CNAF. Big increase in data rate at 01:00 UTC; the reason is not understood.

CMS: Transfers to CNAF basically down. Will be followed up with Luca.

ATLAS: follow up on massive failures (still on the source side) to BNL and GridKA.

Hi all, on November 7th we have planned a downtime for power maintenance at the INFN-T1 site.

It will start at 9:00 am (CET) and finish at 15:00 (CET).

For some services (FTS, LFC for ATLAS and ALICE, and storage resources) the downtime will be shorter (a few minutes).

Regards, Andrea on behalf of INFN-T1 staff


CNAF deployed a more reliable LFC server (redundant, DNS load-balanced). The new server name is lfc.cr.cnaf.infn.it. This should be changed in ToA as soon as possible to benefit from the new service. At least my cache contains the old name. CNAF people are in cc: for questions or comments.


New Actions:




The big increase in ALICE data rates seen yesterday correlates with a big increase at one site for this VO (IN2P3).

FTS service: The db lock-up problem..

We're trying solution 2 for now:

PSS wiki page

New Actions:



Log: High load problems on LCG RAC:

The problem on the Oracle LCG RAC is solved. The cause is not yet understood, but it originated on lcgr3, which is used mainly by lcg_fts_w and lcg_lfc, and also by lcg_same, lcg_voms and lcg_shiva. Several logs were taken, and we will open a service request with Oracle to better understand the problem.

If you believe your application changed this morning (some patch deployment or manual access with heavy queries), please report it to us so that it will be easier to identify the problem with Oracle.

At ~13:17 the c2alice scheduler started to report problems again.

Our freshly configured metric was sampled at 13:31, and found 521 error messages. One minute later the actuator restarted LSF, and the scheduler problems stopped. So the full recovery chain works.
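
For illustration only, the metric/actuator chain described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual monitoring configuration: the error marker, the threshold value, and the names count_errors and recovery_step are all hypothetical, and the restart action is passed in as a callable because the real restart command is not given in these notes.

```python
from typing import Callable, Iterable

ERROR_MARKER = "ERROR"    # hypothetical marker; the real log format is not given
ERROR_THRESHOLD = 100     # hypothetical; the notes only report 521 errors in one sample


def count_errors(log_lines: Iterable[str], marker: str = ERROR_MARKER) -> int:
    """Metric: count the log lines that contain the error marker."""
    return sum(1 for line in log_lines if marker in line)


def recovery_step(
    log_lines: Iterable[str],
    restart: Callable[[], None],
    threshold: int = ERROR_THRESHOLD,
) -> bool:
    """Sample the metric once and fire the actuator if it exceeds the threshold.

    Returns True if the restart action was triggered, False otherwise.
    """
    errors = count_errors(log_lines)
    if errors > threshold:
        restart()  # stands in for the real LSF restart, which is not shown here
        return True
    return False


if __name__ == "__main__":
    # A sample with 521 error lines, matching the count reported in the log.
    sample = ["ERROR scheduler request failed"] * 521
    fired = recovery_step(sample, restart=lambda: print("restarting LSF (stub)"))
    print("actuator fired:", fired)
```

Keeping the restart action as a parameter lets the chain be exercised with a stub before it is pointed at the real service.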

New Actions:


