Week of 060821

Open Actions from last week:

Chair: H.Renshall

Gmod: M.Dimou

Smod: J.van Eldik


Log: Nothing

New Actions: Andrea to coordinate database restarts. Jan and Maarten to look at a workaround for the R-GMA producers dying.

Discussion: The Oracle cursor bug workaround of setting a cache size to zero has been found not to work; instead a mutex option should be turned off. It was agreed to restart the LFC and VOMS databases from 10.00. Gridview data collection dropped over the weekend as the R-GMA producers died. Olof and James have been looking at slow transfers from T2 sites to CERN and found that routing via HTAR is implicated. To be followed up. For CMS, most of their LCG MC data is in, but 10 TB is still to come from OSG T2s, which do not publish the information FTS needs to work automatically.



New Actions:





Discussion: R-GMA producers on CASTOR diskservers are now being restarted hourly pending a more permanent solution. Olof reported that the HTAR problem only affects incoming traffic and that he is working on it with the CS group. M.Ernst reported that grouped SRM transfers to CERN, typically of 5 files, from US T2 sites nearly always fail because one of the transfers in the group fails. P.Badino is working on this with Nebraska as a test site and found that push mode works better. Olof reminded the meeting that the supported architecture is for T2 sites to store and forward at their T1 sites. M.Ernst agreed that CMS will eventually work this way.



Actions: HRR to inform GSSDLHCB of HTAR fix.

Discussion: OB reported that HTAR was fixed between 13.00 and 14.00 today by removing the third 'interface' that was added in July. He will look at current transfer failures with a view to following up with remote site admins. For the Nebraska problems an FTS channel was set up but it fails, whereas Caltech to CERN works. The difference may be that Caltech is not firewalled. Nebraska will try transfers to other US T2s. M.Ernst reported that overnight imports went well but they are still having problems with the US. Jamie confirmed the Oracle mutex workaround is correct: it skips the bad code. Harry reported that tuning of the GSSDLHCB gLite RB was going well, e.g. the 1 TB sandbox file system has been removed from the overnight updatedb (for slocate) run.




Discussion: The srm daemon was stopped by mistake at 12.00 for 1.5 hours. An alarm is now in place. Analysis of the Nebraska problem points to a wrong configuration at their end, as srm at CERN is trying to contact a 'private' machine there. M.Ernst reported that US sites have now injected about 5 TB of data into Phedex and progress will be monitored. He appreciates the effect of the gridftp log parsing done by Olof, with problem results being sent to site admins. It would be good to automate this (as far as possible).

