LCG Management Board



Tuesday 9 June 2009 16:00-18:00 – F2F Meeting







(Version 1 – 14.6.2009)



A.Aimar (notes), I.Bird (chair), K.Bos, D.Britton, F.Carminati, T.Cass, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, M.Kasemann, P.Mato, P.McBride, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova



A.Di Girolamo, P.Mendez Lorenzo, R.Santinelli


Action List


Mailing List Archive


Next Meeting

Tuesday 16 June 2009 16:00-17:00 – Phone Meeting


1.   Minutes and Matters arising (Minutes)



1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.



2.   Action List Review (List of actions)


  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.

No progress for CMS.

NL-T1 had reported that ALICE replied positively with a few comments; NL-T1 only has to implement some minor changes.
NDGF is waiting for feedback from ALICE. F.Carminati added that the SLA still needs to be approved by ALICE.

  • M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.

Done later in this meeting.

  • 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

Started, but not yet completed. L.Dell’Agnello added that the discussion is still on-going between CNAF and the Security team. He will check for feedback from the Security Team.

  • A.Aimar schedules a presentation at the F2F in June. Experiments will explain their VO SAM tests and the tests that are reported in their dashboards.


Done later in this meeting.

  • Tier-1 Site should start publishing the UserDN information.

J.Gordon will regularly update the list of Sites that publish the information. He asked the portal developers to provide an anonymised list.

F.Hernandez commented that FR-IN2P3 has asked the French authorities and is waiting for a reply; nothing more can be done until they respond.

  • Status of HEPSPEC benchmarking at each Tier-1 Site.

Round Table:

-       RAL: has the licences and has started some benchmarking.

-       CNAF: started the collection of data.

-       IN2P3: bought a licence for all French Sites; some Tier-2 Sites are performing the benchmarking. The Tier-1 Site will do it after STEP.

-       BNL: Carried out benchmarks and will send the information.

-       NDGF: received the licences but has not yet run the benchmarks.

-       PIC: has run the benchmarks on the CPUs and will send the results.

-       ASGC: Started benchmarking.

-       FZK: M.Alef, who maintains the benchmarking wiki, is at FZK.

-       FNAL: Benchmarking is in progress.


A.Aimar noted that the instructions for posting the results are on the wiki page; alternatively, Sites can send the data to M.Alef, who maintains the content of the wiki.


  • 9 Jun 2009 - Sites should report to the MB whether now, after the GDB presentations, the situation of the data rates is clear.


F.Hernandez asked for the information from ALICE.

ALICE has not yet presented its data rates.


  • Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar


Not completed: 5 Sites have done it. Not all information has to be provided at the beginning; Sites can provide what they have at the moment.

  • M.Schulz should report on the gLExec status and next steps.

Will be presented at the GDB the following day.

M.Schulz summarized it for the MB. Passing the environment through gLExec is not a viable solution. The gLExec developers now provide a script, similar to the one used by ATLAS. It has been submitted as a patch and will be distributed with gLExec.
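The wrapper approach can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the actual gLExec or ATLAS script: gLExec does not pass the caller's environment, so the pilot writes the variables the payload needs to a file, and the payload re-sources that file after the identity switch. All variable and file names here are invented for the example.

```shell
# Illustrative sketch only (not the real gLExec/ATLAS script): save the
# pilot's environment to a file and restore it on the payload side.
set -e
unset PILOT_QUEUE                    # make the demo deterministic
env_file=$(mktemp)

# Pilot side: record the variables the payload will need.
{
  echo "export PILOT_HOME='$HOME'"
  echo "export PILOT_QUEUE='${PILOT_QUEUE:-default}'"
} > "$env_file"
chmod 644 "$env_file"                # must be readable by the target identity

# Payload side: gLExec would switch identity and run the payload here;
# a plain sh stands in for it. The payload re-sources the saved file.
out=$(sh -c ". '$env_file' && echo \"queue=\$PILOT_QUEUE\"")
echo "$out"
rm -f "$env_file"
```

The key point the minutes make is that this serialization step has to be explicit, since the environment is deliberately stripped at the identity switch.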


  • Experiments should send to J.Gordon the DNs of the people that can read the details of the users in the CESGA Portal.

Input was received only from LHCb. They are trying the proposed solution.



3.   LCG Operations Weekly Report (Slides) – H.Renshall

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meetings summaries are always available here:

3.1      Summary

This report covers the service for the two-week period 24 May to 6 June and therefore includes the first week of STEP09.


The GGUS ticket rate is normal and no alarm tickets were issued. The problems are fewer but more complex to solve.

There were three Central Service incidents that occurred:

-       A bug in the upgrade script from CASTOR 2.1.7 to CASTOR 2.1.8, carried out for LHCb on 27 May, and coupled with an earlier redefinition of LHCb pool attributes resulted in the loss of some 7500 files (of reference Monte Carlo data). 

-       A port failure on 1 June in the router connecting RAC5 to the GPN cut off production databases for several hours.

-       Following the 27 May LHCb incident, a further 6500 files were lost on 5 June during a manual operation intended to enable migration of tape0disk1 files so that they would become tape1disk1; the operation exposed a pre-existing CASTOR bug.


(Table of GGUS tickets per VO concerned not reproduced.)

Slide 4 shows the VO SAM tests for each Experiment; one can notice that:

-       IN2P3 downtime for the HPSS migration

-       NL-T1 down for a dCache rollback

-       CNAF for LHCb (slide 5): the plot mixes results for the CNAF Tier-1 and Tier-2; therefore the results are incorrect.


Note: CNAF is missing in the ATLAS plot!

3.2      Experiment Availability Issues

LHCb: CNAF – For some time the visualisation of the LHCb critical availability at CNAF randomly showed either the T1 or the T2. It now shows the T1 only, which exposed the fact that there were two tests that could not both succeed; hence the site was always red. This will be corrected. NB: all green in the later SAM talk.


ATLAS: ASGC – Conditions Database streams replication to ASGC was re-established and synchronised on the RAL instance on 4 June. However, since then read-only access to the ASGC conditions database from the worker nodes has been failing, stopping STEP09 activities. It was realised that in the long period following the Oracle reconfiguration at ASGC and the fire, the port numbers for Oracle access had been changed without directly informing ATLAS operations. Correcting this would mean recompiling/reconfiguring ATLAS modules during STEP09. As a workaround they are trying to set up a second Oracle ‘listener’, but as yet ASGC has hardly participated in STEP09 at the halfway stage. News on Tuesday morning: jobs have started working and a bulk pre-staging has been triggered, though with a high failure rate. There is a steady data flow from CERN at 70-80 MB/s.


ATLAS: FZK – FZK has announced that it is unlikely their tape layer will be able to participate in STEP09, though they are working hard on their problems. ATLAS has been working with disk-based data, but last week the FZK SE was working slowly and was in serious trouble over the weekend. Transfer rates are commensurate with a 5% Tier-1, not 10%, with 850 datasets (around 30 TB) to catch up, which will be challenging before the end of STEP09. FZK’s latest report is that they have built an alternative SAN fabric using old (decommissioned) switches. With this SAN fabric their tape connection to dCache is quite stable; however, they cannot achieve the rates necessary for STEP09.


LHCb + ATLAS: NL-T1 – Upgraded dCache to 1.9.2-5 on 25 May, then found gsidcap doors failing with large memory consumption (8 GB). Eventually downgraded back to dCache 1.9.0-10 on Wednesday 27 May.


ATLAS: IN2P3 – The HPSS migration finished well, as planned, with the scheduled intervention over at 18:00 on Thursday 4 June; they are now running a more recent version of HPSS on more powerful hardware.


ATLAS: T1-T1 transfer tests showed many FTS timeouts for the new large (3-5 GB) merged AOD files. FTS timeouts have been systematically increased at all sites, to typically 2 hours, with good success.


ATLAS: CERN – ATLAS discovered that small ‘user’ files are still being written to tape via the ‘default’ pool/service class.


ALICE: Batch jobs were failing to be scheduled at IN2P3 – traced to a BDII publishing failure at GRIF which hosts the French WMS for ALICE. This in turn has been reported as due to a site cooling problem and an alternative BDII is now being used.


LHCb: NL-T1: Had an NFS problem at NL-T1 affecting LHCb software access on worker nodes after a user somehow caused NFS locks (which should not happen). NL-T1 had to reboot the lock server plus the concerned worker nodes.

ALL: The Experiments would like to understand what multi-VO activity is happening at sites and what its effects are.

3.3      Summary of STEP09


Summary of STEP09 GOALS


-       T0-T1 data replication (100 MB/s)

-       Reprocessing with data recalled from tape at T1


Parallel test of all main ATLAS computing activities at the nominal data-taking rate:

-       Export data from T0

-       Reprocessing and reconstruction at T1, tape reading and writing, post-reprocessing data export to T1 and further to T2

-       Simulation at T2 (real MC)

-       Analysis at T2 using 50% of T2 CPU, 25% pilot submission, 25% submitted via WMS


-       T0 multi-VO tape recording

-       T1: special focus on tape operations, data archiving and pre-staging

-       Data transfer

-       Analysis at T2


-       Data injection into HLT

-       Data distribution to Tier-1

-       Reconstruction at Tier-1


“Though there are issues discovered on daily basis, so far STEP09 activities look good” - IT-GS “C5” report


Below are the rates reached for Tier-0 multi-VO tape writing:

-       ATLAS (320 MB/s)

-       CMS (600 MB/s), in multi-day bursts

-       LHCb (70 MB/s)

-       ALICE will start in the next days.


Tape writing has now been running for a few days; next:

-       CMS stops today (for the weekly mid-week global run) and resumes at midnight on 11 June.

-       ALICE (100 MB/s) should also start on 11 June.

-       ATLAS stops at midnight on Friday 12 June.


A longer overlap would have been preferable, but FIO reports success so far.

See GDB report in June for detailed performance.


I.Bird asked why CMS reached such a high rate.

M.Kasemann replied that they wanted to push the rate until the tape system broke; so far they have not managed to.

3.4      Central Services Issues

3.4.1       Oracle BigID Issues

Oracle has released a first version of the patch for the “BigID” issue (cursor confusion, in some conditions, when parsing a statement that had already been parsed and whose execution plan has been aged out of the shared pool). It was immediately identified that the produced patch conflicts with another important patch that we deploy. Oracle is working on a “merge patch” which should be made available in the coming days. In the meantime a work-around has been deployed in most of the instances at CERN as part of CASTOR 2.1.8-8.

3.4.2       Migration to LHCb CASTOR 2.1.8: Lost 7500 Files.

A bug in the upgrade script from CASTOR 2.1.7 to CASTOR 2.1.8 converted Disk 1 Tape 1 (D1T1) pools to Disk 0 Tape 1 (D0T1) during the upgrade performed on 27 May. Garbage collection was then activated in pools that had a tape copy of the disk data. Unfortunately, one of the LHCb disk pools happened to contain files with no copy on tape, as these files were created at a time when the pool was defined as Disk 1 Tape 0 (D1T0, disk only). As a consequence, 7564 files were lost. A post mortem from FIO is available, as mentioned in their C5 report.

3.4.3       LHCb CASTOR Service Degradation: Lost 6500 Files

On 5 June a script was run to convert the remaining misidentified LHCb tape0disk1 files to tape1disk1 files so that they would be migrated to tape. Due to wrong logic in the check on the number of replicas triggered by this operation, 6548 tape0disk1 files that had multiple disk copies in a single service class were lost.

Probably triggered by the same operation, corrupted sub-requests blocked the three instances of the Job Manager. For this reason the service was degraded from 11:30 to 17:00.

Post Mortem is at

As a follow-up, whenever a bulk operation is to be executed on a large number of files, the standard practice should be to first run it on a small subsample of files (a few hundred), which have first been safely backed up either outside CASTOR or inside it (as different files).
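As an illustration of that practice, a minimal shell sketch follows. It is not the actual FIO procedure: the file names, the simulated file list, and the counts are all invented for the example, and a plain copy stands in for the real bulk operation.

```shell
# Hedged sketch of the "subsample first" practice: back up a few hundred
# files and apply the bulk operation to them before touching the full set.
set -e
workdir=$(mktemp -d)
cd "$workdir"
# Stand-in for the full list of files targeted by the bulk operation.
for i in $(seq 1 1000); do echo "data $i" > "file_$i.dat"; done
ls file_*.dat > all_files.txt
# Step 1: select a small subsample (a few hundred files).
head -n 200 all_files.txt > subsample.txt
# Step 2: back the subsample up, outside the pool, before changing anything.
mkdir -p backup
while read -r f; do cp "$f" "backup/$f"; done < subsample.txt
# Step 3: run the bulk operation on the subsample only, verify the result,
# and only then proceed with the remaining files.
echo "backed up $(ls backup | wc -l) of $(wc -l < all_files.txt) files"
```

The design point is simply that the destructive step never sees a file that does not already have a safe copy, which is exactly the gap that caused the two LHCb losses above.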

3.4.4       Network Failure: Lost Connectivity to Prod DB

On Monday 1 June, around 08:20, the XFP (hot-pluggable optical transceiver) in the router port which connects the RAC5 public switch to the General Purpose Network failed, causing unavailability of several production databases, including ATLR, COMPR, ATLDSC and LHCBDSC. Data replication from online to offline (ATLAS) and from Tier-0 to Tier-1s (ATLAS and LHCb) was also affected. The hardware problem was resolved around 10:00 and all the aforementioned databases became available about 15 minutes later. Streams replication was restarted around 12:00.

During the whole morning connection anomalies were also possible for the ATONR database (ATLAS online), which is connected to the affected switch for monitoring purposes. The XFP failure caused one of the Oracle listener processes to die. The problem was fixed around 12:30.

Detailed post-mortem has been published at


I.Bird asked whether the amounts of data transferred and the rates achieved are available.

H.Renshall replied that the details are going to be presented at the GDB on the following day.


A.Heiss reported a misconfiguration of the GridFTP server, with the new 10 Gb cards, in the dCache setup for ATLAS. The high traffic caused the problem mentioned above, but it is now solved and FZK has caught up on the work it was asked to perform.


I.Fisk asked whether the problems related to ATLAS also apply to CMS and, if not, why not.

H.Renshall replied that ATLAS reported them more clearly; but they actually affected all Experiments.


J.Shiers stated that it was agreed several times that Sites should report and be present at the daily meetings. Sites must participate in the daily meetings; even Sites that disagree should follow the MB decisions. At the daily meeting ATLAS and CMS are often very frustrated by the lack of news and participation from several Sites.


J.Gordon noted that it was also agreed that Experiments should provide mailing lists through which Sites can contact them urgently if needed.


4.   ALICE VO SAM Tests Review (Comments; Slides) - P.Mendez Lorenzo


4.1      SAM and ALICE

The current ALICE tests affect two services: the LCG-CE and the VOBOXES. ALICE has agreed on a major upgrade of the SAM infrastructure for their specific tests, in collaboration with the VECC developers.


It will affect the following services:

-       VOBOXES: Establishment of a more robust and upgraded test suite. GOAL: Include VOBOXES in the availability calculations

-       CREAM-CE: Inclusion of a specific ALICE test suite.

-       WMS: No specific tests planned.   


These tests must be in place before the data taking period.



The LCG-CE sensor is included in the site availability calculation. ALICE uses the standard SAM test suite created for this service but executes it with ALICE credentials. Two critical tests are defined:

-       Job submission

-       Access to the shared sw area from the WN.

ALICE does not plan to change the test suite for this service.



A specific test suite and SAM infrastructure were created to fulfil ALICE’s requirements. The test suite is executed at each VOBOX using ALICE credentials.

Currently the test suite consists of 5 tests, all of them critical:

-       PSR: Status of the proxy server registration

-       SA: Status of the sw area access

-       UPR: Status of the user proxy registration

-       PM: Status of the proxy of the machine

-       PR: Status of the alice-b-x-proxy renewal service

This test suite ensures the maximum granularity to verify the generic VOBOX service. It can therefore be exported to any other VO, since it does not test any ALICE-specific service.


ALICE Dashboard

As slide 5 shows, there are no inconsistencies between the SAM reported information and that of Dashboard.


ALICE Plans for SAM

ALICE will restructure its SAM infrastructure, affecting the general wrapper and the site registration procedure in the SAM DB, in collaboration with the SAM developers.


This will be followed by improvements to the VOBOX test suite and the creation of a specific CREAM-CE test suite dedicated to direct job submission tests.



During the transition phase from the LCG-CE to the CREAM-CE ALICE is facing a mixed situation, not visible in SAM:

-       Sites providing both CREAM and LCG-CE

-       Sites providing only LCG-CE

-       Sites providing only CREAM-CE (not too many)


However, the SAM Critical Tests currently check the LCG-CE only. In order to reflect a more realistic situation, SAM will have to cover the status of the ALICE services at each site, i.e. include CREAM-CE tests, increasing the cases in which the CREAM-CE will have to act as the critical-test sensor and be included in the availability calculation.


In most cases the availability calculation will have to be based on the condition “LCG-CE OR CREAM-CE”. This is one of the SAM issues on which ALICE will work in the following months.


SAM and UI

Currently SAM is associated with a single specific account and the renewal of the user proxies is based on a cron job. The proposal is to include a proxy-renewal mechanism à la VOBOX. This procedure will ensure limited access to the UI (i.e. security is ensured) while allowing several developers to access the SAM UI (i.e. a single point of failure is also avoided).


I.Bird noted that if a Site is running the CREAM-CE the tests are meaningless for the moment, and asked when the new CREAM-CE tests will be in place.

P.Mendez Lorenzo confirmed this. The VOBOX tests can be provided quickly and the CREAM-CE tests should be in place before August 2009, but at each Site they will have to be tuned with the help of the SAM developers.

4.2      Comments to the ALICE SAM Test for May 2009

Below the SAM availability calculated by the ALICE specific tests.



Most issues are very small.





Good behaviour during the period


Good behaviour during the period

FR-CCIN2P3 in maintenance on the 4th of May, showing in addition some matchmaking problems on the 5th of May. Similar situation for


On the 22nd there were temporary problems with the WMS used for the submission: “Operation failed while trying to contact the service”.


On the 31st there were problems with both CEs at submission time: local authentication errors.


On the 31st both ce04 and ce06 showed some issues at the local CEs due to matchmaking problems (the WMS could not find available resources).


No access to the tests.

The reasons are unknown in SAM and in GridView.


On the 3rd of May there were temporary problems with the submitted jobs: request expiration, probably at the level of the WMS.


On the 24th and 25th there were several CE problems associated with the following issue: the job failed when the job manager attempted to run it (both days).


2nd May: Request expiration, probably associated to the WMS.


On the 18th a test maintenance status was announced. Matchmaking problems were announced by the WMS on the 20th of May: available resources not found.


On the 28th there was a test maintenance announcement.



5.   ATLAS VO SAM Tests Review (Slides) – A.Di Girolamo



Below the SAM availability calculated by the ATLAS specific tests.



A.Di Girolamo commented on the results for each Site. There have been no changes in the tests since the last presentation, in January.


A few issues were encountered:


DE-KIT: SRM problems caused many time-outs and made the ATLAS tests fail.

NL-T1: The report includes both SARA and NIKHEF. Availability was 66%, mostly due to SRM timeouts caused by storage overload.

TW-ASGC and US-T1-BNL: The numbers are incorrect. For BNL the naming schema does not match the one used by WLCG and the results are not available; J.Casey is following the issue. The BNL results are published under a different name and as a Tier-3.


M.Ernst added that BNL is at the last step in order to make the OSG naming consistent with the EGEE conventions. The work should be completed before the ATLAS Cosmic Runs start. It is a critical change to make and they are going to make it before end of the month.


Qin Gang noted that the ASGC problem has been solved since the beginning of May.

A.Di Girolamo agreed and added that in March and April ASGC was publishing data from the Tier-2 and not only for ATLAS.


6.   LHCb VO SAM Tests Review (Comments; Slides) – R.Santinelli


6.1      LHCb SAM Suite

R.Santinelli described the current set of LHCb tests used in the dashboard and as SAM Critical Tests. In the LHCb dashboard below, the columns in red are the SAM tests; the other columns appear only in the LHCb dashboard.



The tests in the red columns are simple commands that, when they fail, very likely indicate an issue at the Site. The other tests are LHCb-specific and can fail for causes other than a Site problem. For instance, a bug in an application (Boole, Brunel, DaVinci, etc.) could raise an alarm in the dashboard without the Site necessarily being at fault.


In the LHCb Dashboard above there are the following availability calculations:


1.    WLCG Availability (FCR critical tests/sensors)

2.    LHCb Critical Availability

3.    T2 LHCb Critical Availability (same as point 2, but only CE tests)

4.    LHCb Desired Availability, i.e. point 2 plus:

-       Conditions DB tests (CE)

-       Pilot credential mapping (CE)

-       LFC_C and LFC_L basic ops tests + stream replication tests

-       FTS tests (basic ops tests)

-       SRMv2 unit test for all space tokens defined

6.2      Comments on the LHCb SAM Results in May 2009

Below the SAM availability calculated by the LHCb specific tests.


Except for SARA on the 24th and 25th of May (when LHCb had real SAM test failures), the remaining non-green bins are due to scheduled or unscheduled downtimes recorded in the GOC-DB.



Everything is reported in the GOCDB database. R.Santinelli presented the data in the table below.


Note: For CNAF the tests are different from the ones presented by H.Renshall, where CNAF was all red while here it is all green.




(Downtime table: the site and date columns are not reproduced; the reported causes, in order, were:)

-       CASTOR upgrade (intervention published in GOCDB)

-       Upgrade of the dCache tape system (intervention published in GOCDB)

-       Planned migration of dCache (outage) (published in GOCDB)

-       Networking (new router) (published in GOCDB)

-       Cooling failure (published in GOCDB)

-       Cooling problem affecting both CEs and SEs (published in GOCDB)

-       Downtime due to cooling, extended (published in GOCDB)

-       Downtime due to cooling, extended again (published in GOCDB)

-       Another cooling problem (published in GOCDB)

-       MSS upgrade

-       Changing the SRM machine power source (published in GOCDB)

-       24-05/25-05: problems with the tape backend again (published in GOCDB)

-       Some dCache pool nodes will undergo a firmware upgrade

-       SRM down for many reasons (published in GOCDB)

-       Issues with dCache after upgrade (published in GOCDB)

-       CE at SARA failing job submission: “Got a job held event, reason: Globus error 17: the job failed when the job manager attempted to run it”

-       CE down (published in GOCDB)

-       SRM down (published in GOCDB)

-       One CE down (published in GOCDB)

-       One CE down (published in GOCDB)

-       CASTOR outage (extension of scheduled downtime) (published in GOCDB)

-       Reconfiguration of the router (published in GOCDB)

-       Restart of the router in the T1 network (published in GOCDB)


G.Merino commented that the message seems to be that Sites should only worry about the FCR tests and not the dashboards. But CMS regularly presents “Site readiness” metrics based on the dashboard metrics, not on the FCR tests only, and readiness values are assigned to the Sites based on the dashboard values. Sites can only see the FCR tests, so a clearer statement is needed.


M.Kasemann noted that CMS distinguishes tests for WLCG from tests for CMS. The CMS tests are not assigned to the Sites. The “readiness and commissioning” attributes are not related to the Site availability but to whether CMS can run its applications. CMS needs both values.


A.Heiss commented that it is often useful for the Sites to also see the non-SAM tests.


M.Schulz added that some SAM tests are very complex and a single test can check many services. When a test fails it is difficult to find out which services failed. Tests should be more granular.


7.   User Analysis Working Group (Slides) – M.Schulz


M.Schulz presented the status of the User Analysis WG, which has now terminated its functions.

7.1      Proposed Mandate

The proposed mandate of the WG at the WLCG Workshop was:

-       Focus on collecting performance relevant information

-       Work on experiment specific benchmarks

-       I/O for analysis

More details are on slide 1.

7.2      Progress after the Workshop

At the WLCG Workshop the proposal above was agreed. To improve the involvement of OSG, R.Pordes is now co-chairing the working group.


Experiments, sites and SRM providers worked on understanding the problems. The analysis workload was collected at several Sites, and the first measurements of SRM command frequencies and timing were presented at the last GDB and at the DESY workshop. All the SRM providers also agreed on common metrics. More complete measurements can be expected soon.


Discussions on Xrootd as a common access protocol took place, but no agreement to use it as the only protocol was reached.


ATLAS ran the HammerCloud tests against all T2s and worked on the tuning of access parameters:

-       Different access strategies for different sites

-       Some of the root-tree caching issues have been fixed. The detailed analysis of application I/O vs. network load will need to be repeated.


CMS is working on similar tests.


LHCb carried out a large-scale analysis exercise:

-       Very high failure rates (up to 30% on the best sites) were observed, due to data-access errors

-       Ongoing follow-up by the data management people


Experiments defined, together with the T2s, the split of shares between analysis and production use. S.Traylen started working on a template configuration as a reference for Sites.


Experiments documented the I/O requirements for their workflows. However, the rates are from the application perspective; as the tests demonstrated, the I/O seen by the fabric can be several times larger.


STEP09 is the first large-scale test where all activities run in parallel. This will provide the data needed to understand the interference between analysis and other activities, and data for site-wide tuning: WAN access to storage, analysis access, reconstruction access.

7.3      Next Steps

The critical questions that are relevant for the user analysis are followed up by the experiments, sites and storage system developers.


The results are communicated frequently at the relevant WLCG meetings (operations, GDB, etc.)


Solutions for some open issues, such as support for proper ACLs and quotas, will not be available before the run starts. The focus is on tuning sites and services.


It is not clear what role the Analysis WG should play. Collect information and point out open issues?


I.Bird proposed that all technical issues identified should be followed up by the WLCG Technical Forum. The User Analysis WG can consider its mandate completed.

Sites and Experiments still have to nominate some of their members for the Technical Forum.


F.Carminati proposed to thank M.Schulz for chairing the WG. The MB agreed.


8.   Update of the High Level Milestones (WLCG_High_Level_Milestones_20090608.pdf) – A.Aimar



Postponed to next week.


9.   AOB



A.Heiss asked what the deadlines are for the 2010 resources.


I.Bird replied that the LHCC, the Experiments and the RSG will agree on the information soon. They are aware that it is needed by the Sites.

M.Kasemann added that in the next two weeks ATLAS and CMS are meeting the Scrutiny group to discuss also those requirements.


10.    Summary of New Actions