WLCG Management Board
Tuesday 21 April 2009 – MB Meeting – 16:00-17:00
(Version 1 – 26.4.2009)
A.Aimar (notes), O.Barring, I.Bird (chair), K.Bos, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout, J.Templon
Tuesday 28 April 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
The Experiments agreed to present their dataflow and rates at May’s GDB. This is due for next week.
L.Dell’Agnello summarized the phone call with R.Wartel discussing CNAF’s failure of the recent test alarms. The decision is that CNAF’s local operators and grid managers will be trained to cover future grid incidents. A written procedure will be made available to the people involved, and CNAF will internally simulate the test alarms.
I.Bird asked whether the Security team intends to re-run the challenges.
L.Dell’Agnello stated that in two weeks a document will explain the changes that CNAF is going to implement. In addition, one new security expert will also be involved in grid security.
On User Accounting:
- I.Bird asked that the APEL portal should have this.
- Distributed to the MB list: done by S.Foffano.
- To be done.
1. LCG Operations Weekly Report (Slides) – O.Barring
Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
1.1 GGUS Tickets
The GGUS ticket rate is back to normal. A few unscheduled interventions occurred during this period, but no serious events that triggered Service Incident Reports.
Once again Sites (e.g. IN2P3) stressed that Experiments should submit their requests only via GGUS tickets.
Two alarm tickets were opened this week:
ATLAS alarm to FZK (2009-04-16 18:07): The disk buffer in front of ATLASMCTAPE at FZK was full and could not catch up with the incoming rate. Details in slides 4 and 5.
A.Heiss asked whether that ATLAS problem should have been an alarm or not; maybe the WLCG could discuss it at the GDB. A T2-to-T1 transfer, as the issue was initially thought to be, should not have raised an alarm.
K.Bos noted that it is the Experiment that knows whether an issue is worth an alarm or not. Even a data transfer or MC production can be crucial, and the Experiment should judge the merit of the incident. A general clarification at the GDB would be useful.
LHCb alarm to FZK (2009-04-11 18:36): All LHCb jobs to ce-2-fzk.gridka.de were failing. Details in slide 6.
1.2 Sites Availability
Slide 7 shows the Tier-1 SAM availability from the Experiments’ perspective.
One can see that LHCb (bottom right matrix in the slides, and here below) had problems with a few Sites.
LHCb reported the problems during the Operations meetings:
- WMS submission failures were traced to problems with the short CRL of certificates issued by the CERN CA (thanks to M.Jouvin, GRIF). Fixed.
- The CNAF BDII was publishing wrong information, making match-making impossible. Fixed.
- The CERN CVS system was failing; the reason is unclear. Fixed?
- The job failures yesterday at NIKHEF and IN2P3 are now explained by the "root:" string pre-pended to the returned tURL.
Ph.Charpentier noted that this issue is still unclear and is not due to this reason.
- The problem of jobs crashing when accessing the LFC@CERN is still under investigation, but it seems that the thread pool in the LFC becomes exhausted due to the way CORAL accesses it. Understood?
Ph.Charpentier noted that it seems to be due to the “suboptimal” access to the LFC by CORAL.
1.3 Sites Issues and News
There was some confusion about the effect of using ‘At Risk’ for transparent interventions at the Sites. It is NOT counted as site downtime, and Sites should be aware of this.
BNL: degraded efficiency due to a large number of tape staging requests from the ATLAS production tasks (pile-up and HITS merging); this caused a high load on the dCache/pnfs server, resulting in an unacceptably high failure rate for DDM transfers.
CERN: Good news on the CASTOR Oracle BIGID problem. From https://savannah.cern.ch/support/?106879:
After joint work with Sebastien and excellent feedback from several people at Oracle, including Oracle development, it now looks clear that the problem is linked to the usage of "DML returning" statements accessed from OCCI. Basically it works for a single row, but with different types and combinations of single row / multiple rows it can fail and lead to issues like the Big Id issue.
Oracle has opened a documentation bug (public and accessible with a Metalink account) about the issue: “OCCI does not support 'returning into'…”
But for the time being a workaround must be found by the CASTOR team.
J.Gordon asked that, in the SAM matrices in Slide 7, GridView should specify which VO is represented.
Upon request from K.Bos, Qin Gang reported some news about ASGC:
- There are now two LTO3 tape drives online.
- The installation of the other six LTO4 tape drives will be two weeks late.
2. ALICE 2009Q1 QR Report (Slides) – Y.Schutz
Y.Schutz presented the 2009Q1 quarterly report for ALICE.
2.1 Data Taking and Processing
Because of planned interventions on the experiment, such as cabling modification and installation of additional detectors, ALICE data taking stopped in October 2008.
There will be short periods of data taking:
- cosmic rays only, with a subset of detectors, starting in April; this does not generate a large amount of data;
- cosmics with the complete detector will resume in June 2009.
Full data processing (cosmic) chain will be reactivated in June 2009 and will include:
- On line calibration and QA
- Collection of calibration data in OCDB
- On demand, not automatic, replication to Tier-1 Sites
- Data reconstruction at Tier-0, on demand re-reconstruction at Tier-1 Sites
- Data analysis on the Grid and on the existing analysis facilities at CAF and GSIAF, and possibly also at the LAF in Lyon.
ALICE also continues to test data transfers with periodic tests of FTS/FTD from Tier-0 to Tier-1 Sites.
2.2 Monte Carlo Data
Continuous production is in progress and is prioritized according to the LHC schedule:
- Small first physics productions
- pp minimum bias (large: 100 million events)
- pp various physics signals (20 different cycles)
- AA heavy ion data with lowest priority
End-user analysis is ongoing in ALICE and the number of users is increasing. Presently there are some 120 regular ALICE users doing analysis.
2.3 Software: AliRoot
The software for all detectors is ready and includes geometry, particle transport, raw data format, calibration, QA, reconstruction, etc. The new improved TPC tracking (HLT) and TRD reconstruction have been added.
Pile-up simulation and vertex reconstruction are implemented and being tested. Improvements and new implementations of the trigger simulation are ongoing.
There are now regular weekly reviews of new MC productions in order to review the usage of resources.
2.4 Analysis Facilities
The ALICE analysis framework has been finalized, with only minor fixes remaining. The Physics Working Group continuously develops and consolidates analysis algorithms, and the ALICE Analysis Train is now part of the Grid processing.
There are now several facilities being used by ALICE:
CAF and GSIAF:
- Regular ROOT/AliRoot and core PROOF updates
- In production as a high-availability, high-demand service; all new MC productions are validated on CAF
Lyon Analysis Facility (LAF):
- Under development; a PROOF-enabled farm is in preparation
2.5 ALICE Services
The stable version of AliEn is in operation.
Job submission is done through the WMS. Submission code updates and WMS patches are increasing the service stability. New gLite 3.2 WMS instances have been added (four at CERN, plus LAL and IPNO).
The CREAM CE deployment is ongoing; the most recent sites are CERN, Legnaro, IHEP and Nantes. It will likely meet the deadline (end June) for parallel deployment on all ALICE sites.
SLC5 is not in the plan but is highly desired by ALICE; all ALICE software has been ported to SL5.
Participation in STEP09 is being planned among the activities shown in the time line below.
Deployment of storage continues; individual site validation model is working very well.
ALICE would like to have a large CASTOR2 disk pool for RAW data registration. Work has started between IT/DM, IT/FIO and ALICE on setting up and testing a CASTOR2 instance with xrootd access for RAW data; the size of the pool will be O(1.5 PB).
It would use CASTOR v2.1.8 with the latest xrootd development. The current progress is a test pool to tune:
- writing from the DAQ P2 buffer (via xrootd),
- reconstruction by Grid jobs,
- copying from disk to tape,
- and, later, T0-to-T1 transfers via FTD/FTS.
The present and upcoming milestones for ALICE are:
- MS-129 Mar 09: Analysis train operational. Done; new wagons (algorithms from the physicists) are added and tested as they become available.
- MS-130 Jun 09: CREAM CE deployed at all ALICE sites
- MS-131 26 Jun 09: AliRoot release ready for data taking
I.Bird asked whether ALICE received any feedback, regarding their new requirements, from the Resources Scrutiny Group.
Y.Schutz replied that he had just answered the questions received from the RSG.
H.Renshall added that the RSG will meet on the following day.
3. STEP09 Metrics – A.Aimar
At the WLCG Workshop it was said that there should be clear metrics to measure STEP09’s achievements (tape rates, etc.).
Collecting common metrics has been attempted in the past and has failed because Sites have different configurations and can only collect different measurements.
Should we just have a list of URLs where Sites publish the metrics they can collect?
H.Renshall added that ATLAS is defining the rates they expect from the different Sites.
I.Bird stated that there should be a way to see what rates the Sites reach.
F.Hernandez noted that at IN2P3 these MSS metrics are only visible in the intranet, not publicly.
Next week there will be a round table, and Sites should explain how they will show the rates they reach, in order to prove that they are fulfilling the required STEP09 tape rates.
F.Hernandez added that the Sites have difficulty knowing what kind of data is written by the Experiments. The Sites can only obtain aggregated data, perhaps per Experiment, without distinguishing the kind of data.
K.Bos added that during the ATLAS challenge there were other ATLAS tape activities by another community, and ATLAS could not distinguish the tape activity within the Experiment itself.
This is a common problem that the Experiments have to solve.
4. Update High Level Milestones (HLM_20090406.pdf) – A.Aimar
The MB reviewed and commented on the future milestones in the HLM table attached.
4.1 Sites SLAs
ALICE and CMS still have to approve the SLAs at some Tier-1 Sites.
4.2 Pilot Jobs Frameworks
The Experiments’ frameworks are followed by M.Litmaath, who reports monthly at the GDB.
A.Aimar will update the status below when there is news.
4.3 Tier-2 and VO Sites Reliability Reports
A.Aimar will check the percentage of Tier-2 Sites above 95%.
From April 2009 we will start reporting on the VO SAM Sites Reliability.
4.4 SL5 Deployment
4.5 Tier-1 Sites Procurement
Will be updated after the RRB discussions.
4.6 SCAS/glExec Deployment
SCAS/glExec was certified weeks ago and is available for deployment.
M.Schulz noted that Sites should start installing it; just two or three Sites are not sufficient. Volunteer Sites, at Tier-1 or large Tier-2 level, are needed for stress testing by the Experiments.
Ph.Charpentier added that LHCb had started testing at FZK and NL-T1.
F.Hernandez reminded the MB that, as reported at the GDB by A.Retico, some Sites (e.g. IN2P3) need to adapt the current solution to their setup: glExec cannot be installed on a shared installation of the WNs.
I.Bird noted that the milestone for deployment at all Sites (WLCG-09-19) should maybe be moved. The issue will be clearer after the next GDB, and the MB will then agree on the milestone date.
J.Gordon concluded that he will ask A.Retico to proceed with more installations so that the issues at each Tier-1 Site are collected.
4.7 Accounting Milestones
J.Gordon reported that WLCG-09-02 is done.
J.Gordon reported that Sites are publishing the information now (WLCG-09-03). Some still publish “0” physical CPUs; we are in a transition period.
4.8 STEP09 Tier-1 Validation
Each Experiment should define its Sites’ validation criteria and the status should be reported and tracked in the HLM.
4.9 CREAM CE Rollout
Milestones proposed by M.Schulz. WLCG-09-25 is done.
Some Sites have already installed the CREAM CE (WLCG-09-26).
4.10 SRM Milestones
I.Bird noted that these milestones should be on hold until the SRM developers meet. At the last GDB, what is missing from each of the implementations was presented.
J.Gordon added that the Experiments’ feedback is needed and was requested at the GDB.
4.11 FTS Milestone
All Sites are running FTS on SL4 by now.
4.12 Metrics and Monitoring Milestones
As discussed earlier, the Sites should send to A.Aimar how they will make their MSS metrics available.
Here is the new table, updated after the discussion. HLM_20090427.pdf
6. Summary of New Actions