LCG Management Board
Tuesday 14 October 2008 16:00-17:00 – Phone Meeting
(Version 1 – 20.10.2008)
A.Aimar (notes), I.Bird(chair), D.Barberis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, I.Fisk, F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, O.Keeble, M.Lamanna, E.Laure, P.Mato, G.Merino, A.Pace, H.Renshall, R.Tafirout, J.Templon
Next Meeting: Wednesday 22 October 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 User Analysis Working Group (Slides) – I.Bird
An updated version of the mandate of the User Analysis WG was circulated. The membership is also being defined.
From Slide 2:
1.3 Client distribution proposal (More Information) – O.Keeble
An updated proposal for the client distribution was distributed.
This proposal was discussed at the GDB and will be scheduled again for the next GDB meeting.
J.Gordon will distribute the proposal to the GDB mailing list.
Comments on the proposal for the client distribution should be sent to O.Keeble.
2. Action List Review (List of actions)
About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm their interest.
About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.
DONE. SCAS was distributed; it will need to be certified and deployed.
SCAS certified and tested.
- DONE. A document describing the shares wanted by ATLAS
- DONE. Selected sites should deploy it and someone should follow it up.
- ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.
Being discussed in ATLAS. No news.
J.Templon reported that some INFN sites have not deployed it correctly and report an ERT (expected response time) of zero.
L.Dell’Agnello replied that he will follow up the issue.
In today’s agenda.
Updated proposal in Section 1.2 of these minutes.
Not done, but it is on the agenda for the WLCG Workshop in November.
In today’s agenda. Proposal distributed by O.Keeble.
The distributed proposal will be discussed at the GDB.
On the 22 October there will be a special meeting to prepare the Overview Board.
J.Templon asked whether this issue is also discussed at the Scrutiny group.
I.Bird replied that the Scrutiny group is studying the models and the requirements of the VOs. The preparation for next week concerns a statement to the Overview Board and RRB on the changes (or lack thereof) to the 2009 pledges and the 2009 Experiments' requirements caused by the delayed LHC schedule.
3. LCG Operations Weekly Report (Slides) – H.Renshall
Summary of status and progress of the LCG Operations. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
Regular local participation from all IT physics groups and systematic remote participation by BNL, RAL, PIC, NIKHEF and GRIF. Other sites do not participate in all meetings.
3.1 Sites Reports
LHCb’s LFC re-cataloging of srmv1 to srmv2 file names was completed successfully on Tuesday. Srmv1 services were then progressively stopped.
A new CASTOR hot-fix release for version 2.1.7 (-19-1) has been deployed on c2atlas. This release fixes the behaviour of pools that have a tape backend but refuse new requests when they become full. A bug in this area was found on the pre-production system without having been detected during the certification phase; this is explained by the absence of such a pool in the certification system (this has been fixed). The deployment of 2.1.7-19 on ATLAS had to be postponed by 24h due to this incident. Note that hotfix -2, affecting updates on replicated disk pools, has since been deployed for all experiments.
There was an unannounced upgrade of the AFS UI (job_submit scripts) as part of a VOMS certificate change at 15.00 on Thursday 9 October; it did not complete properly and stopped the UI from working. Fixed by 17.00.
Cosmics exports to RAL over the weekend were throttled back to 50-60 MB/s instead of 80-90 MB/s, leading to a 40 TB backlog on Monday. CASTOR then went down with a 60 TB backlog, so ATLAS excluded RAL from cosmics distribution until it recovered.
LHCb’s LFC re-cataloguing of srmv1 to srmv2 entries ran very slowly (5 hours instead of 10 minutes), so it was suspended and a complete copy was taken from CERN instead.
J.Gordon added that the update was completed but the copy was also taken from CERN. Oracle was not using all the memory available for caching.
The US PANDA service at BNL (previously the primary instance) is to be stopped by the end of 2008 and CERN will become the primary instance.
Amsterdam area power failure at 16.30 on Saturday 5th October. srm.grid.sara.nl was reported as partially available from 17.00 on Sunday 6th, but transfers were still failing, so ATLAS quarantined the site until late on Monday 7th. A Post-Mortem is in preparation.
Bug found in SLC4 FTS: large error messages crashed the server – fixed on Wednesday 8 October. More serious is the bug in the area of no longer returning detailed error messages to the FTS API – apparently this is known since some time. A fix is now in test but we suggest a Post-Mortem on this. Meanwhile ATLAS and CMS transferred about 20TB each last week.
gLite 3.1 Update 33 contained a BDII configuration bug whereby sites not using YAIM had a missing file system root variable, causing an incorrect chown. A new version is being certified to go to PPS this week. This bug took down the GRIF site. Flickering of resources in the information system was also seen at several sites; the meta-rpm was hence withdrawn on Friday.
3.2 Experiments Reports
Have been testing SLC4 FTS and CREAM-CE. Prepared a site requirements document for CREAM-CE deployment.
All LFC now converted from srmv1 to srmv2 endpoints allowing sites to shut down the srm1 services.
Over the weekend of 11/12 October suffered an overload of the CERN CASTOR server after submitting 180,000 pre-stage requests, which hit a server that was being drained for hardware replacement. A Post-Mortem is in preparation.
The long-running problem of insufficient LSF batch slots being occupied, given their shares at CERN, was finally understood. Another major experiment was requesting unneeded swap memory resources, which caused LSF not to schedule jobs on otherwise free slots.
The magnet was ramped up to 3.8 Tesla over the weekend of 11/12 October; cosmics-taking (the CRAFT1 run) is scheduled to run until 27 October.
Silent file corruption discovered at CERN on one disk server in CASTORCMS/CMSCAF: FIO have started to draft a postmortem for this incident https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008
- For the affected files the checksum in DBS does not match the file when it is retrieved from CASTOR, and ROOT cannot open the file
- Traced to a defective disk on a file server that was not clearly reporting errors.
- Using the checksums provided by CMS, FIO found that 29 files had a corrupted copy on tape. All 29 files could be repaired because the original disk copy was still available
- Another 2 files, for which they did not have the CMS checksum, could also be repaired by finding their original copy using the CASTOR log files.
- 4 user files are unrepairable because they were originally created in a Disk0Tape1 service class (DEFAULT) and then recreated in CMSCAF, which is Disk1Tape0. This is a forbidden transition but unfortunately a CASTOR bug allows it to happen (a fix will be deployed on 14/10). The original tape copies are still available and the users will be informed.
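The corruption above was found by comparing catalogue checksums against freshly computed ones. A minimal sketch of that kind of check, assuming Adler-32 checksums (as used by CASTOR) and an illustrative catalogue dictionary; this is not the actual CMS/FIO tooling:

```python
# Hypothetical integrity check: recompute each file's Adler-32 checksum
# and compare it with the value recorded in a catalogue.
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 starts from 1
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

def find_corrupted(catalogue):
    """Return paths whose on-disk checksum differs from the catalogue one.

    catalogue: dict mapping file path -> expected Adler-32 value.
    """
    return [path for path, expected in catalogue.items()
            if adler32_of(path) != expected]
```

Files flagged this way can then be cross-checked against the other copies (disk, tape, or the original at CERN), as was done for the 29 repaired files.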
Extensive PreStageTests, see also https://twiki.cern.ch/twiki/bin/view/Atlas/PreStageTests
- Pre-staging is functional at all CASTOR sites (CNAF, RAL, CERN, ASGC)
- dCache sites are having more problems, but BNL, TRIUMF and NDGF should be OK
- Realistic reprocessing rates from tape look hard
- Phase II will start at T1/T2 clouds which can pre-stage
PANDA instance at CERN (the high level workload and pilot jobs manager) planned to become the sole one by the end of 2008.
- Will need existing UI to be expanded from 1 to 3 load balanced servers
- The existing core of 3 MySQL servers + 3 hot spares is to be migrated to Oracle
- Other servers to be hardened
- Main developer will be seconded to CERN from BNL from January
There are many software upgrades, and this is not expected to slow down.
There are many miscellaneous failures, which will also continue.
Maintaining good-quality LCG services over the next months will require constant vigilance and be labour-intensive.
I.Bird noted that a summary of the main achievements and issues of the LCG Services should be presented at the Overview Board.
4. Benchmarking and Pledges in New CPU Units (Slides)
I.Bird proposed the mandate for the Benchmarking Group.
From Slide 1:
The transition needs to be prepared carefully and the units need to be clarified.
The costs need to be equivalent to the current ones but in the new units.
The Experiments need to provide their conversion factor.
Ph.Charpentier added that the conversion factor needs to be calculated for the platforms used and for all types of machines provided at the Sites.
M.Kasemann noted that some sites currently take the values from the vendors and others really run the benchmarks. In the future the same approach should be used by all sites so that the Experiments do not have these uncertainties.
J.Gordon replied that this benchmark is not available from the vendors and the Sites ought to run the benchmarks.
D.Barberis also asked that the sites run the benchmarks.
I.Bird noted that if a site runs some benchmarks they could be shared on a web page. For exactly the same types of machine there is no need to benchmark them again.
J.Gordon asked that an executive summary be prepared for Tier-1 and Tier-2 sites. It should explain why the benchmarks were chosen and that the Sites need to use them.
G.Merino stated that the above mandate is clear and the group is just being created. Participants are being contacted. Concluding by November seems unrealistic. At the RRB one can say that the agreement on the group and mandate is reached. The results will be ready for the following RRB meeting.
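The conversion being discussed amounts to re-expressing capacities quoted in the old unit (kSI2K) in the new benchmark unit via site-measured factors. A minimal sketch, under the assumption that each site measures one factor per machine type and weights them by capacity; the numbers and function names are purely illustrative:

```python
# Illustrative pledge conversion from kSI2K to a new benchmark unit.
# Real conversion factors come from actually running the benchmark on
# each machine type, as requested in the discussion above.

def convert_pledge(pledge_ksi2k, new_units_per_ksi2k):
    """Convert a CPU pledge from kSI2K to the new benchmark unit."""
    return pledge_ksi2k * new_units_per_ksi2k

def site_conversion_factor(machine_types):
    """Capacity-weighted average factor for a heterogeneous site.

    machine_types: list of (capacity_ksi2k, measured_factor) pairs,
    one per machine type present at the site.
    """
    total = sum(capacity for capacity, _ in machine_types)
    weighted = sum(capacity * factor for capacity, factor in machine_types)
    return weighted / total
```

For identical machine types, results could be shared (e.g. on a web page, as suggested above) rather than re-measured at every site.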
5. Accounting Reports – I.Bird
Over the summer there was the request to improve the Tier-2 accounting reports. Below is the proposal. (from Slide 2)
J.Gordon noted that adding wall-clock values will double the number of lines in the reports, because there are already too many columns.
I.Bird replied that the important point is that one single report includes all values, giving an overview that also covers wall-clock time and the CPU/wall-clock ratio.
D.Barberis asked that an export in Excel would be provided.
I.Bird replied that an export in CSV is already available and one should check whether it is sufficient.
J.Gordon concluded that he will ask CESGA to propose some suitable format.
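As a sketch of how the requested overview could be derived from such a CSV export: sum CPU and wall-clock time per site and report their ratio. The column names below are hypothetical, not the actual CESGA export format:

```python
# Hypothetical post-processing of a CSV accounting export: per-site
# CPU/wall-clock ratio, the extra value requested for the reports.
import csv
from collections import defaultdict

def cpu_wallclock_ratios(csv_path):
    """Return {site: cpu_hours / wall_hours} from an accounting CSV.

    Assumed columns (illustrative): site, cpu_hours, wall_hours.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            cpu[row["site"]] += float(row["cpu_hours"])
            wall[row["site"]] += float(row["wall_hours"])
    # A ratio close to 1.0 means jobs used their batch slots efficiently.
    return {site: cpu[site] / wall[site] for site in cpu if wall[site] > 0}
```

Presenting the ratio alongside the raw numbers avoids doubling the number of report lines while still giving the overview asked for.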
6. Middleware Planning (Slides) – O.Keeble
After the pre-GDB meeting the plan was modified and presented to the GDB in October.
- Status - A problem has been found in the recent SL4 release (undeployed) so a fixing iteration will be necessary.
- Integration will set up an SL5 build to get an idea of its potential, but its deployment is not currently planned.
- Target - full deployment of 2.1 on SL4
- Here we should be more aggressive:
- LCG-CE inherently problematic for analysis
- If the use case is direct submission with no proxy renewal, CREAM is basically ready
- Proxy renewal should be fixed in the simplest possible way (reproduce the lcg-CE solution, suitable for different users)
- WMS submission will come with ICE, timescale months
- Target – maximum availability in parallel with lcg-CE
F.Hernandez asked by when CREAM could/will be deployed.
O.Keeble replied that CREAM is already deployed at some sites and is suitable when there is no need of proxy renewal and where the submission is directly to the CE (e.g. for ALICE). For the solution with proxy renewal there are changes needed to the WMS.
Ph.Charpentier added that LHCb is also interested in direct submission, and support for it will be added to DIRAC.
M.Lamanna asked the status of submission to CREAM via CondorG.
F.Giacomini replied that they are in contact with the Condor team, who agreed to implement it, but no time scale is known. Currently only the CREAM client exists, or one has to write one's own client.
- Status - FIO now has a first installation at CERN in 32 and 64 bits, which will be tested by the experiments.
- Target – have SL5 available on the infrastructure in parallel to SL4
- We should also continue to pursue the python2.5 and alternative compiler stuff, but this can be added subsequently.
Multiple parallel versions of middleware available on the WN
- Status - at the moment it is not easy to install or use multiple parallel versions of the middleware at a site. While the multi middleware versions and multi compiler support are not disruptive, they require some changes on the packaging side and a small adaptation on the user side.
- Target - it seems advisable to introduce this relatively shortly after the bare-bones WN on SL5.
- Target - enabling of multi-user pilot jobs via glexec. This could conceivably be via another means than SCAS, but this would have to be decided asap. Patch #2511 has arrived
- Status - Glue2 is awaiting final validation at OGF, expected in November.
- Target - we should try to get the new schema deployed to the BDIIs so we can iron out initial deployment glitches, leaving us with a working but unpopulated Glue2 info system in parallel to 1.3. Info providers could subsequently be upgraded gradually, as could clients.
- Status - A set of changes to rationalise publishing of heterogeneous computing resources is envisaged. A full roadmap will be published by Steve Traylen this week. The first phase will be the deployment of the new tools, enabling simply the current situation. Subsequent phases then take advantage of the new tools.
- Target - the first phase as described above.
- Status: Patched WMS (fixing bug #39641 & bug #32345) within 1 week
- Target: This patch should be deployed
- ICE to submit to CREAM: Not required for certification. ICE will be added in a subsequent update (but better before Feb. 2009)
- These are being back-ported to VDT 1.6. Important for dCache and FTS
Publishing of detailed service versions
- Several small improved information providers are in certification
- More could be added. Not very invasive, but potentially useful
The MB agreed on the proposed plan.
7. ATLAS QR Report (Slides) – D.Barberis
7.1 Organization News
The ATLAS Collaboration Board met last Friday and took the following decisions (among others):
- Dario Barberis was re-appointed as Computing Coordinator from March 2009 until February 2010
- David Quarrie was re-appointed as Software Project Leader from March 2009 until February 2010
- Kors Bos was elected Deputy Computing Coordinator from March 2009 until February 2010; he will become Computing Coordinator from March 2010 until February 2011
Hans von der Schmitt’s term in office as Database Coordinator ended on 30 September after 2.5 years. He is succeeded by:
- Giovanna Lehmann Miotto (CERN) for online databases
- Elizabeth Gallas (Oxford) for offline databases
7.2 Tier-0 and Data-taking Activities
ATLAS has been taking cosmic-ray data continuously for several months and will do so until 3 November, with only short breaks for detector work (and LHC data!). The Tier-0 is coping well with nominal data rates and processing tasks.
The occasional CASTOR glitches are usually sorted out with the CASTOR team within a very reasonable time.
In November hardware detector commissioning work will restart but cosmic data-taking will carry on with partial read-out.
Below is the data rate from the ATLAS online to the offline systems.
The tape queue can handle the amount of data and copy it to tape.
7.3 Data Export
ATLAS exports all raw and processed data from the Tier-0 to Tier-1s and Tier-2s according to the computing model. The system can sustain the peak rate of 1.2 GB/s for an indefinite time.
Data distribution patterns are periodically revised as data types (triggers) and processing needs change.
Data export summary from CERN 7-13 October (MB/s)
For instance, on 9-10 October the 1.2 GB/s rate was consistently maintained.
7.4 Pre-Staging Tests
During the summer ATLAS started pre-staging tests at all Tier-1s, recalling whole datasets at a time (up to 10 TB).
Performance varies a lot as tape back-ends are different at each site. After a few tries, most sites are mostly working even if there are outstanding (different) problems at PIC, FZK and SARA.
This exercise also showed that the number of available tape drives varies a lot from site to site. There is no point in having thousands of processing cores if they cannot be fed with data at the required rate.
For instance, CERN and IN2P3 could store all the datasets required, although with different patterns.
At other sites not all datasets were correctly distributed and the issue is being studied.
7.5 Database Access Issues
Early tests of database scalability did not indicate there would be any problem with reprocessing at Tier-1s.
More recent tests instead showed a serious limitation when more than a few tens (up to 100) of jobs start simultaneously, as they all access conditions data from the Oracle databases. Two factors differed between these tests: (1) Oracle Streams are now used to move data from CERN to the Tier-1s, and (2) DCS (Detector Control System) data are now accessed by the reconstruction tasks.
ATLAS started a task force to analyse data access patterns from the Oracle server side with ATLAS and CERN DBAs. The task force also covers instrumenting Athena to log database accesses and data volumes, and work with detector code developers to revise and optimise their database access.
ATLAS started to explore SQLite technology for reprocessing tasks: dump all data for a given run to an SQLite file and use it locally for all jobs. This reduces database access by a factor of several hundred (the number of files in a run).
Oracle will still be needed at the sites for other activities.
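The approach above can be sketched as follows, with an invented schema for illustration (the real ATLAS conditions payloads are more complex): the conditions rows for one run are dumped once into a local SQLite file, which each reprocessing job then opens instead of contacting Oracle.

```python
# Sketch of the per-run SQLite conditions dump described above.
# Table and column names are hypothetical.
import sqlite3

def dump_run_conditions(source_rows, sqlite_path):
    """Write conditions rows for one run into a local SQLite file.

    source_rows: iterable of (run, channel, payload) tuples, e.g. as
    fetched once per run from the central conditions database.
    """
    con = sqlite3.connect(sqlite_path)
    con.execute("CREATE TABLE IF NOT EXISTS conditions "
                "(run INTEGER, channel TEXT, payload BLOB)")
    con.executemany("INSERT INTO conditions VALUES (?, ?, ?)", source_rows)
    con.commit()
    con.close()

def read_conditions(sqlite_path, run):
    """What each job does locally: one cheap file open per job instead
    of one Oracle session per job."""
    con = sqlite3.connect(sqlite_path)
    rows = con.execute("SELECT channel, payload FROM conditions "
                       "WHERE run = ?", (run,)).fetchall()
    con.close()
    return rows
```

One dump serves every job processing that run's files, which is where the factor of several hundred in reduced database access comes from.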
7.6 ATLAS Disk Usage
Below are the disk pledges (dotted line), the installed capacity (green line) and the usage by the VOs at the Sites (ATLAS in light blue).
ATLAS has several upcoming software releases:
- 14.X.Y - will include bug fixes only, for HLT/Tier-0 and Grid operations
- 15.0.0 - February 2009. Will include feedback from the 2008 cosmic running and will be the base release for 2009 operations.
Cosmic runs with the complete detector will continue until early November 2008 and restart in late March 2009, while running with partial read-out will continue throughout at reduced rates.
For collision data: ATLAS will be ready from April 2009 as far as ATLAS Software & Computing is concerned.
T.Cass asked whether the ATLAS software will be certified on SL5.
D.Barberis confirmed that, because the sites will be migrating, the ATLAS software must run on both SL4 and SL5, in 32-bit versions; therefore the WNs must have the 32-bit compatibility libraries installed.
I.Bird asked whether the pre-staging issues are due to the lack of tape drives or the throughput is really not sufficient.
D.Barberis replied that it is probably a combination of both problems, and both disk-to-tape and tape-to-disk rates should be studied.
Ph.Charpentier asked how the staging will be scheduled (progressively, or by requesting as much as possible at once).
D.Barberis replied that CASTOR sites prefer to receive as many requests as possible to optimize the pre-staging, while dCache has a different approach and requests must be scheduled over time.
G.Merino asked how ATLAS will solve the issue of the pre-stage low rate. Will ATLAS repeat these tests?
D.Barberis replied that tests are ongoing and they will try with more coordination among activities (data taking, analysis, etc). They will define the achievable rate and discuss with each of the Sites in order to be ready in 2009.
J.Templon added that a dataflow diagram like the one LHCb provided is the best description for the Sites.
J.Gordon agreed that with the many concurrent activities (RAW data storage, pre-staging, reading) the Sites need the overall rates required by the Experiments.
ATLAS should provide the overall rates expected at each Site. (Same for the other VOs?)
9. Summary of New Actions
ATLAS should provide the overall rates expected at each Site. (Same for the other VOs?)