LCG Management Board

Date/Time

Tuesday 14 October 2008 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39175

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 20.10.2008)

Participants

A.Aimar (notes), I.Bird(chair), D.Barberis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, I.Fisk, F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, O.Keeble, M.Lamanna, E.Laure, P.Mato, G.Merino, A.Pace, H.Renshall, R.Tafirout, J.Templon 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Wednesday 22 October 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

1.2      User Analysis Working Group (Slides) – I.Bird

An updated version of the mandate of the User Analysis WG was circulated. The membership is also being defined.

 

From Slide 2:

 

Mandate:

-       Clarify and confirm the roles of the Tier 1, 2, 3 sites for analysis activities for each experiment; categorising the various types of analysis: batch or interactive, centrally organised or individual users.

 

-       List the minimal set of services, their characteristics (availability, scale, concurrent usage, users, etc.), or changes to existing services, essential to implement those models.  These must be prioritised clearly between “essential” and “desirable”.  The expectation is that this is addressing configuration of services, recommendations for service implementation, or new tools to assist managing or using the services rather than requests for new services.  Such developments must be clearly justified.

 

Deliverables:

-       Documented analysis models

-       Report on requirements for services and developments with priorities and timescales.

 

 

1.3      Client distribution proposal (More Information) – O.Keeble

An updated proposal for the client distribution was circulated.

 

https://twiki.cern.ch/twiki/bin/view/EGEE/ParallelMWClients

 

This proposal was discussed at the GDB and will be scheduled again for the next GDB meeting.

 

J.Gordon will distribute the proposal to the GDB mailing list.

Comments on the proposal for the client distribution should be sent to O.Keeble.

 

2.   Action List Review (List of actions) 
 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed PPS at CERN and LHCb will test it. Other sites that want to install it should confirm this.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

DONE. SCAS was distributed; it still needs to be certified and deployed.

 

New Action:

SCAS to be certified and tested.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       DONE. A document describing the shares wanted by ATLAS was provided.

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

Being discussed in ATLAS. No news.

 

J.Templon reported that some INFN sites have not deployed it correctly and report an ERT (expected response time) of zero.

L.Dell’Agnello replied that he will follow up the issue.

  • Converting Experiments' requirements and Sites' pledges to new CPU units.

 

In today’s agenda.

  • Form a working group for User Analysis for a strategy including T1 and T2 Sites.

Updated proposal in Section 1.2 of these minutes.

  • 23 Sept 2008 - Prepare a working group to evaluate possible SAM tests verifying the MoU metrics and requirements.

Not done, but it is on the agenda for the WLCG Workshop in November.

  • 8 Oct 2008 – O.Keeble sends to the MB a proposal on possible upgrades to middleware because of the LHC delays.

 

In today’s agenda. Proposal distributed by O.Keeble.

  • 13 Oct 2008 - O.Keeble will send an updated proposal for software distribution and update at the Tier-1 Sites.

 

The proposal was distributed and will be discussed at the GDB.

  • 22 Oct 2008 - By the last week of October the Experiments should provide their new estimates, due to the delay in LHC operations. The assumption is that LHC data taking starts in April-May 2009, as announced by the DG.

 

On 22 October there will be a special meeting to prepare for the Overview Board.

 

J.Templon asked whether this issue is also discussed in the Scrutiny Group.

I.Bird replied that the Scrutiny Group is studying the models and the requirements of the VOs. The preparation for next week is about a statement to the Overview Board and RRB on the changes (or lack of them) caused by the delayed LHC schedule to the 2009 pledges and the 2009 Experiments' requirements.

 

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall

 

Summary of status and progress of the LCG Operations. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

There is regular local participation from all IT physics groups and systematic remote participation by BNL, RAL, PIC, NIKHEF and GRIF. Other sites do not participate in all meetings.

3.1      Sites Reports

 

CERN

LHCb’s LFC re-cataloging of srmv1 to srmv2 file names was completed successfully on Tuesday. Srmv1 services were then progressively stopped.

 

A new CASTOR hot fix release for the 2.1.7 version (-19-1) has been deployed on c2atlas. This release fixed the behaviour of pools that have a tape backend but which refused new requests when they became full. A bug in this area was found on the pre-production system without having been detected during the certification phase; this is explained by the absence of such a pool in the certification system (this has since been fixed). The deployment of 2.1.7-19 on ATLAS had to be postponed by 24h due to this incident. Note that hotfix -2, affecting updates on replicated disk pools, has since been deployed for all experiments.

 

There was an unannounced upgrade of the AFS UI (job_submit scripts), as part of a VOMS certificate change, at 15.00 on Thursday 9 October; it did not complete properly and stopped the UI from working. It was fixed by 17.00.

 

RAL

Cosmics exports to RAL from the weekend were throttled back to 50-60 MB/s instead of 80-90 MB/s, leading to a 40 TB backlog on Monday. CASTOR then went down with a 60 TB backlog, so ATLAS excluded RAL from cosmics distribution until it recovered.

 

LHCb’s LFC re-cataloguing of srmv1 to srmv2 entries ran very slowly (5 hours instead of 10 minutes), so it was suspended and a complete copy was taken from CERN instead.

 

J.Gordon added that the update was completed but the copy was also taken from CERN. Oracle was not using all the memory available for caching.

 

BNL:

The US PANDA service at BNL (previously the primary instance) is to be stopped by the end of 2008, and CERN will become the primary instance.

 

NL-T1

There was a power failure in the Amsterdam area at 16.30 on Saturday 5th October. srm.grid.sara.nl was reported as partially available from 17.00 on Sunday 6th, but transfers were still failing, so ATLAS quarantined the site until late on Monday 7th. A Post-Mortem is in preparation.

 

General:

A bug was found in the SLC4 FTS: large error messages crashed the server; this was fixed on Wednesday 8 October. More serious is a bug whereby detailed error messages are no longer returned to the FTS API; apparently this has been known for some time. A fix is now in test, but a Post-Mortem on this is suggested. Meanwhile ATLAS and CMS transferred about 20 TB each last week.

 

gLite 3.1 Update 33 contained a BDII configuration bug whereby sites not using YAIM had a missing file system root variable, causing an incorrect chown; this bug took down the GRIF site. A new version is being certified to go to PPS this week. Flickering of resources in the information system was also seen at several sites, so the meta-rpm was withdrawn on Friday.

3.2      Experiments Reports

 

ALICE

Have been testing SLC4 FTS and CREAM-CE. Prepared a site requirements document for CREAM-CE deployment.

 

LHCb

All LFC entries are now converted from srmv1 to srmv2 endpoints, allowing sites to shut down the srmv1 services.

 

Over the weekend of 11/12 October LHCb suffered an overload of the CERN CASTOR service after submitting 180000 pre-stage requests, which hit a server that was being drained for hardware replacement. A Post-Mortem is in preparation.

 

The long-running problem of LHCb occupying fewer LSF batch slots at CERN than its share is finally understood: another major experiment was requesting unneeded memory/swap resources, which caused LSF not to schedule jobs on otherwise free slots.

 

CMS

The magnet was ramped up to 3.8 Tesla over the weekend of 11/12 October, taking cosmics in a run scheduled to last until 27 October (the CRAFT1 run).

 

Silent file corruption was discovered at CERN on one disk server in CASTORCMS/CMSCAF. FIO have started to draft a post-mortem for this incident: https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008

-       For the affected files the checksum in DBS does not match the file retrieved from CASTOR, and ROOT cannot open the file.

-       Traced to a defective disk on a fileserver that was not clearly reporting errors.

-       Using the checksums provided by CMS, FIO found that 29 files had a corrupted copy on tape. All 29 files could be repaired because the original disk copy was still available (see the verification sketch after this list).

-       Another 2 files, for which they did not have the CMS checksum, could also be repaired by finding their original copy using the CASTOR log files.

-       4 user files are unrepairable because they were originally created in a Disk0Tape1 service class (DEFAULT) and then recreated in CMSCAF, which is Disk1Tape0. This is a forbidden transition but unfortunately a CASTOR bug allows it to happen (a fix will be deployed on 14/10). The original tape copies are still available and the users will be informed.
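
As an illustration of the kind of consistency check described in the list above, the following is a minimal sketch, not the actual FIO/CMS tooling, that recomputes a file's checksum and compares it with the value recorded in a catalogue such as DBS. The file paths, the catalogue dictionary and the choice of Adler-32 are assumptions made for the example.

import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 is seeded with 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

# Hypothetical catalogue entries: file path -> expected Adler-32 checksum.
catalogue = {
    "/data/cmscaf/file1.root": 0x1A2B3C4D,
    "/data/cmscaf/file2.root": 0x99887766,
}

for path, expected in catalogue.items():
    try:
        actual = adler32_of_file(path)
    except OSError as err:
        print(f"{path}: unreadable ({err})")
        continue
    status = "OK" if actual == expected else "CORRUPTED copy"
    print(f"{path}: catalogue={expected:08x} recomputed={actual:08x} -> {status}")

A real check would use whichever checksum algorithm is actually stored in DBS and would compare both the disk and the tape copies.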

 

ATLAS

Extensive PreStageTests are ongoing; see also https://twiki.cern.ch/twiki/bin/view/Atlas/PreStageTests

-       Pre-staging is functional at all CASTOR sites: CNAF, RAL, CERN, ASGC

-        dCache sites are having more problems but BNL, TRIUMF, NDGF should be OK

-        Realistic reprocessing rates from tape look hard to achieve

-        Phase II will start at T1/T2 clouds which can pre-stage

 

The PANDA instance at CERN (the high-level workload and pilot jobs manager) is planned to become the sole one by the end of 2008.

-       Will need existing UI to be expanded from 1 to 3 load balanced servers

-       Existing core of 3 mysql servers + 3 hot spares to be migrated to Oracle

-       Other servers to be hardened

-       Main developer will be seconded to CERN from BNL from January

 

3.3      Conclusions

There are too many software upgrades, and this does not look like slowing down.

There were also many miscellaneous failures, which will continue.

 

Maintaining good-quality LCG services over the next months is going to require constant vigilance and will be labour-intensive.

 

I.Bird noted that a summary of the main achievements and issues of the LCG Services should be presented at the Overview Board.

 

 

4.   Requirements and Pledges in New CPU Units (Slides) - I.Bird

I.Bird proposed the mandate for the Benchmarking Group.

 

From Slide 1

We have agreed that we will change the units we use to express requirements, pledges, installed and accounted CPU capacity from SI2K to SPEC2006-C++

 

The group needs to:

-       Agree and publish the recipe and the conditions under which the benchmark is run (should come from the HEPiX group/H.Meinhard)

-       Agree on conversion factors to convert the experiment requirements and site pledges tables from SI2K units to the new units.

-       Requirements and pledges (so that we do not introduce a change in requirement or cost) – this is really a change in baseline of existing requirements and pledges.

-       Experiment codes as they are used to define the requirements

 

 

The transition needs to be prepared carefully and the units need to be clarified.

The costs need to be equivalent to the current ones but in the new units.

 

The Experiments need to provide their conversion factor.
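
As a minimal illustration of the rebaselining discussed above, and not an agreed WLCG procedure, the sketch below applies a per-experiment conversion factor to turn a requirements table expressed in kSI2K into the new SPEC2006-C++ based unit. The factor values and table entries are invented for the example.

# Hypothetical conversion factors (new unit per kSI2K), one per experiment,
# since each experiment is expected to derive its own factor from its codes.
conversion_factors = {
    "ALICE": 4.0,
    "ATLAS": 4.0,
    "CMS": 4.0,
    "LHCb": 4.0,
}

# Hypothetical 2009 CPU requirements in kSI2K.
requirements_ksi2k = {
    "ALICE": 12000,
    "ATLAS": 30000,
    "CMS": 25000,
    "LHCb": 8000,
}

def rebaseline(requirements, factors):
    """Convert a requirements table to the new unit without changing its meaning."""
    return {exp: value * factors[exp] for exp, value in requirements.items()}

if __name__ == "__main__":
    for exp, new_value in rebaseline(requirements_ksi2k, conversion_factors).items():
        print(f"{exp}: {requirements_ksi2k[exp]} kSI2K -> {new_value:.0f} (new unit)")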

 

Ph.Charpentier added that the conversion factor needs to be calculated for the platforms used and on all types of machines provided at the Sites.

 

M.Kasemann noted that some sites currently take the values from the vendors and others really run the benchmarks. In the future the same approach should be used by all sites so that the Experiments do not have these uncertainties.

J.Gordon replied that this benchmark is not available from the vendors and the Sites ought to run the benchmarks.

 

D.Barberis also asked that the sites run the benchmarks.

I.Bird noted that if a site runs some benchmarks they could be shared on a web page. For exactly the same types of machine there is no need to benchmark them again.

 

J.Gordon asked that an executive summary be prepared for Tier-1 and Tier-2 sites, explaining why the benchmarks were chosen and why the Sites need to use them.

 

G.Merino stated that the above mandate is clear and that the group is just being created; participants are being contacted. Concluding by November seems unrealistic. At the RRB one can say that agreement on the group and its mandate has been reached; the results will be ready for the following RRB meeting.

 

5.   Accounting Reports – I.Bird

 

Over the summer there was a request to improve the Tier-2 accounting reports. Below is the proposal (from Slide 2).

 

Requests to update the Tier 2 reports to be more in line with the Tier 1 reports:

-       To include the wall clock time and cpu/wall ratio

-       To add the installed and pledged capacities

-       To include storage data (pledged, installed, used?)

-       To include incremental use during the year for Tier 2s

 

Note: there are 2 parts to the reporting

-       The report generated from APEL, which provides the data on accounted CPU

-       The formal report which takes the data from APEL and applies the knowledge of pledges and installed capacity

So:

-        Ask the APEL report for Tier 2 to include the wall time

-       New info providers that automatically collect installed capacities for Tier 2s and collect storage data (installed and used) for all (J.Gordon + F.Donno + work from WN working group)

-       Update the final reports to include all requested changes

 

 

J.Gordon noted that adding wall clock time will double the number of lines in the reports, because there are already too many columns.

I.Bird replied that the important thing is that a single report includes all values, giving an overview that also covers wall clock time and the CPU/wall clock ratio.
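
As a small illustration of the extra columns being requested (wall clock time and the CPU/wall clock ratio), the sketch below derives them from a few hypothetical per-site accounting records; the field names and numbers are invented and do not correspond to the actual APEL schema.

# Hypothetical accounting records: (site, cpu_hours, wall_hours).
records = [
    ("Site-A", 8200.0, 10000.0),
    ("Site-B", 4500.0, 9000.0),
]

print(f"{'Site':10} {'CPU (h)':>10} {'Wall (h)':>10} {'CPU/Wall':>9}")
for site, cpu, wall in records:
    efficiency = cpu / wall if wall else 0.0
    print(f"{site:10} {cpu:>10.0f} {wall:>10.0f} {efficiency:>9.2f}")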

 

D.Barberis asked that an export to Excel be provided.

I.Bird replied that an export in CSV is already available and one should check whether it is sufficient.

 

J.Gordon concluded that he will ask CESGA to propose some suitable format.

 

6.   Middleware Planning (Slides) – O.Keeble

 

After the pre-GDB meeting the plan was modified and presented to the GDB in October.

 

FTS/SL4

-       Status - A problem has been found in the recent SL4 release (undeployed) so a fixing iteration will be necessary.

-       Integration will set up an SL5 build to get an idea of its potential, but its deployment is not currently planned.

-       Target - full deployment of 2.1 on SL4

 

CREAM

-       Here we should be more aggressive:

-       LCG-CE inherently problematic for analysis

-       If the use case is direct submission with no proxy renewal, CREAM is basically ready

-       Proxy renewal should be fixed in the simplest possible way (reproduce the lcg-CE solution, suitable for different users)

-       WMS submission will come with ICE, timescale months

-       Target – maximum availability in parallel with lcg-CE

 

F.Hernandez asked when CREAM could/will be deployed.

O.Keeble replied that CREAM is already deployed at some sites and is suitable when there is no need for proxy renewal and the submission goes directly to the CE (e.g. for ALICE). For the solution with proxy renewal, changes are needed in the WMS.

Ph.Charpentier added that LHCb is also interested in direct submission, which will be added to DIRAC.

 

M.Lamanna asked the status of submission to CREAM via CondorG.

F.Giacomini replied that they are in contact with the Condor team, who agreed to implement it; no timescale is known. Currently only the CREAM client exists, or one has to write one's own client.

 

WN/SL5

-       Status - FIO now has a first installation at CERN in 32 and 64 bits, which will be tested by the experiments.

-       Target – have SL5 available on the infrastructure in parallel to SL4

-       We should also continue to pursue the python2.5 and alternative compiler stuff, but this can be added subsequently.

 

 Multiple parallel versions of middleware available on the WN

-       Status - at the moment it is not easy to install or use multiple parallel versions of the middleware at a site. While the multi middleware versions and multi compiler support are not disruptive, they require some changes on the packaging side and a small adaptation on the user side.

-       Target - it seems advisable to introduce this relatively shortly after the bare bone WN on SL5.

 

glexec/SCAS

-       Target - enabling of multi-user pilot jobs via glexec. This could conceivably be via another means than SCAS, but this would have to be decided asap. Patch #2511 has arrived

Glue2

-       Status - Glue2 is awaiting final validation at OGF, expected in November.

-       Target - we should try to get the new schema deployed to the BDIIs so we can iron out initial deployment glitches, leaving us with a working but unpopulated Glue2 info system in parallel to 1.3. Info providers could subsequently be upgraded gradually, as could clients.

 

CE publishing

-       Status - A set of changes to rationalise publishing of heterogeneous computing resources is envisaged. A full roadmap will be published by Steve Traylen this week. The first phase will be the deployment of the new tools, enabling simply the current situation. Subsequent phases then take advantage of the new tools.

-       Target - the first phase as described above.

 

WMS

-       Status: Patched WMS ( fixing bug #39641 &  bug #32345) within 1 week

-       Target: This patch should be deployed

-       ICE to submit to CREAM: Not required for certification. ICE will be added in a subsequent update (but better before Feb. 2009)

 

gridftp2 patches

-       These are being back ported to VDT1.6. Important for dCache and FTS

 

Publishing of detailed service versions

-       Several small improved information providers are in certification

-       More could be added. Not very invasive, but potentially useful

 

Decision:

The MB agreed on the proposed plan.

 

7.   ATLAS QR Report (Slides) – D.Barberis

 

7.1      Organization News

 

The ATLAS Collaboration Board met last Friday and took the following decisions (among others):

-       Dario Barberis was re-appointed as Computing Coordinator from March 2009 until February 2010

-       David Quarrie was re-appointed as Software Project Leader from March 2009 until February 2010

-       Kors Bos was elected Deputy Computing Coordinator from March 2009 until February 2010, and will become Computing Coordinator from March 2010 until February 2011

-       Hans von der Schmitt’s term in office as Database Coordinator ended on 30 Sept after 2.5 years.
He is replaced by:

-       Giovanna Lehmann Miotto (CERN) for online databases

-       Elizabeth Gallas (Oxford) for offline databases

7.2      Tier-0 and Data-taking Activities

 

ATLAS has been taking cosmic ray data continuously for several months and will continue until 3rd November, with only short breaks for detector work (and LHC data!). The Tier-0 is coping well with nominal data rates and processing tasks.

 

A few CASTOR glitches occurred; they are usually sorted out with the CASTOR team within a very reasonable time.

 

In November hardware detector commissioning work will restart but cosmic data-taking will carry on with partial read-out.

 

Below is the data rate from the ATLAS online to the offline systems.

 

 

The tape queue can handle the amount of data and copy it to tape.

7.3      Data Export

ATLAS exports all raw and processed data from the Tier-0 to Tier-1s and Tier-2s according to the computing model. The system can sustain the peak rate of 1.2 GB/s for an indefinite time.

Data distribution patterns are periodically revised as data types (triggers) and processing needs change.

 

Data export summary from CERN 7-13 October (MB/s)

 

For instance, on 9-10 October the 1.2 GB/s rate was consistently maintained.

 

7.4      Pre-Staging Tests

During the summer ATLAS started pre-staging tests at all Tier-1s, recalling whole datasets at a time (up to 10 TB).

Performance varies a lot as tape back-ends are different at each site. After a few tries, most sites are mostly working even if there are outstanding (different) problems at PIC, FZK and SARA.

 

This exercise also showed that the number of available tape drives varies a lot from site to site. There is no point in having 1000s of processing cores if they cannot be fed at the correct rate with data.

 

For instance, CERN and IN2P3 could store all the datasets required, although with different patterns.

 

 

 

At other sites not all datasets were correctly distributed and the issue is being studied.

7.5      Database Access Issues

Early tests of database scalability did not indicate there would be any problem with reprocessing at Tier-1s.

 

More recent tests instead showed a serious limitation when more than a few tens (up to 100) of jobs start simultaneously, as they all access conditions data from Oracle databases. Two factors differed between these tests: (1) Oracle Streams are now used to move data from CERN to Tier-1s and (2) DCS (Detector Control System) data are now accessed by reconstruction tasks.

 

ATLAS started a task force, together with the ATLAS and CERN DBAs, to analyse data access patterns from the Oracle server side. It also includes work to instrument Athena to log database accesses and data volumes, and work by detector code developers to revise and optimise their database access patterns.

 

ATLAS started to explore SQLite technology for reprocessing tasks: dump all conditions data for a given run to an SQLite file and use it locally for all jobs. This reduces database accesses by a factor of several hundred (the number of files in a run).
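
A minimal sketch of the idea, not the actual ATLAS/Athena implementation: the conditions data for one run are fetched once from the central database, written to a local SQLite file, and every job of that run then reads the local copy instead of Oracle. The table layout, condition names and the fetch function are hypothetical.

import sqlite3

def fetch_conditions_from_oracle(run_number):
    """Placeholder for a single query against the central Oracle conditions DB."""
    # In reality this would use the experiment's conditions database API.
    return [(run_number, "pixel_hv", 150.0), (run_number, "toroid_current", 20500.0)]

def dump_run_conditions(run_number, path):
    """Dump all conditions for one run into a local SQLite file (done once per run)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conditions (run INTEGER, name TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO conditions VALUES (?, ?, ?)",
        fetch_conditions_from_oracle(run_number),
    )
    conn.commit()
    conn.close()

def read_condition(path, name):
    """Each reprocessing job reads the local file instead of contacting Oracle."""
    conn = sqlite3.connect(path)
    row = conn.execute(
        "SELECT value FROM conditions WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row[0] if row else None

if __name__ == "__main__":
    dump_run_conditions(90210, "run90210_conditions.db")
    print(read_condition("run90210_conditions.db", "toroid_current"))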

 

Oracle will still be needed at the sites for other activities.

7.6      ATLAS Disk Usage

Below are the disk pledges (dotted line), the installed capacity (green line) and the usage by the VOs at the Sites (ATLAS in light blue).

 

 

7.7      Plans

ATLAS has several upcoming software releases:

-       14.X.Y - will include bug fixes only, for HLT/Tier-0 and Grid operations

-       15.0.0 - February 2009. It will include feedback from 2008 cosmic running and will be the base release for 2009 operations.

 

The cosmic runs with the complete detector will continue until early November 2008 and restart in late March 2009, while runs with partial read-out will continue throughout at reduced rates.

For collision data, ATLAS will be ready to go from April 2009 as far as ATLAS Software & Computing is concerned.

 

T.Cass asked whether the ATLAS software will be certified on SL5.

D.Barberis confirmed that, because the sites will be migrating, the ATLAS software must run on both SL4 and SL5, in both cases in 32-bit mode; therefore the WNs must have the 32-bit compatibility libraries installed.

 

I.Bird asked whether the pre-staging issues are due to a lack of tape drives or whether the throughput is really insufficient.

D.Barberis replied that it is probably a combination of both problems, and that both disk-to-tape and tape-to-disk rates should be studied.

 

Ph.Charpentier asked how the staging will be scheduled (progressively, or requesting as much as possible at once).

D.Barberis replied that CASTOR sites prefer to receive as many requests as possible in order to optimize the pre-staging, while dCache has a different approach and requests must be scheduled over time.
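
To illustrate the two submission strategies mentioned here, the sketch below contrasts handing all pre-stage requests for a dataset to the mass storage system at once (the CASTOR-friendly approach) with submitting them in throttled batches over time (closer to what dCache sites need). The bring_online() helper is a hypothetical stand-in for a real SRM client call.

import time

def bring_online(file_list):
    """Hypothetical stand-in for an SRM bring-online call on a list of files."""
    print(f"requested staging of {len(file_list)} files")

def stage_all_at_once(files):
    """Hand the whole dataset to the mass storage system and let it optimise tape mounts."""
    bring_online(files)

def stage_in_batches(files, batch_size=500, delay_seconds=600):
    """Submit requests in throttled batches, spreading the load over time."""
    for start in range(0, len(files), batch_size):
        bring_online(files[start:start + batch_size])
        if start + batch_size < len(files):
            time.sleep(delay_seconds)

if __name__ == "__main__":
    dataset = [f"/atlas/raw/run90210/file{i:05d}.root" for i in range(2000)]
    stage_all_at_once(dataset)          # bulk strategy
    stage_in_batches(dataset, 500, 0)   # throttled strategy (no real wait in this demo)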

 

G.Merino asked how ATLAS will solve the issue of the low pre-stage rates, and whether ATLAS will repeat these tests.

D.Barberis replied that tests are ongoing and they will try with more coordination among activities (data taking, analysis, etc.). They will define the achievable rate and discuss it with each of the Sites in order to be ready in 2009.

 

 

J.Templon added that a dataflow diagram like the one LHCb provided is the best description for the Sites.

J.Gordon agreed that, with the many activities of RAW data storage, pre-staging and reading, the Sites need the overall rates required by the Experiments.

 

New Action:

ATLAS should provide the overall rates expected at each Site. (Same for the other VO?)

 

 

8.   AOB

 

 

 

 

9.    Summary of New Actions

 

 

ATLAS should provide the overall rates expected at each Site. (Same for the other VO?)