LCG Management Board


Tuesday 8 January 16:00-18:00 – F2F Meeting




(Version 1 - 7.12.2007)


A.Aimar (notes), D.Barberis, T.Bell, I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, M.Lamanna, E.Laure, S.Lin, U.Marconi, H.Marten, P.Mato, G.Merino, R.Pordes, R.Quick, M.Schulz, Y.Schutz, J.Shiers

Action List

Mailing List Archive:

Next Meeting:

Tuesday 15 January 2008 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)


1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Site Reliability Data - Dec 2007 (SR_Summary_200712) – A.Aimar


December 2007

(Table of site reliability results; see the attached SR_Summary_200712.)
The December 2007 SAM results are available and the sites should report on the OPS results (column 1 in the table above).


New Action:

10 Jan 2008 - A.Aimar will ask for the Site Reports for December 2007.


The VO-specific SAM results do not seem reliable; therefore the VO-specific tests need to be reviewed, as well as how GridView calculates the reliability from them.


I.Bird noted that a new version of GridView (being released) will better take into account the ALICE configuration. The results should then be correct for ALICE.


A.Aimar added that A.Sciabá is investigating the CMS results. It seems that some dummy critical tests were not properly executed or were removed.


Ph.Charpentier added that for LHCb some tests stopped working on 21 December and needed to be restarted. One should not rely on the results of these tests for now.


New Action:

Experiments should review their specific SAM tests and see whether the GridView summary is correct.


2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 30 Nov 2007 - The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.

Considered completed with the information received so far.

Received information from TW-ASGC (Min Tsai, Aries Hung), FR-CCIN2P3 (Jerome Bernier), IT-INFN (Stefano Zani), RAL (Robin Tasker), DE-KIT (Bruno Hoeft), PIC (G. Merino), NL-T1: Jurriaan Saathof (SARA), Hanno Pet (SARA), Pieter de Boer (SARA), David Groep (NIKHEF)


New Action:

A.Aimar will communicate these names to D.Foster and ask him which are the Tier-1 sites that are not properly represented in order to address them directly.

  • 11 Dec 2007 - L.Dell’Agnello, F.Hernandez and G.Merino prepare a questionnaire or a check list for the Experiments in order to collect the Experiments requirements in a form suitable for the Sites.

Done. The follow-up from the Experiments is for the CCRC08 F2F Meeting on Thursday.

  • 18 Dec 2007 - Experiments should nominate who is responsible for the benchmarking of their applications on the machines made available by the HEPiX Benchmarking Working Group.

Done for ALICE, ATLAS and CMS. Remains to be done for LHCb.

ALICE: Peter Hristov

ATLAS: Alessandro De Salvo and Franco Brasolin for technical help.

CMS: Gabriele Benelli

  • 18 Dec 2007 - LHCb should nominate who is responsible for the benchmarking of their applications on the machines made available by the HEPiX Benchmarking Working Group.

LHCb: Asked for more information before naming a person responsible for the benchmarking.

  • 8 Jan 2008 - Ph.Charpentier agreed to distribute the 2008 LHCb needs for each site to the MB mailing list and to the LHCb national representatives.

Done at the CCRC Meeting on the following Thursday.

  • 8 Jan 2008 - ALICE and ATLAS will present at the MB F2F in January how they intend to solve the issues caused by tiny files.

Done in this F2F Meeting.

  • 8 Jan 2008 - K.Bos had prepared a document describing the ATLAS requirements for the CCRC. J.Shiers will distribute it to the CCRC list.

Done before Christmas. It will be presented to the CCRC on the following Thursday.


3.    SRM 2.2 Weekly Update (Agenda) - J.Shiers


The first SRM meeting of 2008 took place. Here is the link to the Agenda.


Site upgrades: should all have been done by the end of 2007. All sites are on SRM 2.2 in production. No recent news from PIC and TRIUMF.


G.Merino reported that at the end of December PIC upgraded to dCache 1.8 (i.e. SRM 2.2) in production, but did not enable space management.

J.Shiers replied that the configuration of the sites is to be done during January.


Client Tools: No new bugs have been reported. All fixes are being certified. 

Experiments Testing: Minor issues, nothing worth reporting.


SRM v2.2 and space tokens: In November it was agreed not to proceed with any development. M.Litmaath and F.Donno analysed the situation and circulated a paper clarifying the deployment issues in detail. It will be presented at the GDB on the following day.


4.    Update on CCRC-08 Planning (Slides; Agenda) - J.Shiers


 There will be a CCRC presentation at the GDB and a full day CCRC F2F Meeting on Thursday.


The main goals for the F2F Meeting on Thursday are:

-       Confirmation of dates for Feb & May challenges

-       Baseline versions of middleware and services used in challenge

-       Experiment requirements on the sites, based on feedback and requests from the sites

-       Monitoring, logging & reporting strategies 

-       SRM v2.2 production deployment & usage status

-       “The Metric” to use to evaluate the progress of the challenges.

-       Weekly/daily WLCG “Service Coordination” (aka run) meetings


5.    OSG Resource and Service Validation (RSV) publishing to WLCG SAM (Slides) – R.Pordes, R.Quick




OSG sites will publish the results of reliability tests to the WLCG SAM repository through:

-       Running RSV probes.

-       Collecting the information at the Grid Operations Center RSV-Gratia repository.

-       Publishing selected information from this repository to the WLCG SAM repository.


This information will be used by WLCG to generate site availability plots for OSG resources.

The probes developed and the transport mechanism follow the specifications of the Grid Monitoring Working Group (thanks to J.Casey and P.Nyczyk).


Status and Schedule

RSV V1.0 is available as part of OSG 0.8.0, released Nov 1st 2007. It provides mechanisms for scheduling, status probes, and collection. The focus was mostly on putting in place an infrastructure that can easily be expanded later.


~18 (out of 70) OSG sites are now reporting to the central RSV repository; this can easily be expanded to more sites.

Discussions on RSV deployment and experience with the OSG Tier-2 sites will take place in the next couple of weeks.


RSV V2.0 is targeted for the release of OSG 1.0, by the end of February.



Slides 4 and 5 show the kind of display that will be available to the local RSV administrators, and the schematic architecture of the system.






Publishing to WLCG SAM

Several months of OSG Integration Test Bed data are now transported to WLCG SAM and analyzed.

The records from the OSG Production Resources are now being published in SAM. Only selected test results (not all) are sent, as needed.


The test set comparison between OSG and EGEE was done by David Collados (and verified by Jeff Templon?).

This is the URL where the evaluation is described


SAM will display the OSG Resource Availability based on the OSG tests. The display in SAM is up to the SAM developers. OSG is very interested in the end-to-end validation.


Probes Currently Available and Under Development

In RSV V1.0 these are the tests available:

-       Gram Authentication - Critical

-       OSG Version - Critical

-       Job Managers Available - Critical

-       Job Manager Status - Critical

-       GridFTP - Critical

-       CACert CRL Expiry - Critical

-       OSG Directories and Permissions - Critical

-       Certificate Expiry Host - Non Critical

-       Certificate Expiry Services - Non Critical


RSV V2.0 will also add verifications on:

-       Proxy handling

-       Ease of configuration

-       Extending Probe Set

-       Include SE Probes

-       Web Interface to data

-       Probe software Cache

-       Enhanced Documentation


I.Bird asked whether there are critical tests that are not in V1.0, whether RSV 2.0 will include all those that are necessary, and when all the necessary tests will be published to SAM.

R.Quick replied that the publishing depends on the SAM development; the current OSG results are published, but some tests are still missing in V1.0.


J.Gordon noted that the SE tests are important and are only planned for V2.0; until then, therefore, the OSG tests are not sufficient.

R.Quick agreed that V2.0 is the version that will have the tests needed, while V1.0 is still incomplete from that point of view. The SE probes could be released before V2.0.


J.Gordon asked whether there are tests of the SRM endpoints.

R.Quick replied that probes for SRM are probably already available at BNL and from other groups. The SRM tests should also be produced before V2.0.


J.Gordon also asked about the Information Service tests.

R.Quick replied that the Information Services (BDII and the OSG Info System) tests are not available but will be implemented before V2.0.


R.Pordes asked who is the contact for the OSG SAM tests and the Availability calculations.

I.Bird replied that the calculations are made by the GridView team, but the SAM contact person for OSG should be Piotr Nyczyk.


I.Bird asked how VO-Specific tests can be implemented for the OSG testing framework.

R.Quick replied that the OSG tests could be run by a different VO than OPS. Also VO-specific probes can be added and executed.

R.Pordes added that the VO-specific probes are currently the responsibility of the VOs and are run by them.


R.Pordes asked that the availability based on the current probes, although incomplete, be published via GridView and re-discussed in a couple of weeks.


6.    Storage Efficiency: Tape Efficiency (Slides) - T.Bell


T.Bell presented a summary of the inefficiency issues encountered at CERN in the usage of tape storage.


The issues are causing real problems, such as:

-       User complaints - Caused by long stage-in times during challenges, sometimes resulting in the data on tape being unavailable.

-       Low batch efficiency - Caused by long queues of jobs waiting for tape data staging, and by CPU-bound jobs idling while waiting for tape data to be read.

-       High failure rate of robotics - This kind of usage of drives and robot arms requires more maintenance, resulting in tapes being disabled and robots needing repair.


In order to analyse the situation, tape usage data was collected during Nov/Dec 2007, recording (1) the distribution of file sizes on tape and (2) the number of tape mounts and their performance, while (3) limiting the observation to production tapes only (no user tapes).


The main causes identified were:

-       Small sizes of the files written to tape

-       Repeated mounting of the same tape

6.1      File Size vs. Tape Performance

Slide 4 shows the curve of the file sizes vs. the tape performance.

-       Tape drives need to stream at high speeds to achieve reasonable performance.

-       Per-file overheads from tape marks lead to low data rates for small files

-       LHC tape infrastructure sizing was based on 1-2 GB files, and that is the size to aim at.
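The per-file overhead effect can be illustrated with a toy model: effective throughput is the file size divided by the streaming time plus a fixed per-file cost for tape marks. The 100 MB/s streaming rate and 5 s overhead below are illustrative assumptions, not measured CERN figures.

```python
def effective_rate_mb_s(file_size_mb, stream_rate_mb_s=100.0, per_file_overhead_s=5.0):
    """Effective tape throughput for a given file size.

    Each file pays a fixed overhead (tape marks, etc.) on top of the
    time needed to stream its data; the numbers are illustrative.
    """
    streaming_time_s = file_size_mb / stream_rate_mb_s
    return file_size_mb / (streaming_time_s + per_file_overhead_s)

# Small files are dominated by the per-file overhead:
# a 50 MB file achieves ~9 MB/s, a 2000 MB file 80 MB/s.
```

With these assumptions a 50 MB file reaches under 10% of the drive's streaming rate, which is why 1-2 GB files are the size to aim at.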




Currently the average file sizes per experiment, as shown on the slide, are: 200 MB, 150 MB, 2200 MB and 200 MB.


6.2      Repeated Mounting – READING and WRITING

As Slide 5 shows the same tape is used many times for reading and writing:

-       Tapes are being repeatedly mounted/unmounted.

-       On average 5-20 times a day, with a few tapes mounted up to 200 times.

-       It takes around 4 minutes to mount a tape, compared to 100 minutes to write a complete tape.

-       Increases wear/tear on robots and drives along with risk of tape media issues.




6.3      Repeated Mounting – WRITING

The write migration to tape is currently triggered by Castor based on the modification date of the file (the typical setting is 30 minutes).

This policy was chosen in order to write files to tape quickly, but it also leads to inefficient mounts for short periods.





The write operations are not under direct user control, but the usage pattern influences the efficiency.

One needs to move to a migration triggered by the volume of data to write (one full 700 GB tape), combined with a maximum delay (8 hours).

For CDR, at 100 MB/s, the expected time would be about 2 hours to start the migration and 2 hours to complete writing the tape.
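The proposed volume-or-delay trigger can be sketched as follows (the 700 GB tape capacity and 8-hour maximum delay come from the proposal above; the function and its inputs are hypothetical, not a Castor API):

```python
TAPE_CAPACITY_GB = 700   # one full tape's worth of data
MAX_DELAY_H = 8          # maximum time a file may wait before migration

def should_start_migration(pending_data_gb, oldest_file_age_h):
    """Trigger migration when a full tape's worth of data is pending,
    or when the oldest un-migrated file has waited too long."""
    return pending_data_gb >= TAPE_CAPACITY_GB or oldest_file_age_h >= MAX_DELAY_H

# At a 100 MB/s CDR rate, accumulating one 700 GB tape takes about 2 hours:
hours_to_fill = 700 * 1000 / 100 / 3600   # ~1.94 h
```

The delay cap bounds the time any file spends only on disk, while the volume threshold ensures that each mount writes a full tape.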


6.4      Repeated Mounting – READING

Very limited pre-staging of data means that tapes are being re-mounted for each file; small files make the situation worse.

The queuing overhead to get to a drive further increases the batch job inefficiency and reduces job performance.



ALICE is doing particularly well here because, even though they write several small files, they read back only their large files.


D.Barberis asked how the ATLAS issues can be characterized: are directories of data stored on the same tape or scattered over several tapes?

T.Cass replied that the logical grouping in CASTOR is "by pool", not by directory. Reading back a directory can therefore involve mounting several tapes if, when stored, the files were small and were written across several free tapes. Small files should all be grouped into one file if they are going to be retrieved together.


M.Lamanna noted that in some cases the same group of data can be stored on up to 20 tapes, and reading it back requires mounting all 20 again.

T.Cass replied that if the data set were one big file instead of 20 files, the efficiency would increase considerably. This is one more reason to group small files into bigger ones.


R.Pordes asked where these policies are discussed across the Sites.

T.Cass replied that this is only CERN policy on how to store data at CERN. Not a SRM development issue. Other Sites may use different batch systems, mass storage, etc and for those Sites other policies may be more adequate.

I.Bird added that in the future the GDB is the right forum for sharing experiences across sites.

6.5      Total Performance on Tapes

The total performance shown below is based on the “sum of data transferred against the total time spent on drives” (including mount/unmount time).

This is the metric proposed for the Tier-0 Site Performance.


The performance target should be about 50%, but currently it is around or below 15%.
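The proposed metric can be sketched as the achieved transfer rate over the drive's nominal streaming rate; the 100 MB/s nominal rate and the example numbers below are illustrative assumptions:

```python
def drive_efficiency(data_transferred_gb, drive_time_h, nominal_rate_mb_s=100.0):
    """Fraction of the drive's nominal capability actually used:
    achieved rate (data moved over total drive occupancy, including
    mount/unmount time) divided by the nominal streaming rate."""
    achieved_mb_s = data_transferred_gb * 1000 / (drive_time_h * 3600)
    return achieved_mb_s / nominal_rate_mb_s

# e.g. 540 GB moved in 10 hours of drive time at a nominal 100 MB/s
# gives an achieved rate of 15 MB/s, i.e. 15% efficiency.
```

Because mount, seek and idle time all count in the denominator, repeated mounts for small requests drag the figure well below the 50% target.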



M.Kasemann noted that since CMS is using big files, the problems for CMS must come from the repeated mounting of tapes.

6.6      Proposal and Next Steps

The proposal to improve the above performance is the following:


Experiments should

-       Move to 2GB files for tape transfers

-       Ensure that pre-staging is standard for all applications.


Castor Operations will change policies for CCRC

-       Write policy of at least one tape of data, with an 8-hour maximum delay

-       Limit read mounts: mount a tape only when at least 10 GB or 10 files are requested from it, or when a request is 8 hours old


Monitor February CCRC performance and cover shortfall with

-       Major drive purchases and dedication for experiments

-       Fixed budget implies reduction in CPU/disk capacity
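The proposed read-mount limits can be sketched as a simple predicate over the queued requests for a tape (the 10 GB, 10 file and 8-hour thresholds are from the proposal; the function and its inputs are hypothetical):

```python
MIN_REQUESTED_GB = 10    # minimum data volume queued for a tape
MIN_REQUESTED_FILES = 10 # minimum number of files queued for a tape
MAX_WAIT_H = 8           # oldest request must not wait longer than this

def should_mount_for_read(requested_gb, requested_files, oldest_request_age_h):
    """Mount a tape for reading only when enough work has queued up
    for it, or when a request has waited the maximum allowed time."""
    return (requested_gb >= MIN_REQUESTED_GB
            or requested_files >= MIN_REQUESTED_FILES
            or oldest_request_age_h >= MAX_WAIT_H)
```

The thresholds amortize the ~4-minute mount cost over many files, while the age cap keeps latency bounded for isolated requests.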


Ph.Charpentier noted that these issues can be solved in several ways and should be discussed more in depth between CERN and those responsible for production in the Experiments.

T.Bell and T.Cass replied that FIO is ready to present the situation and to discuss this proposal with any group in the Experiments where this is considered necessary. Different policies could indeed be applied on different pools in order to adequately support the tape usage patterns.


I.Bird proposed that all Tier-1 sites also collect similar data in order to identify similar or new issues with the tape storage.



The MB agreed that, while further investigations and discussions continue at the GDB and with the Experiments, the proposal of an 8-hour maximum delay before storing on tape can be implemented.


New Action:

9 Jan 2008 - T.Bell will distribute to the MB a set of metrics that could be also measured by all Tier-1 Sites in order to describe in a uniform way the performance of the tape storage systems.


New Action:

13 Jan 2008 – Tier-1 Sites should report on whether they can collect the metrics on tape performance proposed by the Tier-0 (T.Bell).


7.    Storage Efficiency: Small Files – ATLAS (Slides) – K.Bos


K.Bos described the way ATLAS is using the tape currently and the changes that were proposed during the ATLAS week.

7.1      Small Files in ATLAS

ATLAS wants to work with files of about 5 GB. Not smaller than 1 GB and not bigger than 10 GB.

The current average file size is now about only 50 MB; 100 times smaller, which means 100 times more files to store.


Bigger files are better for transport and for storage; the effort must therefore be to create bigger ATLAS files.

7.2      Types of Files Generated

EVGEN Files - Many thousands of events generated. Files of about 100 MB. They are needed everywhere where simulation is done and are distributed to all Tier-1 sites.


EVGEN → HIT Files – HIT files are generated by Geant 4. Every event is 2 MB and a HIT file currently contains 50 events; a HIT file is therefore 100 MB.


JumboHIT Files - The HIT files should be grouped into bigger files, merging 50 HIT files into one JumboHIT file that contains 2500 events. JumboHIT files should be uploaded to the Tier-1 sites, since reconstruction runs there. The JumboHIT solution is planned but not yet implemented.


ESD + AOD Files - ATLAS also plan to merge the DIGI and RECO processes in order to produce ESD and AOD directly:

-       ESD = 1 MB/event, 2500 events, File size 2.5 GB

-       AOD = 0.1 MB/event, 2500 events, File size 250 MB
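The sizes above follow directly from the per-event sizes and the 2500-event grouping; a quick consistency check:

```python
EVENTS = 2500                  # 50 HIT files x 50 events each

jumbo_hit_mb = 2 * EVENTS      # 2 MB/event   -> 5000 MB = 5 GB JumboHIT
esd_mb = 1.0 * EVENTS          # 1 MB/event   -> 2500 MB = 2.5 GB ESD
aod_mb = 0.1 * EVENTS          # 0.1 MB/event -> 250 MB AOD
```

All three merged files land comfortably in the 1-10 GB range ATLAS targets, except the AOD, which is why a further JumboAOD grouping is considered below.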

7.3      Local Disk Limitation

ATLAS plans to use the local disk of the WNs, but local disk is limited and boards now come with 4, 8 or more CPUs.

CPUs now have multiple cores, but the disk size per board has not increased.


ATLAS needs space for about 13 GB per core:

-       JumboHIT files: 5GB

-       RECO working files: 5 GB

-       ESD produced: 2.5 GB

-       AOD produced: 0.25 GB




JumboAOD Files – The AODs could be grouped too: 10 AOD files are the input for TAG creation, and in the same step a JumboAOD could be created. A JumboAOD file is 2.5 GB and contains 25000 events.

TAG Tar Files - TAG files are also small, of the order of 100 MB. They could be tarred before transport.


Other Small Files – In ATLAS there are several other kinds of small files:

(NOT stored on tape)

-       DPDs Same format as AOD, same trfto merge

-       CBNTAA, SAN, HighPT

-       Log files: handled by (3) dedicated ftp servers; tarred after use and stored with the data


(stored on tape)

-       User files: a worry! It is not yet known what to do with them.

-       And last but not least RAW data

7.4      ATLAS Data Stored at the Tier-0 Site

The table below shows all the files that will be stored on tape by ATLAS doing INCLUSIVE STREAMING.


The red arrows on the left indicate the files that are going to be stored on tape. Photon runs will generate small files, as shown in the table.


Note: The current Luminosity Block (LB) considered in the table below is 0.5 seconds.


7.5      RAW Files Merging

Problem: combination of (1min LBs, O(5) physics streams, O(5) SFOs) results in relatively small RAW files:

-       About 800MB on average

-       But the mass storage (i.e. tape) systems and data export (DDM) prefer large files


In past meetings we have already suggested and discussed several RAW file handling and merging scenarios

-       Systematic, comprehensive tests and measurements (also by Tier-1s) were planned, but have not taken place so far

-       Schedule conflicts with M* weeks and throughput/functional tests, etc.


There is evidence from past tests that CERN/CASTOR (the Tier-0 setup) and the DDM Tier-0 to Tier-1 export are able to cope with small RAW files.


T.Cass warned that this is true only because ATLAS had all the tape capacity available. NA48 and COMPASS have complained directly to the DH, and the current situation is not suitable for the future.


The original plan was to dedicate the FDR to deciding on the RAW file handling scenario, and still to use small, unmerged RAW files in FDR-1.

But at the December 2007 Tier-1 Jamboree, all Tier-1 sites unanimously requested to go for RAW file merging a.s.a.p.


Some issues are difficult to reconcile:

-       The ATLAS Computing Model requires: archival of RAW data on tape a.s.a.p. after arrival at CASTOR

-       The file merging adds at least another 320 MB/s of Tier-0 internal writing load


-       Asymmetry of files at CERN and at the Tier-1 sites if the same grouping is not done at the Tier-1 sites.
This will require extra book-keeping (mapping of small ↔ merged files).


-       “Real” merging processing (on Batch System level) requires

-       Appropriate software; and CPU power

-       Careful validation of the merged file is needed

-       Can the original, small files eventually be discarded once the merge is done?


Suggestion: “Minimal asymmetric” scenario:

-       Archive small RAW files on tape a.s.a.p. after arrival at CASTOR

-       Register small RAW files with DQ2 (location: CERN)

-       Do tar-ring of RAW files in sequence with the reconstruction job
Adds “minimal” 320 MB/s writing load

-       Put merged RAW files on a temporary CASTOR disk buffer

-       Create merged RAW datasets, register with DQ2

-       Export merged RAW datasets to Tier-1s
NB: Inevitable latency of 24h-48h

-       After successful export: delete CERN copies from CASTOR and DQ2 catalogues

-       Will be tested during CCRC-1


At the MB it was also suggested that the merging of small RAW files could be done at the pit by adding a few resources there.


8.    Storage Efficiency: Small Files – ALICE, CMS, LHCb


8.1      ALICE- Y.Schutz

Y.Schutz reported that ALICE is investigating the issues that make them store very tiny files.


ALICE is tackling the two issues reported:

-       Always using 2 GB files will require some work.
The file size for their MC files is variable; for the moment it is limited by the CPU time used to generate the events. All output is grouped in two tar files, but these do not yet reach the 2 GB limit.

-       Do pre-staging of data in order to read larger amounts of data at once and reduce the number of mounts.

8.2      CMS – M.Kasemann

They already have large files but need to investigate on their repeated mounting of tapes.

8.3      LHCb – Ph.Charpentier

Ph.Charpentier reported that LHCb files should be about 2 GB. Their jobs last 24 hours and each creates a ~250 MB file, which will then require some complex merging. It would be better if this could be avoided by SRM configuration.


9.    AOB




-       Roundtable on the Activities during the Holiday Period: postponed.
As no apparent issues were raised, the item will only be kept in the agenda for next week, in case there are issues to report.


10. Summary of New Actions


The full Action List, current and past items, will be in this wiki page before next MB meeting.


10 Jan 2008 - A.Aimar will ask for the Site Reports for December 2007.


Experiments should review their specific SAM tests and see whether the GridView summary is correct.


A.Aimar will communicate these names to D.Foster and ask him which are the Tier-1 sites that are not properly represented in order to address them directly.


9 Jan 2008 - T.Bell will distribute to the MB a set of metrics that could be also measured by all Tier-1 Sites in order to describe in a uniform way the performance of the tape storage systems.


13 Jan 2008 – Tier-1 Sites should report on whether they can collect the metrics on tape performance proposed by the Tier-0 (T.Bell).