LCG Management Board


Tuesday 12 January 2010 16:00-18:00 – F2F Meeting 




(Version 1 – 20.1.2010)


A.Aimar (notes), J.Bakken, D.Barberis, J.-Ph.Baud, I.Bird (chair), D.Bonacorsi, K.Bos, M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, D.Heagerty, A.Heiss, F.Hernandez, M.Litmaath, P.Mato, H.Meinhard, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon



Action List

Mailing List Archive

Next Meeting

Tuesday 26 January 2010 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)


1.1      Minutes of Previous Meeting

No comments received. The minutes of the previous meeting were approved by the WLCG MB.

1.2      Changes in the MB Membership – I.Bird

I.Bird reported the changes in the IT Department at CERN that will have an impact on the WLCG MB membership:

-       Helge Meinhard is now the contact person for the Tier-0, replacing T.Cass

-       T.Cass will continue to participate in his new role as leader of the DB group

-       Service Coordination issues will go via J.Shiers


A.Aimar will stop acting as Secretary for the MB Meetings and will be replaced by Denise Heagerty.

M.Kasemann will be replaced by I.Fisk as representative for CMS.


I.Bird thanked them both for their participation over the last few years.


2.   Action List Review (List of actions)



OPN Mandate: Experiments and Tier-1 Sites should provide names for working with the OPN on the needs and actions required for Tier-1 to Tier-2 links.

To be done. Already discussed as a matter arising earlier.


3.   LCG Operations Weekly Report (Slides) – J.-Ph.Baud


Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are always available here:

3.1      Overview

This report covers the weeks 14th December 2009 to 10th January 2010.


Running was quite smooth during the end of data taking and the Christmas break, with a mixture of problems.


A few incidents led to Service Incident Reports:

-       Cooling incident at PIC

-       Batch system database server overload at IN2P3


The attendance during the last week was as below.

[Table: daily operations meeting attendance]
3.2      Alarms and Availability Data

The GGUS ticket rate was as usual; below is the summary for the four weeks, with 3 alarm tickets in total:

-       Two alarm tickets submitted by ATLAS: test of CERN-PROD on 14th and 15th December

-       One alarm ticket submitted by LHCb: test of FZK on 15th December

[Table: GGUS ticket summary for the four weeks]
Slide 6 shows that the Availability was high at most sites (mostly green).


D.Britton commented that the unavailability of RAL for LHCb was due to an error in the LHCb SAM tests.

A.Heiss commented that for DE-KIT there was a problem with the tests of the space tokens with dCache.

3.3      Service Incidents Reports

PIC - Cooling Problem

-       Failure in cooling system in the morning of 19th December (cause not known)

-       Ordered fast shutdown of critical services: 3D, CE, SE

-       Cooling problem fixed around 11:00

-       Systems fully restarted at 15:00 (LFC replication via Streams excepted)

-       Procedures being improved and automated


IN2P3 - Batch System DB Server Overload

-       Local batch management system (BQS) failure due to database server overload in the afternoon of 4th January

-       Problem due to a user requesting historical information for December (2 million entries)

-       The user cancelled his request and resubmitted it with wrong parameters (24 million entries), then cancelled it again …

-       Neither job submission nor job status queries were possible

-       MySQL restarted

-       No other fix as BQS will probably be replaced by another batch system.


3.4      Miscellaneous Information

-       LHC produced collisions at 2.36 TeV (world record)


-       ATLAS collected 150TB of data with beam on, a significant fraction will be reprocessed

-       ALICE decided to blacklist sites not running SL5

-       IN2P3 added 60 GB of disk for CMS to store software releases


-       BIGID problem for CASTOR at RAL

-       FS probe problem at RAL on one of the LHCb disk servers

-       Transfer timeouts between PIC and SARA as well as between Weizmann T2 and SARA due to large MTU values used

M.Bouwhuis commented that the issue was due to an OPN problem between CERN and PIC.


-       Job submission failures for CREAM-CE at IN2P3

-       LAN upgrade and 700 TB added at BNL; it was a transparent upgrade, done without stopping the service.


-       CNAF plans to move LHCb data from CASTOR to GPFS and TSM as already done for CMS

L.Dell’Agnello added that CNAF plans to move also ATLAS from CASTOR to TSM.


-       FTS problem at CERN for ATLAS and CMS: “could not load client credentials”: is it FTS 2.2 specific or is it a deployment issue?

A.Heiss reported that the same problem was encountered also at DE-KIT.


J.Gordon asked whether FTS2.2 is the “production” version.

J.-Ph.Baud replied that it is the production version but the upgrade is not recommended until these issues are solved.


-       CERN network instabilities on Friday 8th January (problems in the core routers)



4.   Update on Accounting (HEPSPEC, Storage, Installed cap.) (Slides )  – J.Gordon



4.1      HEPSPEC06

All sites were asked to benchmark their current hardware and publish:

-       89 sites publishing from 21 countries, Gstat2 knows about 349 sites.


Most countries only publish a few (see table below).


Sites measuring HEPSPEC06 convert the values to SI2K for publishing. All APEL numbers are normalised values in SI2K.
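The conversion described above can be sketched as follows. This is a minimal illustration, assuming the commonly quoted scaling of 1 HEP-SPEC06 = 250 SI2K; the factor is not stated in these minutes and should be treated as an assumption.

```python
# Sketch of the HEPSPEC06 <-> SI2K normalisation used when publishing to
# APEL. The factor 250 (1 HEP-SPEC06 = 250 SI2K) is an assumed, commonly
# quoted value, not taken from these minutes.

HS06_TO_SI2K = 250  # assumed conversion factor

def hs06_to_si2k(hs06: float) -> float:
    """Convert a HEP-SPEC06 score to SpecInt2000 units for publishing."""
    return hs06 * HS06_TO_SI2K

def si2k_to_hs06(si2k: float) -> float:
    """Inverse conversion, e.g. for display on an accounting portal."""
    return si2k / HS06_TO_SI2K

# Example: a worker-node core benchmarked at 8.0 HS06
print(hs06_to_si2k(8.0))  # 2000.0
```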


The CESGA Portal was asked to optionally convert the values when displaying them; this is on their roadmap, but they have no plans yet for a mass conversion of raw data. They could convert it in the monthly reports.

[Table: sites publishing HEPSPEC06, per country]

4.2      R-GMA

A new version of the APEL client, which uses ActiveMQ to transfer data to the central repository, is under certification. Meanwhile a new version of the central server, which accepts data from ActiveMQ as well as R-GMA, is being readied for production.


Once these two tasks are complete alternative publishers can be developed (OSG, DGAS, etc).

Regional repositories will not be available before EGI.

4.3      Storage Accounting

Gstat2 has developed collectors for the new storage schema. A new RAL staff member started working on storage accounting in December and will present specifications at February's GDB for comments.


It will harvest data from the BDII, so the Nagios sanity checks are also relevant here. One can check the results now in Gstat2 at this URL:

4.4      Installed Capacity

These values were sanity checked:

-       Installed CPU numbers are known to Gstat2

-       SI2K totals are known to Gridview for the reliability reports

-       Online and Nearline site storage totals are available to Gstat2


Regular reports for T1 and T2 are on the Gstat2 work plan, but to start they will be manually merged with the CPU reports. No dates are known for the implementation.


The patch allowing the APEL client to use the new normalisation value is in certification. It will allow sites that normalise CPU time in their batch system to publish the correct CPU capacity.

4.5      User DN Publication

The CESGA portal shows the Sites not publishing the User DN; they should be reminded that this is not acceptable.


I.Bird asked that regular reports be produced on the different metrics mentioned in the talk. Only with regular reporting will the Sites act.



5.   Update on Multi-User Pilot Jobs (MUPJ) (Slides) – M.Litmaath


5.1      Technical Forum Working Group

The Tech Forum working group started right after the December GDB. There are now 59 members, and others are welcome.

The group's list has accumulated almost 200 messages so far.


A Summary Wiki has gone through 6 versions.

To view that page the browser needs an IGTF certificate loaded. The questionnaire was sent to the T1+T0 representatives yesterday and the majority of responses are expected within 2 weeks.


The next activity foreseen in the WG is to produce recommendations.

5.2      Summary Wiki

1. Introduction

-       What are pilot jobs?

-       Single- vs. multi-user pilots

-       What is gLexec?


2. Boundary conditions

-       Mainly JSPG policies

-       Some adjustments could turn out to be desirable, as some wording does not match the policy


3. Benefits of pilot jobs compared to "classic" jobs

-       Also single- vs. multi-user pilots


4. Issues for efficient/correct scheduling of pilot jobs

-       A single class of pilot jobs may not be a panacea


5. Drawbacks of multi-user pilots

-       Mainly issues surrounding gLexec


6. Multi-user pilot jobs with identity change

-       Pro: complete separation of users

-       Con: setuid complications


7. Multi-user pilot jobs without identity change

-       Pro: no setuid complications

-       Con: incomplete separation of users


8. Legal considerations

-       Some sites may have more constraints than others


9. Virtual machines (one VM per job)

-       Will simplify matters


A short discussion followed but M.Litmaath suggested checking the explanations in the Summary Wiki.

5.3      Questionnaire

A questionnaire was sent to all Sites in order to understand clearly the situation on MUPJ.


1. Does your site policy allow the use of multi-user pilot jobs by the LHC experiments you support?   (no/depends/yes)

-       If no, why?


2. Does your site policy support the use of gLexec in setuid mode?   (no/allow/require)

-       If no, why?


3. Does your site policy support the use of gLexec in log-only mode?   (no/allow/require)

-       If no, why?


4. When gLexec returns an internal error (e.g. SCAS/Argus/GUMS temporarily unavailable), does your site policy allow the pilot to continue and run the payload itself?   (no/depends/yes)

-       If depends, on what?


I.Fisk noted that the reliability of GUMS and SCAS/gLexec is relevant whether one uses MUPJ or not, and they should be evaluated.

M.Schulz replied that for the moment the SCAS solution is not as reliable as it will be, but it will improve. For the moment SCAS may be a bottleneck for a Site.

J.Templon noted that Experiments could exercise SCAS for a fraction of their jobs and one could evaluate the reliability.

Ph.Charpentier replied that the Experiments do not have time for those tests.



Tier-1 Sites should collect the answers to the MUPJ questionnaire from their Tier-2 Sites. They are expected to report on the answers from their countries.


Ph.Charpentier noted that there could also be a question about MUPJ without gLexec.

M.Litmaath replied that it is implicit and Sites can add their additional comments. He reminded the MB that, in case of legal issues, one must show that everything possible was done to correctly collect authorization information.


J.Templon noted that it would be better to have some external body assessing/certifying the security situation.

I.Bird noted that the situation is good: the necessary due diligence has been done on the matter, and the advice given by the experts of the policy group, which said to use for example gLexec, has been followed.


D.Barberis noted that some Tier-2 Sites will not have a reference Tier-1.

I.Bird suggested sending the questionnaire to the Collaboration Board mailing list even if the Tier-1 will report on their Tier-2 Sites.



6.   ALICE Quarterly Report (2009Q4) (Slides) – Y.Schutz



Y.Schutz presented the Quarterly Report 2009Q4 for ALICE.

6.1      Data Taking

Data taking proceeded with all installed detectors from the first collision onwards:

-       1 million collision events

-       365 GB of RAW data at the Tier-0

-       Replicated twice to external Tier-1 Sites, but only after the end of data taking.


The data migration strategy was changed. During data taking data are migrated from the DAQ disk buffer to the ALICE CASTOR disk pool (alicedisk) for temporary storage. The data are then optionally migrated to the CASTOR permanent data storage (t0alice).


CASTOR v.2.1.8 was extremely stable throughout the data taking




6.2      Data Processing

Pilot reconstruction and analysis is performed for a fraction of the run, typically a few thousand events, on the CAF as soon as data are transferred to CASTOR and registered in the AliEn file catalogue. This method provides quick feedback to run coordination on data quality.


Data reconstruction is automatically launched (first pass) at T0 at the end of the run, and ESDs are available for analysis a few hours later in 3 SEs. The first pass reconstruction success rate was ~96%. Second pass reconstruction was run during the Christmas break at T0+T1s, with a success rate of ~98%. The analysis trains have run several times over the entire set of reconstructed data of pass 1 and pass 2. MC production was executed at all sites, with several 'Early physics' production runs with RAW conditions data.


On average 2600 jobs were running concurrently.




6.3      ALICE Software

A number of fixes were needed to the algorithms, when confronted for the first time with real collision data.

The main issue found was high memory usage, ~4 GB, for RAW reconstruction. The main effort is concentrated on reducing the memory usage; the aim is 2 GB, after calibrating the detectors with the collected data sample. A new release was made on January 15.

6.4      Services: SL5 and CREAM

Priority was given to SL5 migration and all Tier-1 Sites and most of the T2s have migrated. Four T2s are still blocked (Athens, PNPI, UNAM, Madrid) because they do not run SL5. All is expected to be done by the end of this month.


CREAM CE deployment has reached 50% of the ALICE sites, without recent progress. Dual submission via CREAM and WMS is still the norm; this is not the setup desired by ALICE, and they continue to work with the sites on the deployment of CREAM.

6.5      Milestones

The ALICE updated milestones are:

-       MS-130 15 Feb 10: CREAM CE deployed at all sites

-       MS-131 15 Jan 10: AliRoot release ready for data taking

6.6      Conclusions

The first data taking period has been a full success for ALICE in general and for ALICE computing in particular: data flow and data processing went as planned in the Computing Model. The Grid operation has been smooth and the sites in general delivered what they had pledged.


Two main concerns remain: the excessive usage of memory, which prevents running efficiently at all T1 sites, and achieving uniformity of the submission system (i.e. CREAM at all Sites) before the start of data taking.


F.Hernandez asked whether the MC jobs still require 2 GB and whether this is specified in the job descriptions for the CE.

Y.Schutz answered that only the reconstruction jobs temporarily require 4 GB, and for the moment these cannot run at IN2P3.



7.   ATLAS Quarterly Report (2009Q4) (Slides) – D.Barberis


7.1      Tier-0 and data-taking activities

In October 2009 ATLAS took global cosmics.


In mid-November, at the start of LHC data taking, ATLAS was ready, with an open trigger, low thresholds, and the full calorimeter read-out. Data averaged 5 MB/event.


The instantaneous rate was limited to 800 MB/s, but the average event rate was very low anyway.

This produced large RAW files, but small ESDs.


Cosmics runs are interleaved with LHC runs when the machine is off. They are needed (together with beam halo) for detector alignment to constrain the weak distortion modes that cannot be constrained by tracks originating from the collision point.


I.Bird asked whether the storage of cosmics data was already planned originally and required.

D.Barberis replied that this is within the existing requests of ATLAS.


ATLAS has accumulated almost 1 PB including replicas. All data was processed in real time at Tier-0 and there were no surprises with respect to MC events in terms of CPU and memory.


7.2      Data Distribution Pattern and Performance


All RAW data was sent to disk and tape at each Tier-1 according to the Tier-1 shares. Moreover, all RAW data goes to disk at BNL, Lyon and SARA. The normal model is tape at each Tier-1 by Tier-1 share, with no extra RAW data on disk at CERN except for the CAF.



All ESD data went to disk at each Tier-1. The normal model is two copies distributed over all Tier-1s, with a full ESD copy on disk at CERN and ESD data on disk at Tier-2s by Tier-2 share.


AOD and dESD (skimmed data) went to disk at all Tier-1s; the normal model is 2 copies kept in all Tier-1s only. They were also copied to disk at Tier-2s by Tier-2 share (total ~18 copies); the normal model is 10 copies in the Tier-2s only.

The additional copies will be reduced dynamically to make room for 2010 data.


Below is the total data throughput over a month in November and December.



All data were delivered to Tier-1s and Tier-2s using open datasets: RAW during data taking (run in progress), and ESD etc. during Tier-0 processing, as soon as outputs were available.


Data were available for analysis at Tier-2s on average 4 hours after data taking, including the time for Tier-0 processing.



7.3      Data Reprocessing

An "ultra-fast" reprocessing campaign was run on 21-31 December 2009, using the last Tier-0 software cache plus a few last-minute bug fixes, with the most up-to-date calibrations and alignments for the whole period.


Only 22 RAW->ESD jobs failed out of 130148, and 27 ESD->AOD jobs out of 10001.


A few software bugs affecting beam splash events are being followed up. The next reprocessing round will take place in February using release 15.6.3.X, being built now. It will also be a test of the 15.6.X.Y releases to be used at Tier-0 next month. All was on SLC5/gcc4.3 only.



7.4      Simulation Production

Simulation production continues in the background all the time. It is only limited by physics requests and the availability of disk space for the output.


A parallel effort is underway for MC reconstruction and reprocessing. It includes reprocessing of the MC09 900 GeV and 2.36 TeV samples with AtlasTier0 reconstruction, the same release as for data reprocessing.


7.5      Analysis Data Access


Data were analysed on the Grid already from the first days of data-taking as shown below.



Several "single" users submitted event selection and/or analysis jobs on behalf of their performance or physics working group. The output is made of ntuples that are copied to group space and then downloaded by end users.


In this work model the number of real Grid users is somewhat underestimated.

7.6      Plans

Restart data-taking with separate detector runs during January, with no data export.


ATLAS will start global cosmics run first week of February, with start of Tier-0 data processing and export.


ATLAS is ready for LHC beams mid-February


J.Gordon asked whether ATLAS expects more Sites to move to SL5.

D.Barberis replied that the situation as it is now is fine for ATLAS. ATLAS expects all sites to have moved to SL(C)5 by now. The software release that is in validation now, to be used at the re-start of data taking, will be distributed only for the SLC5 (and compatible) platform.


F.Hernandez asked about ATLAS testing the CREAM CE.

D.Barberis replied that ATLAS will do its CREAM tests when it is tested by the other Experiments.



8.   CMS Quarterly Report (2009Q4) (Slides) – I.Fisk



8.1      Introduction

The CMS Distributed Computing System generally performed well with the addition of collision data. The data rates and sample sizes are still quite low and the system was not resource constrained during this early period.


The workflows and activities were generally what was expected from the computing model, but the workflows could be executed much more frequently:

-       Data Multiply Subscribed. More T1 (four) and T2 (about 13) subscriptions than would happen with more data.

-       Re-processing occurred every 2-3 days instead of every couple of months.


Data Reconstruction, Skimming, Re-reconstruction at Tier-1s could run in parallel with distributed user analysis and MC production at Tier-2s.

8.2      Data Collection Infrastructure

Tier-0 and Tier-1 Re-reco and Data Distribution Systems functioned with early collisions. Events were reconstructed and exported to Tier-1 sites, the express stream latency was at target levels and data was re-reconstructed using Tier-1 centers. In addition the Prompt Skimming system was moved into production.


The Tier-0 Facility had been routinely exercised with cosmic data taking and simulated event samples and was performing stably with Cosmics data with very few failures as shown below.



Below is the same for collisions, with the failures concentrated in the setup phase.



The ~3000 cores at CERN were all used, with local submission to the farm with multiple workflows.

Overall there was very good stability and performance of the CMS software.

CMS also received confirmation from CERN on the T0+CAF pledge for 2010.

8.3      Data Distribution and Access

Below is the data distribution from CERN or a Tier-1 to a destination at another Tier-1.



Below instead is the distribution with a Tier-1 as source and a Tier-2 as destination.


8.4      Site Stability

Tier-1 readiness for November and December is shown below, with mostly all 7 CMS Tier-1 Sites ready.

Readiness is defined as passing the CMS, SAM, Job Robot, and Transfer tests for a high percentage of a time window.



About 40 Tier-2 Sites are constantly available, but this number has not improved much over the two months.


8.5      Activities over the end 2009 Break

Data Processing Activities during the break were the following:

Re-processing and skimming of all good runs finished on 12/24 for the two large physics datasets:

-       ZeroBias: 22M RAW events, 1019 files processed. 11 TB produced, 112M events in Secondary Datasets, AlcaReco etc.

-       MinimumBias: 21.5M RAW events, 1207 files processed. 10 TB produced, 74M events in Secondary Datasets, AlcaReco etc. distributed. Processed for two software releases (on SL5 and SL4).

-       Re-processing of MC datasets finished on 12/25: 20M MinimumBias events

-       Re-processing of Cosmics MC finished on 12/25: 130M events


Almost problem-free processing of high-quality data.

For the latest CMSSW version only one of >2000 jobs failed (due to memory consumption); all was done within 4-5 days.

8.6      MC Production

Smooth MC Production over break with 120M events produced (RAW, RECO, AOD); including special MinBias samples for comparison with 900GeV and 2.36TeV data. Most was with Full Simulation, some with Fast Simulation



8.7      Current Activities and Improvements

CMS operated in an environment without resource constraints; the data rate and complexity are lower than expected in the final system, which allowed many more passes and caused some complaints about lack of utilization. The number of users is also lower.


While CMS sees the ability to replicate data to Tier-2s, it is taking advantage of the oversubscription. This is needed in anticipation of achieving good Tier-1 to Tier-2 performance when the data will be accessible from fewer places.



The CMS Computing TDR defines the burst rate Tier-1 to Tier-2 as 50MB/s for slower links up to 500MB/s for the best connected sites. We have seen a full spectrum of achieved transfer rates. The Average Observed Daily Max peaks at the lower end.
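The "Average Observed Daily Max" metric mentioned above can be illustrated with a short sketch: take the maximum observed transfer rate per day, then average those daily maxima over the period. The function and sample data below are hypothetical, not from CMS monitoring.

```python
# Hypothetical sketch of an "Average Observed Daily Max" computation:
# per day, take the maximum Tier-1 -> Tier-2 transfer rate observed,
# then average the daily maxima. Data and names are illustrative only.
from collections import defaultdict
from statistics import mean

def average_daily_max(samples):
    """samples: iterable of (day, rate_mb_s) transfer-rate observations."""
    per_day = defaultdict(list)
    for day, rate in samples:
        per_day[day].append(rate)
    daily_maxima = [max(rates) for rates in per_day.values()]
    return mean(daily_maxima)

samples = [("d1", 40.0), ("d1", 55.0), ("d2", 70.0), ("d2", 30.0)]
print(average_daily_max(samples))  # 62.5
```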


From the size of the facilities and the amount of data hosted, CMS has planning estimates for how much export bandwidth should be achievable at a particular Tier-1. No Tier-1 has been observed to hit the planning numbers (though a couple have approached them).


CMS would like to organize a concerted effort to exercise the export capability. Need to work with site reps, CMS experts, FTS and Network experts. This is an area for collaboration.

8.8      Outlook

In conclusion, the Distributed Computing worked well during the opening collision data period for CMS. CMS thanked CERN and the Tier-1 sites for keeping things working. Some items are to be followed up on because they are not yet working at the rates anticipated in the planning.



9.   LHCb Quarterly Report (2009Q4) (Slides) – Ph.Charpentier



9.1      Activities in 2009Q4

This is a summary of the LHCb activities:

-       Core Software: Stable versions of Gaudi and LCG-AA

-       Applications: Stable as of September for real data. Fast minor releases to cope with reality of life.

-       Monte-Carlo: Some MC09 channel simulation. Few events in foreseen 2009 configuration. Minimum bias MC09 stripping

-       Real data reconstruction: As of November 20th


Below are the Number of Jobs per day and CPU Usage by Job Type.





9.2      Experience with Real Data

Very low crossing rate with a maximum of 8 bunches colliding (88 kHz crossing) with very low luminosity.

Minimum bias trigger rate: from 0.1 to 10 Hz. Data was taken with single beam and with collisions. Only 217 GB.





9.3      Real Data Processing

This was an iterative process, with small changes in the reconstruction application and improved alignment. In total 5 sets of processing conditions were used, and only the last files were all processed twice.


Processing submission

Automatic job creation and submission after:

-       The file is successfully migrated in CASTOR

-       The file is successfully replicated at a Tier-1

If a job fails for a reason other than an application crash:

-       The file is reset as "to be processed"

-       A new job is created and submitted

Processing is more efficient at CERN (see later):

-       Eventually, after a few trials at a Tier-1, the file is processed at CERN

No stripping:

-       DST files are distributed to all Tier-1s for analysis
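The retry logic described above can be sketched as follows. This is a hypothetical illustration, not DIRAC's actual implementation; in particular, the retry threshold is an assumption, since the minutes only say "a few trials".

```python
# Hypothetical sketch of the LHCb processing retry logic described above:
# on a non-crash failure the file is reset and resubmitted, falling back
# to CERN after a few Tier-1 attempts. All names are illustrative.
MAX_TIER1_ATTEMPTS = 3  # assumed threshold; the minutes say "a few trials"

def next_site(attempts):
    """After a few failed trials at a Tier-1, fall back to CERN."""
    return "CERN" if attempts >= MAX_TIER1_ATTEMPTS else "Tier-1"

def handle_result(file_state, status):
    """Return the (possibly updated) file state after a job finishes."""
    if status in ("done", "application_crash"):
        file_state["status"] = status  # crashes are not retried automatically
    else:
        file_state["status"] = "to_be_processed"  # reset for resubmission
        file_state["attempts"] += 1
        file_state["site"] = next_site(file_state["attempts"])
    return file_state

f = {"status": "processing", "attempts": 2, "site": "Tier-1"}
print(handle_result(f, "transient_failure")["site"])  # CERN
```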


9.4      Issues Encountered

A few issues were encountered by LHCb:

-       CASTOR migration: Very low rate; the migration algorithm had to be changed for more frequent migration

-       Issue with large files (above 2 GB): Real data files are not ROOT files but are opened by ROOT. There was an issue with a compatibility library for slc4 32-bit on slc5 nodes. Fixed within a day

-       Wrong magnetic field sign: Due to different coordinate systems for LHCb and LHC. Fixed within hours

-       Data access problem when accessing by protocol, directly from the server: a dCache issue at IN2P3 and NIKHEF. The dCache experts are working on it. LHCb moved to the copy mode paradigm for reconstruction. Still a problem for user jobs, and some sites have been banned for analysis

9.5      Transfers and Latency

No problems were observed during file transfers. Files were randomly distributed to the Tier-1s; LHCb will move to distribution by runs (a few hundred files each). For 2009, runs were no longer than 4-5 files.

Grid latency, in terms of time between submission and jobs starting to run, was very good.



9.6      Usage of MUPJ in LHCb (not presented, copied from the slide)

The so-called Multi-User Pilot Jobs (MUPJ) are used by DIRAC on all sites that accept role=Pilot:

-       They are just regular jobs!

MUPJs match any DIRAC job in the central queue:

-       Production or user analysis (single queue)

-       Each pilot job can execute sequentially up to 5 jobs, if the remaining capabilities allow (e.g. CPU time left)

-       The MUPJ has 5 tokens for matching jobs

-       The role=Pilot proxy can only retrieve jobs, limited to 5

A limited user proxy is used for DM operations of the payload:

-       It cannot be used for job submission

Proxies can be hidden when not needed.

DIRAC is instrumented for using gLexec (in any mode).
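The pilot behaviour described above can be sketched as follows. This is a minimal illustration of the token/CPU-budget logic, with hypothetical names; DIRAC's real pilot implementation differs.

```python
# Hypothetical sketch of the MUPJ behaviour described above: a pilot holds
# 5 match tokens and executes jobs from the central queue sequentially,
# stopping when tokens, CPU time, or queued jobs run out. Names are
# illustrative, not DIRAC's API.
def run_pilot(fetch_job, run_payload, cpu_left, tokens=5):
    """fetch_job() -> job or None; run_payload(job) -> CPU seconds used."""
    executed = 0
    remaining = cpu_left
    while tokens > 0 and remaining > 0:
        job = fetch_job()          # match a production or user-analysis job
        if job is None:
            break                  # central queue is empty
        tokens -= 1                # each match consumes one token
        remaining -= run_payload(job)
        executed += 1
    return executed

# Example: 7 queued jobs of 100 CPU-seconds each, 1000 s available;
# the 5-token limit stops the pilot after 5 payloads.
jobs = iter(range(7))
print(run_pilot(lambda: next(jobs, None), lambda j: 100, cpu_left=1000))  # 5
```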


First experience

-       Problems are not with gLexec but with SCAS configuration


LHCb is not willing to lose efficiency due to the introduction of badly configured gLexec:

-       Yet another point of failure!

-       LHCb cannot afford to test all sites individually at once

This topic has been dragging on for over 3 years now:

-       Where is the emergency? Why did it take so long if it is so important?

LHCb proposes to reconsider the policy pragmatically:

-       The policies were defined when the frameworks had not been evaluated, and the VOs had to bite the bullet

-       Re-assessing the risks was not really done in the TF (yet)

-       The questionnaire leaves decisions to sites; LHCb got no message that sites are unhappy with the current situation

-       MUPJs are just jobs for which their owner is responsible: move responsibility to the MUPJ owner ("the VO"), with a two-tier trust relation (Sites / VO / User), and apply a posteriori control rather than a priori mistrust

-       VOs should assess the risk on their side

9.7      Conclusions

LHCb concentrated on real data, even if there was very little of it (200 GB); it was a very important learning exercise.


A few improvements identified for the 2010 running

-       Run distribution (rather than files)

-       Conditions DB synchronization check

-       Make sure Online Conditions are up-to-date


There are still some MC productions with feedback from the first real data, e.g. the final position of the VeLo (15 mm from the beam).


The first analysis of 2009 data was made on the Grid, and LHCb foresees a stripping phase for V0 physics publications.


LHCb definitely wants to continue using MUPJs.


F.Hernandez reported that IN2P3, like SARA, observed problems accessing files using GSI dCap. A ticket was submitted to the dCache team and they reproduced the problem. A patch is available for testing, but IN2P3 cannot reproduce the problem. He asked whether LHCb is still banning IN2P3.

Ph.Charpentier replied that he will have to verify offline.



10.    AOB




I.Bird reminded the MB that there is the LHCC Review on the 16th February. Sites and Experiments should check the agenda.

The issue of resources for 2011 was raised; without further news, 2011 will be assumed to be a nominal year.


The 2010 calendar of the MB meetings should be fixed for the whole year.



11.    Summary of New Actions