LCG Management Board
Tuesday 27 May 2008, 16:00-17:00 – Phone Meeting
(Version 1 - 30.5.2008)
A.Aimar (notes), K.Bos, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, S.Foffano, F.Giacomini, J.Gordon, F.Hernandez, M.Kasemann, O.Keeble, M.Lamanna, H.Marten, G.Merino, A.Pace, Di Qing, Y.Schutz, J.Shiers (chair), R.Tafirout, J.Templon
Mailing List Archive
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 High Level Milestones for 2008 (HLM_20080526)
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
- 31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite.
- J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.
Done in the previous MB meeting.
The OSG tests are described here: http://rsv.grid.iu.edu/documentation/help/.
The proposed new list of critical tests is available here:
- 31 March 2008 - ALICE, ATLAS and CMS should provide the read and write rates that they expect to reach, in terms of clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.
Not done but will be removed, as agreed last week.
Sites and Experiments will have to raise specific issues when needed.
M.Kasemann commented that CMS will send updated values to the Sites during this week.
No comments from ATLAS and ALICE.
H.Renshall is back and A.Aimar will ask him for an update to the MB in the coming weeks.
Obsolete. Replaced by the “Operations Alarms Page” action.
A new page for alert emails for the sites is to be prepared; it is separate from the contact pages.
It is not the same as the contact page already prepared (https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails).
It should be a different list, used only for alerts and including only a few people from the Experiments.
Removed. Will be included in the wiki page to fill.
M.Schulz reported that few sites had agreed to deploy the LCAS solution. It will be installed on the pre-production test bed PPS at CERN and LHCb will test it.
Other sites that want to install it should confirm.
Ph.Charpentier asked that the PPS installations should allow access to the existing production data.
Done: The page for Operations Alarms is here: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage
31 May 2008 - Sites and Experiments should complete the Operations Alarms Page.
3. CCRC08 Update (CMS CCRC08 Phase 2 elog; Draft Agenda of June Post-mortem Workshop; LHCb CCRC08 meeting of May 26; Slides)- J.Shiers
J.Shiers presented the weekly summary of status and progress of the recent CCRC08 activities.
No comment received. For the Experiments the MB contacts are set as default speaker and they should send the names by the end of the week.
3.1 CCRC08 May Report
This was the last week of the CCRC08 challenge. The report focuses on the basic Grid infrastructure and on the Sites, not on the Experiments’ impressive results. The reports from CMS (Link) and LHCb (Link) are available.
NIKHEF: increased installed disk by 120TB. Now have 88TB of space allocated to ATLAS. Problem with cooling last week forced the power-off of WNs. DPM 1.6.7 problem (lifetime reset to 1.7 days after resize of space reservation) – upgraded to latest production release that was not the one for the CCRC08 exercise. This was discussed prior to start of CCRC ’08 – but insufficient time for adequate testing.
RAL: closed for public holiday on Monday 26th (came up in relation to planned LCG OPN intervention). Problems will be dealt with by on-call system.
LCG OPN: There will be a software upgrade of the second CERN router that connects to Tier1s. This is to correct a bug affecting routing for backup network paths. It will need a reboot, and hence cause a downtime of ~5 minutes per router. It was agreed to go ahead with the upgrade on Wednesday if no problems are seen before then. Reports are here: http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
Phone Conferencing: Problems seen with the Alcatel phone conferencing system due to “the SIP service crashed on one of the servers”.
Hopefully this will be monitored and alarmed; the current level of support is not yet adequate.
Core Services: The details are on slides 5 and 6. At the meeting it was only mentioned that the main issues were:
- Several problems and actions on Data services in relation with the ongoing CCRC tests.
- A long standing (since beginning of CCRC) CASTORCMS/T1 transfer GC problem was finally understood by the CASTOR development team.
- Additional SRM problems were seen over the weekend. The email@example.com e-mail list was used to report these problems but further clarification of the follow-up is required. The problem was solved very quickly, but actually in a way unrelated to the alarm.
- Tested the emergency site contacts phone number for TRIUMF, which highlighted two problems: the printed phone numbers in B28 are out of date, and the people in the TRIUMF control room did not seem to be aware of the procedures for informing (Grid) contacts at the site.
R.Tafirout acknowledged the issue and reported that it had been fixed. The instructions are that if CERN calls TRIUMF R.Tafirout should be called immediately.
T.Cass noted that the message should contain both (1) instructions for the operators and (2) a detailed description for the experts to whom the message will be forwarded.
- The capture process on the online database for the ATLAS Streams environment was aborted on Tuesday 20th around 02:23 due to a memory problem with the logminer component. The system was back in sync within several minutes of the restart.
- High-load has been seen from the CMS dashboard application and traced to a specific query. Follow-up and tuning are planned.
- Various problems are affecting the SRM DB – high number of connections, threads, locking issues etc. Still being investigated but clear actions should be decided. The implications and actions will be discussed with the Experiments in the immediate future.
LHCb reported some issues at different sites:
- Long-standing problems at IN2P3 (a gsidcap door issue?) and at RAL, where jobs that need to access files from the WN also crash (still to be understood).
- The “File access” issues continue at some sites and, since Monday, are also starting to be a problem at NIKHEF.
- Encouraging (first) results from the procedure of first downloading data into the WN and then accessing it locally.
- WN /tmp issue (FZK, NIKHEF) with files still used by running jobs but being cleaned up too soon by the site’s clean-up scripts.
H.Marten reported that the /tmp problem is now solved at FZK.
J.Templon added that NIKHEF prefers not to change the clean up scripts but rather to move the files for a minimum amount of time out of /tmp.
F.Hernandez added that IN2P3, other Sites and Experiments had agreed that jobs should not create files to keep in /tmp.
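The /tmp issue above comes down to a clean-up policy that must not remove files belonging to jobs that are still running. A minimal sketch of one possible selection rule follows; it is not the script used at any of the sites, and the `TmpFile` record, the account names and the one-day threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TmpFile:
    path: str
    owner: str    # local account that created the file (hypothetical names below)
    mtime: float  # last modification time, in seconds since the epoch


def files_safe_to_delete(files, active_owners, now, min_age_seconds):
    """Return the paths that are old enough and not owned by an account with running jobs."""
    return [
        f.path
        for f in files
        if now - f.mtime >= min_age_seconds and f.owner not in active_owners
    ]


# Illustrative data only: three files, one account still running a job.
files = [
    TmpFile("/tmp/job_a.dat", "pool001", mtime=1000.0),
    TmpFile("/tmp/job_b.dat", "pool002", mtime=1000.0),
    TmpFile("/tmp/job_c.dat", "pool003", mtime=90000.0),
]
deletable = files_safe_to_delete(
    files, active_owners={"pool001"}, now=100000.0, min_age_seconds=86400
)
print(deletable)
```

In this example only the file that is both older than the threshold and owned by an idle account is selected; the file of the running job and the recent file are kept.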
Monitoring: The GridMap service was moved onto new hardware and receives about 25K hits per day. It was also agreed to do systematic follow-up on problems reported and to check whether they were picked up by monitoring.
Some examples shown in slides 10-13.
A summary page of the LHC should also be known and consulted.
CCRC08 Monitoring by Experiments - Experiment “shifters” use Dashboards, experiment-specific SAM-tests (+ other monitoring, e.g. PhEDEx) to monitor the various production activities. Problems spotted are reported through the agreed channels (ticket + elog entry).
The response is usually rather rapid – many problems are fixed in (<)< 1 hour. A small number of problems are raised at the daily (15:00) WLCG operations meeting. A weekly review checks whether any problems were not spotted by the monitoring. With time the goal is to increase automation.
Possible Improvements before data taking
- Tier2s: MC is well run in, but distributed analysis still has to be scaled up to much larger numbers of users.
- Tier1s: data transfers (T0-T1, T1-T1, T1-T2, T2-T1) now well debugged and working sufficiently well; reprocessing still needs to be fully demonstrated for ATLAS.
- Tier0: Data / storage management services still suffer from load / stability problems. These will have to be carefully watched during initial data taking. Only a very few configuration changes are now feasible before data taking.
4. ALICE QR Presentation (Slides) Y.Schutz
Y.Schutz presented the ALICE Quarterly Report for the period March-May 2008.
4.1 MC Production
ALICE is progressing with the continuous MC production: 15 production cycles were completed, including a first iteration of MC for first Physics. A second iteration of MC for first Physics is pending a different detector misalignment.
The storage used corresponds to 160 million MC events and 80 TB of ESD data. The output data was carefully tuned to save storage. Most of the data is available on T2 SEs for end-user analysis.
Below is the list of Storage Elements deployed by ALICE so far.
The Analysis activities also continued.
For End User analysis, the primary copy of the data is on the Tier-2 SEs and jobs run at the T2 holding the required data. Single-user analysis jobs yield low CPU/Wall efficiency because they are always I/O bound.
ALICE started the Analysis Train (organized analysis), in which many tasks process a single data stream; this is expected to reach the nominal CPU/Wall efficiency.
The CAF analysis runs PROOF and uses CAF local data. Data is imported from the Grid and stored locally on PROOF nodes; the work is done on the data sets chosen by the Physics Working Groups.
4.2 Raw Data Activities
ALICE tested Data Taking and online activities and data movement:
- Online data replication to the ALICE Tier-1 sites worked well.
- Online reconstruction is working.
- Online first pass reconstruction was not successful because of the frequent changes in the AliRoot code
- Condition parameters calculation was mostly working (DAQ, HLT, and DCS) and the conditions framework is fully operational.
- Monitoring and QA framework is operational but not all detectors have completed their implementations.
The Data was reconstructed offline after data taking. Pass 1 reconstruction processed at T0, with data stored in CASTOR2 at CERN (for a total size of ~20 TB).
Of all data, 98% are reconstructible, with a total RAW size of about 200 TB. The End user analysis was performed by detector experts (on Grid, local and CAF).
The Data was replicated to T1s: 90% quasi online, during data taking, and the remainder offline. Pass 2 reconstruction is being processed at T1s, with data in the MSS at the T1s and with very careful evaluation of the output data to store.
A more strict release policy was implemented. 2x versions for cosmic reconstruction, every second week.
The current version for MC production for first physics:
- firstname.lastname@example.org/10 TeV, w/wo B field
- Detectors as installed
- Store minimum needed for analysis
- Size of ESD/AOD within Computing Model values
The open issues concerning AliRoot are:
- The raw data format is not yet final and a few detectors have not completed their description.
- Aggressive optimization on keeping memory footprint below 2GB/process (<1.5 GB on 32 bit nodes and < 2.5 for 64 bit nodes)
4.3 ALICE Analysis Train
The ALICE Analysis Train provides access to all analysis platforms with the same user code:
- Usage: local, AliEn(Grid), CAF(PROOF)
- Wagons (user code) provided by the PWGs
It was tested with large-scale analysis of PDC08 data; optimization of the Physics WG code is ongoing with MC first Physics data.
4.4 CAF Analysis
Currently the events are grouped in PROOF datasets generated from Grid file collections. The files can be staged from any Grid SE and ALICE has implemented user quotas for disks and CPU resources. The CAF/PROOF combination provides to the users the look and feel of an interactive session.
CAF/PROOF is critical for fast task analysis of RAW, calibration and ESD data, and its success has increased the number of users. ALICE hopes that the support for PROOF will not be interrupted.
4.5 ALICE Grid Services
The ALICE Grid services were fully migrated to SLC4, with 64-bit VO boxes installed at all sites with 64-bit WNs. The new AliEn version (v2‐15) has been deployed at all sites (67 in total). It includes improvements in package management and job management (user quotas and priorities), and secure access to storage is now fully implemented.
The ALICE resources accounting is:
- Per process based – individual CPU types are taken into account
- SI2K scaling factors taken from HEPiX benchmarking, discussed with site managers
J.Gordon noted that the correct data is from APEL and uses the scaling factors communicated by each site in the BDII servers. The value in the BDII takes into account the variation of nodes across the resources at the site.
J.Templon added that the benchmarking done by ALICE gives different values than the SPEC benchmarks run by the sites themselves.
ALICE noticed discrepancies with respect to the WLCG accounting at T1s and T2s:
- Pledges different from the ones at October C‐RRB
- Sites without VO box deliver resources
- Delivered resources are systematically different from the pledged ones
A harmonization is needed in order to allow the Scrutiny group to work. ALICE trusts its own accounting.
For instance in the table below:
- Tier-1 accounting vs. the one of the October RRB (circled in red)
- And what is declared delivered to ALICE vs. what is measured by ALICE MonALIsa accounting (circled in green)
The difference in CERN is not a problem because ALICE does not include the CAF resources.
Also for Tier-2 sites the differences are very relevant. See the table below.
There are sites reported as providing resources to ALICE that do not have a VO Box installed, which therefore seems impossible (in red below).
J.Gordon explained that the sites account all the DNs that belong to ALICE as ALICE work, even if users do not connect using the ALICE software. The sites can only account jobs per user DN, and if the user is associated with the ALICE VO the job is accounted to ALICE. Sites do not have any information on the work performed.
Y.Schutz asked how those jobs are submitted because, in principle, it should be impossible.
T.Cass added that the proper CERN accounting counts all the time during which an ALICE job occupies a slot on a WN, while the ALICE accounting system only considers the time from when the job actually starts.
J.Gordon added that the calculation of the SPECint value by ALICE does not match the values of the benchmarks used for accounting.
Ph.Charpentier replied that some sites publish the average benchmark per node, others the smallest benchmark value of their CEs. This can cause a difference of as much as a factor of 3 between accounting and reality.
J.Templon clarified that SARA normalizes CPU time “per node and per job” before publishing it into APEL, and does not use what is in the BDII, which allows only one number per site.
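The effect discussed above, where a single per-site scaling factor gives a different result from a per-node, per-job normalization, can be illustrated with a small calculation. This is only a sketch: the node types, their SI2K ratings and the job list are invented numbers, and the "average" factor is taken as a plain mean over node types, which is an assumption rather than the rule any site actually applies.

```python
# Hypothetical SI2K ratings per node type (illustrative values only).
node_si2k = {"old": 1000.0, "new": 3000.0}

# Hypothetical jobs: (node type the job ran on, CPU hours consumed).
jobs = [("old", 10.0), ("new", 10.0), ("new", 10.0)]


def per_job_normalized(jobs, node_si2k):
    """Normalize each job with the rating of the node it actually ran on."""
    return sum(hours * node_si2k[node] for node, hours in jobs)


def site_factor_normalized(jobs, factor):
    """Normalize all CPU hours with one site-wide scaling factor."""
    return sum(hours for _, hours in jobs) * factor


exact = per_job_normalized(jobs, node_si2k)
average_factor = sum(node_si2k.values()) / len(node_si2k)
with_average = site_factor_normalized(jobs, average_factor)
with_minimum = site_factor_normalized(jobs, min(node_si2k.values()))

print(f"per node and per job: {exact:.0f} SI2K-hours")
print(f"site average factor:  {with_average:.0f} SI2K-hours")
print(f"site minimum factor:  {with_minimum:.0f} SI2K-hours")
```

With these invented inputs the three methods give 70000, 60000 and 30000 SI2K-hours for the same jobs, which shows how the choice of published factor alone can shift the accounted work.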
4.7 ALICE Resources
This is the most critical issue in 2008 for ALICE.
- Requirement 7.4 PB, Allocated 3.3 PB (CERN) + 2.98 PB (at external sites)
- Already used at CERN: 0.6 (MC) + 0.2 (Raw) + 0.2 (not Grid aware); at external sites: 0.09 (MC) + 0.2 (Raw) + unknown (not Grid aware)
Currently there is still potentially available: 2.3 PB (CERN) + 2.68 PB (ext).
ALICE needs until end of the year for raw data and first Physics MC: 3.6 PB.
The margin of 1.3 PB is not sufficient for MC in preparation of run 2009. The problem was anticipated and announced years ago, and deleting data is not a solution for ALICE.
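The storage figures above can be cross-checked with a short calculation. The sketch below uses only the numbers quoted in these minutes and counts the unknown "not Grid aware" usage at external sites as zero, which is an assumption; the variable names are illustrative.

```python
# All values in PB, taken from the figures quoted in the minutes.
allocated_pb = {"CERN": 3.3, "external": 2.98}
used_pb = {
    "CERN": 0.6 + 0.2 + 0.2,   # MC + Raw + not Grid aware
    "external": 0.09 + 0.2,    # MC + Raw; "not Grid aware" is unknown, assumed 0 here
}
needed_pb = 3.6  # raw data and first Physics MC until the end of the year

available_pb = {site: allocated_pb[site] - used_pb[site] for site in allocated_pb}
margin_pb = sum(available_pb.values()) - needed_pb

for site, value in available_pb.items():
    print(f"{site}: {value:.2f} PB still available")
print(f"margin after 2008 needs: {margin_pb:.2f} PB")
```

The computed values are close to the 2.3 PB, 2.68 PB and 1.3 PB quoted above; the small differences presumably come from rounding in the original slides.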
CPU and Disk
In April 2008 only 30% of the pledged CPU power was available or usable by ALICE.
Only 22% of the requested disk capacity is available in WLCG SEs, of which 22% is already used for MC data.
Network from Tier-0 to Tier-1 sites
In 2007 ALICE requested a bandwidth of 60 MB/s, which corresponds to a STYDT (C‐TDR).
In May 2008 the ALICE management decided to take data in 2008 (the slide incorrectly says 2007) at the maximum rate allowed by the DAQ bandwidth: 300 MB/s.
Can the T0 ALICE MS cope with this rate? Is there enough bandwidth left for export? Otherwise the pp data processing strategy might not work.
J.Shiers asked how ALICE can expect to be able to obtain 5 times (300 vs. 60) more bandwidth while the storage is already not sufficient.
Y.Schutz replied that the storage calculations in the section above already have been scaled up to the new requested 5x rate.
J.Shiers noted that, at the agreed rate, the pledges are therefore actually fulfilled; the lack of storage is due to the “5x” increase from ALICE. The CERN MSS can handle 300 MB/s without problems, but this also implies 5x the number of tapes, and this cannot be a decision taken by ALICE in May 2008.
T.Cass agreed and added that at the originally agreed rate there are enough storage resources. The lack of resources is due to this change in the expected data rate.
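To give a sense of scale for the two rates discussed, the sketch below converts a sustained rate into a daily data volume. It assumes continuous data taking for a full day and 1 TB = 10^6 MB; both are simplifications, not figures from the meeting.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_tb(rate_mb_per_s):
    """Data volume in TB for one day of continuous data taking (1 TB = 10**6 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


original = daily_volume_tb(60)    # rate requested in 2007
requested = daily_volume_tb(300)  # rate decided in May 2008
ratio = requested / original

print(f"60 MB/s  -> {original:.3f} TB/day")
print(f"300 MB/s -> {requested:.2f} TB/day")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the daily volume grows from about 5.2 TB to about 25.9 TB, the same factor of 5 that applies to the tape requirement.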
4.8 ALICE Milestones
- MS‐122 Oct 07: FDR Phase II - Done
- MS‐124 Feb. 08: Start of FDR Phase II - Done
- MS‐125 Apr 08: Start of FDR Phase III & CCRC08 - Postponed to June 2008 (delayed because of detector readiness)
- MS‐126 Feb 08: Ready for CCRC 08 - Done
- MS‐128 Jul 08: ready for data taking
5. GDB Summary (GDB_200805) J.Gordon
J.Gordon highlighted the main points of the April GDB meeting.
J.Templon asked why the GDB Summary is still needed: most MB members also participate in the GDB anyway.
A.Aimar and J.Gordon replied that usually the summary only highlights the issues that are relevant to the MB and items to follow further.
The MB members should read the full report but the main points that J.Gordon wanted to highlight were:
- The MB should suggest what the Tier-2 specific GDB should cover.
- The recent security challenges have failed at two Tier-1 sites (CNAF and ASGC) and this should be known by the MB.
L.Dell’Agnello stated that, as commented at the GDB, the security challenge was incorrectly prepared and the alarm was a simple email with subject “This is a test”. He had deleted the message immediately, as he does with all “test - please ignore” emails that merely test mailing lists.
J.Templon respectfully disagreed because only CNAF had this as an issue. The email was only unclear to 1 site out of all 11 sites.
J.Gordon added that two other Italian sites also failed to react to those security emails.
L.Dell’Agnello replied that the security emails are always expected with a clear acknowledged sender and with a subject that is clearly mentioning an incident, not a test.
J.Shiers added that some action should be planned to check again the situation. Maybe a new security challenge for ASGC and CNAF.
- Several JSPG security policies will arrive to the MB for approval
- The Pilot Job WG is making slow progress. The main problem is to combine and complete the gLite/SCAS solution.
F.Hernandez asked how the request to implement an email address accepting signed emails fits with the strategy agreed at the GDB of moving to GGUS for all alarms.
J.Gordon replied that the GGUS solution will take until July to be in place. But an interim solution is needed already now. Hence the page with the emails for posting operations alarms.
F.Hernandez replied that the sites will need to implement scripts to filter the emails just for an interim solution until June and then abandon them.
J.Gordon replied that the sites can implement it as they see fit. The DNs are published on the wiki page and sites handle them as they consider appropriate, automatically or manually. After July only the DNs from GGUS will be allowed to post alarms.
T.Cass agreed with F.Hernandez that the signed list was introduced later. So sites should be free to manage the list as they consider it useful.
J.Shiers noted that this list was actually requested from the sites in January and escalated to the MB because it was not implemented. It is really needed before the GGUS solution appears, and sites should implement it.
R.Tafirout asked that someone summarize what the decision is.
J.Templon volunteered to provide this summary.
1 June 2008 - J.Templon will summarize the proposal of the interim solution for treating Operations Alarms until a GGUS-based solution is available.
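As an illustration of the kind of filtering script mentioned in the discussion above, the sketch below accepts an alarm only when the DN of its signer is on a site-maintained list. It is a hypothetical example, not the agreed procedure: the DNs are invented, the exact-match comparison after trimming whitespace is an assumption, and verification of the email signature itself is not shown.

```python
def is_authorized_alarm(sender_dn, authorized_dns):
    """Return True if the signer's DN is on the list of DNs allowed to post alarms."""
    allowed = {dn.strip() for dn in authorized_dns}
    return sender_dn.strip() in allowed


# Invented DNs for illustration; a site would take the real list from the wiki page.
authorized_dns = [
    "/DC=org/DC=example/OU=Users/CN=Experiment Shifter One",
    "/DC=org/DC=example/OU=Users/CN=Experiment Shifter Two",
]

accepted = is_authorized_alarm(
    "/DC=org/DC=example/OU=Users/CN=Experiment Shifter One ", authorized_dns
)
rejected = is_authorized_alarm(
    "/DC=org/DC=example/OU=Users/CN=Unknown Sender", authorized_dns
)
print(accepted, rejected)
```

Keeping the list as plain data means that it can later be replaced by the GGUS DNs without changing the check itself.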
O.Keeble summarized the progress and plans regarding the porting of the gLite middleware to SL5 worker nodes.
The aim is to port the full gLite software stack to SL5, starting from the WN packages.
The current proposal is to keep the code of gLite 3.1/SL4, updating only a few external packages (VDT 1.10 and maybe Java 1.6).
The work will progress with the other activities already in progress (gLite updates, CREAM, SCAS, Debian, etc).
There is already a web page where the porting is always reported.
6.1 Build Timeline for SL5 Porting
The build strategy has been agreed with ETICS; the next steps are:
- Finalise composition of WN, which includes the definition of “extended platform” (i.e. OS + extras + externals).
- Analyse build, create full list of issues. Most already known because SL5 is similar to CentOS 5, maybe issues with externals.
- Apply all necessary updates to upstream codebase and official build definition. Iterate until we have a
6.2 Testing the WN Target
- Backward compatibility
- Doc preparation
- Release to PPS
- VO testing on PPS, assuming active participation by users
- Finished Sept 19th
- Production release of WN and move onto job submission tools for UI
Ph.Charpentier asked whether bug fixes and features will be implemented on both SL4 and SL5 during the transition period.
O.Keeble replied that after 2 months of “demonstrated SL5 reliability” the SL4 version moves to “security updates only” mode.
Ph.Charpentier added that the change would happen when the Sites and Experiments are in data taking, and that he was therefore not in favour of any changes in August and September.
T.Cass replied that the migration of the CERN facilities would take place only at the beginning of 2009, during the LHC shutdown. And this timescale was agreed at the MB in January 2008.
Ph.Charpentier then asked which compiler is used for the porting to SL5. The Application Area is using gcc 4.3.
O.Keeble replied that the native SL5 compiler will be used (gcc 4.1). For the Applications Area there will be a build of the WN packages with gcc 4.3.
L.Dell’Agnello asked about the strategy on 32 vs. 64 bits.
O.Keeble replied that the initial porting will be to 32 bits; even when running on a 64-bit WN it will run with 32-bit libraries. Both 32-bit and 64-bit libraries will be available.
8. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.
31 May 2008 - Sites and Experiments should complete the Operations Alarms Page.
1 June 2008 - J.Templon will summarize the proposal of the interim solution for treating Operations Alarms until a GGUS-based solution is available.