LCG Management Board

Date/Time: Tuesday 24 June 2008, 16:00-17:00
Agenda
Members
(Version 1 - 27.6.2008)
Participants: A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, J.Casey, Ph.Charpentier, D.Collados, L.Dell’Agnello, A.Di Meglio, F.Donno, M.Ernst, I.Fisk, S.Foffano, F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, U.Marconi, A.Pace, R.Pordes, Di Qing, Y.Schutz, J.Shiers, R.Tafirout
Action List
Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/
Next Meeting: Tuesday 8 July 2008, 16:00-17:00 – F2F Meeting

1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2 Follow-up to the LHCb Request of Using Pilot Jobs

As agreed during the previous MB meeting, Ph.Charpentier distributed a note about LHCb multi-user pilot jobs. Here is the text:

J.Gordon asked where LHCb plans to run its pilot jobs.

Ph.Charpentier replied that LHCb plans to run pilot jobs at all its Tier-1 Sites.

J.Gordon reminded the MB that the security “Multi-User Pilot Jobs” document should be approved by the MB in the next few weeks, as requested by the JSPG. Sites and Experiments agreed on the policy and should follow it.

I.Bird replied that this request is for 2 weeks in July only and is not a permanent change of policy. LHCb asked to do some time-limited testing.

Ph.Charpentier noted that the request is for the Tier-1 Sites only, but all the Tier-1 Sites are needed for the test. He pointed out that one (or two) other VOs are already doing what LHCb is asking for and nobody is actually complaining.

A.Heiss, for FZK, added that only a few German sites are running LHCb jobs and therefore it should not be a major issue.

F.Hernandez agreed that for the IN2P3 Tier-1 there is no problem. For activities on the French Tier-2 sites it is also likely to be accepted (but he will have to check and confirm).

L.Dell’Agnello stated that he will also have to verify the possibility for the CNAF Tier-1 and for the Italian Tier-2. He will mail the answer to the MB list.

I.Bird recommended that, also considering how LHCb has always provided all information requested, the WLCG MB should accept their request and that the issue be solved before next week.

New Action: 1 July 2008 - The MB should decide on LHCb’s request to use multi-user pilot jobs for testing DIRAC in July.

2. Action List Review (List of actions)
Will be discussed at the LHCC mini review next week.
Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm.
The only information still missing is the CMS list of 4 user DNs that can post alarms to the sites’ email address.
Ongoing. H.Marten reported that the follow-up to the previous MB meeting, about ALICE accounting, has started.
J.Casey was asked, but the information has not been received yet.

3. LCG Services Weekly Report (Slides) - J.Shiers
J.Shiers presented a summary of the LCG Services activities.

Here is the link to the minutes of the last meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek080616

Below are the versions of the MSS software. In bold are those released, in normal text those not yet released.

3.1 Sites Issues

A few sites seem to have problems with power and cooling. There were two incidents this week; as agreed during CCRC, a post-mortem should be produced every time.

L.Dell’Agnello added that the power has been back since Sunday and the systems are being powered up. CNAF actually took the opportunity to do a planned update of some equipment. All the resources should be back now and the UPS will be in place soon.

J.Shiers asked that the sites produce a short report to the MB in case of such incidents.

J.Gordon clarified that so many jobs cause a timeout on the information provider. It is not a problem of the RAL batch system, because it also happened at other sites.

3.2 LCG Services Issues

Database Services - (see slide 5 for details) there were some problems with the cabling of the RAC6 Ethernet switches and with GridView. In addition, the SRMs for the different Experiments were moved to different RACs. In order to isolate the Castor SRM service from potential problems on the Castor stager backend, the change will consist in lowering the stager timeout. This should prevent the exhaustion of the SRM thread pool that was observed during CCRC.

GridView - some instability of the service with respect to Oracle Clusterware. The DBAs recreated the service definition and the situation looks better. A service request was opened with Oracle. There are some extremely long transactions and a review of the DB schema has been launched.

Monitoring / dashboard report: Two issues after the GridView upgrade
- The TNS was slightly out of date, so the services had to be restarted.
- A query has stopped working and will be fixed tomorrow, so there will be a few hours when the summary data for availability will not be updated.
- CMS dashboard – performance / load problems. Working with the DBAs to identify slow / heavy queries. A specific query felt to have been the cause of some such problems is now using an index and runs much faster. A review has been proposed by the DB team and should be scheduled as soon as possible.

3.3 Other Services

VOMS service issues - FIO does not have a procedure, since it is basically impossible to measure when the service is not available: there is no monitoring agent that is a member of the VO, and it seems hard to work out externally, without being in the VO, whether the service is working.

# threads for LHCb LFC - increased to 60, taking advantage of the DB intervention.

LFC - VOBOXes of LHCb - to be recognized as a host certificate by the LFC, a VOBOX DN must end with CN=myhostname or CN=host/myhostname. So the host certificate (just renewed) for the LHCb VOBOX at NIKHEF is incorrect.

3.4 Experiments Issues

CMS – fairly quiet week. Transfer problems to FZK reported Tuesday; CMS SAM SE testing (CASTOR@CERN) failing with write permission denied (Friday); summary of CRUZET-2 (ended last Sunday) reported Wednesday (100% job success at Tier0, some issues in RECO step – unsuppressed zeroes in ECAL giving very large reconstructed files). iCSA08 production finished for all workflows, cleaning up un-needed datasets. HI production runs at Tier1s. Some srm-cms ‘misbehaviour’ seen well before the announced intervention time. All problems disappeared afterwards but it was not clear what happened.

ALICE – only occasional FTS testing – not enough Tier1 disk space to fully stress sending custodial raw during the commissioning exercise. First pass ESD at CERN for transmission to Tier1s for eventual reprocessing.

LHCb – restart CCRC activities with ‘intermediate’ version of DIRAC. NIKHEF & IN2P3 to support dcap instead of gsidcap. More info in elog LHCb observation.

F.Hernandez added that the requested change also affects CMS. IN2P3 would like to understand why gsidcap is not working properly. They are preparing some tests. After the tests they are ready to move to dcap, if necessary.

ATLAS – NDGF migrating from RLS to LFC (Wed). Clarification of the naming scheme to be used from now on – data deleted after 8 days. Naming convention for the ATLAS Functional Test datasets: ccrc08_run2.0XXYYY.physics_blabla, where XX is the week number (this week, which started Monday at 0:00, is 25) and YYY is sequential. This is just to give the information that we are going, starting tomorrow, to delete the ccrc08_run2.024yyy.blabla data. For the sites this operation is transparent.

J.Gordon asked that the LHCb QR make clear that the problem with the CA is not related to RAL: it is a problem that depends on the procedures and on the way certificates can be changed in the middleware. This will happen to other sites because there is no alternative method.
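
The ATLAS Functional Test naming convention quoted in section 3.4 (ccrc08_run2.0XXYYY.physics_blabla, with XX the week number and YYY a sequence number) can be illustrated with a small parser. This is only a minimal sketch and not ATLAS tooling: the function names, the sample dataset names and the rule "select everything from weeks before the current one" are assumptions made for illustration. The minutes only state that the week-24 data are being deleted during week 25.

```python
import re

# Names of the form ccrc08_run2.0XXYYY.<suffix>, as described in the minutes:
# XX = two-digit week number, YYY = three-digit sequential number.
DATASET_RE = re.compile(
    r"^ccrc08_run2\.0(?P<week>\d{2})(?P<seq>\d{3})\.(?P<suffix>.+)$"
)


def parse_dataset(name):
    """Return (week, sequence, suffix), or None if the name does not match."""
    match = DATASET_RE.match(name)
    if match is None:
        return None
    return int(match.group("week")), int(match.group("seq")), match.group("suffix")


def datasets_to_delete(names, current_week):
    """Hypothetical cleanup rule: select datasets from weeks before current_week."""
    selected = []
    for name in names:
        parsed = parse_dataset(name)
        if parsed is not None and parsed[0] < current_week:
            selected.append(name)
    return selected


# Made-up sample names, used only to exercise the functions.
sample = [
    "ccrc08_run2.024001.physics_A",
    "ccrc08_run2.024017.physics_B",
    "ccrc08_run2.025003.physics_A",
    "some_other_dataset",
]
print(datasets_to_delete(sample, current_week=25))
```

With the sample list and week 25, only the two week-24 names are selected; names that do not follow the convention are ignored rather than treated as errors.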

4. Preparation of the LHCC Review (HL Milestones) - I.Bird
I.Bird asked for a final verification of some HL Milestones before the LHCC Review next week.

He asked that all sites reply to J.Gordon’s questionnaire, clarifying exactly the percentage of pledges installed and when the installation will be completed.

24x7 Support and VO Boxes Support - the milestones are late and have not been progressing for a long time.

FZK (A.Heiss) and IN2P3 (F.Hernandez) reported that they had administrative procedures to fulfil and that it could not be done sooner.

T.Cass reported that for CERN the procedures are implemented (i.e. 07-05 green) but the Experiments have not formally approved them (i.e. 07-05b still red).

The Definition of the CAF by the Experiments should be completed.

ATLAS (D.Barberis) and LHCb (Ph.Charpentier) declared the CAF definition completed for their Experiments.

I.Fisk said that CMS is discussing it in their workshop this week, and will define it in the next few weeks.

This is the dashboard, updated after the meeting:

5. Status of OSG RSV test and Equivalence to SAM (RSV_Crit_Tests; WLCG-OSG) - R.Pordes, D.Collados
D.Collados proposed a set of critical tests for OSG (RSV_Crit_Tests).

The existing critical tests are described at: https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes

5.1 OSG CE Tests

The list of critical tests (all OSG CE tests) is:
- org.osg.certificates.crl-expiry (check if CRLs are still valid)
- org.osg.general.osg-directories-CE-permissions
- org.osg.general.osg-version (version of OSG CE running)
- org.osg.globus.gridftp-simple (a test file is gridftp'ed to the remote host, and back. Then, files are checked for consistency)

The proposed list of critical tests for OSG-CEs has 4 new tests and one removed:
- org.osg.certificates.crl-expiry (check if CRLs are still valid)
- org.osg.general.osg-directories-CE-permissions
- org.osg.general.osg-version
- org.osg.certificates.cacert-expiry: check if CA certs are still valid
- org.osg.general.ping-host: check if CE responds to pings
- org.osg.globus.gram-authentication: authenticate to remote CE using GSI
- org.osg.batch.jobmanager-default-status: check status of default job manager. This one will only be available in early July.

These are equivalent to EGEE tests in terms of functionality, but will be executed from the WN in the next OSG releases (using jobmanager-fork at present).

F.Hernandez asked whether the EGEE test checking that the sites accept the recommended CA is missing.

J.Casey agreed that this test is missing and will check it and report to the MB.

New action: J.Casey to report whether the EGEE test checking that the sites accept the recommended CA is critical and, if so, implement it for OSG.

5.2 OSG Gridftp Tests

Proposed list of OSG-Gridftp critical tests:
- org.osg.globus.gridftp-simple: a test file is gridftp'ed to the remote host, and back. Then, files are checked for consistency.

It is equivalent to the EGEE ‘CE-sft-lcg-rm-cr & CE-sft-lcg-rm-cp’ tests. It will be even closer to the EGEE ‘SRMv2-lcg-cp (lcg-cp --nobdii)’ test that will be out in the coming weeks. In this case the OSG test is ahead of the existing EGEE one.

5.3 OSG SRM Tests

Proposed list of OSG-SRMv1/v2 critical tests:
- org.osg.srm.srmping: check if SRM server is responding to srm-pings
- org.osg.srm.srmcp-readwrite (srmcp a file to and from a storage element using the srm protocol)

These are equivalent to the EGEE SRMv1 tests (SRM-put, SRM-get, SRM-advisory-delete).

5.4 Previously Found Issues

There is now a single source of information for OSG resources (OIM list), http://oim.grid.iu.edu/publisher/get_osg_interop_monitoring_list.php, and all OSG services are now published and automatically stored in the SAM Database.

The GridView database is ready to accept OSG downtimes in production mode. End-to-end tests are successful.

I.Bird concluded that, provided that the verification about the CA tests (mentioned above) is positive, the OSG RSV tests should be considered equivalent to the existing SAM tests.

5.5 Status of the OSG RSV Milestones

R.Pordes reported the status of the OSG RSV milestones present in the HL Milestones dashboard.

WLCG-08-01b Jun 2008 RSV: Tier-2 SE Tests Equivalent to SAM; Successful WLCG verification of OSG test equivalence of RSV tests to WLCG SE tests
- Work completed pending approval by the WLCG board?

WLCG-08-02 Jun 2008: OSG Tier-2 Reliability Reported: OSG RSV information published in SAM and GOCDB databases.
- Work completed pending approval by the WLCG board, from RSV through to GridView. Reliability reports include OSG Tier-2 sites.
- Excel output from GridView being checked by OSG against local RSV reports; it is being reviewed by US ATLAS and US CMS management.

I.Bird asked whether this validation of GridView is only for June.

R.Pordes replied that it is only a one-time verification, in order to replace the N/A with real reliability data in the future.

OSG also requested that:
- The monthly WLCG report (PDF file) from GridView with OSG Tier-2s be distributed to OSG as early as possible at the beginning of the month. They will verify it and reply whether it can be included in the official June reports.
- The OSG reliability data are needed also inside OSG. They have asked GridView for an API or an XML feed to retrieve the reliability data online.
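
The difference between the existing and the proposed OSG CE critical test lists in section 5.1 can be checked mechanically. The sketch below is only an illustration that uses the probe names as listed in these minutes; the variable names are illustrative and nothing here is part of RSV or SAM.

```python
# Probe names copied from section 5.1 of these minutes.
existing_ce_tests = {
    "org.osg.certificates.crl-expiry",
    "org.osg.general.osg-directories-CE-permissions",
    "org.osg.general.osg-version",
    "org.osg.globus.gridftp-simple",
}

proposed_ce_tests = {
    "org.osg.certificates.crl-expiry",
    "org.osg.general.osg-directories-CE-permissions",
    "org.osg.general.osg-version",
    "org.osg.certificates.cacert-expiry",
    "org.osg.general.ping-host",
    "org.osg.globus.gram-authentication",
    "org.osg.batch.jobmanager-default-status",
}

# Set differences give the probes that were added and removed.
added = sorted(proposed_ce_tests - existing_ce_tests)
removed = sorted(existing_ce_tests - proposed_ce_tests)

print("new probes:", added)
print("removed probes:", removed)
```

The comparison yields four new probes and one removed probe (org.osg.globus.gridftp-simple, which appears in the Gridftp list of section 5.2), consistent with the "4 new tests and one removed" stated in section 5.1.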

6. Dynamic Megatable Proposal (Slides) - F.Donno
Postponed to next meeting (F2F Meeting, 8 July 2008).

7. AOB
No AOB.

Summary of New Actions
New Action: 1 July 2008 - The MB should decide on LHCb’s request to use multi-user pilot jobs for testing DIRAC in July.
New action: J.Casey to report whether the EGEE test checking that the sites accept the recommended CA is critical and, if so, implement it for OSG.