LCG Management Board
Tuesday 19 August 2008, 16:00-17:00 - Phone Meeting
(Version 1 - 26.8.2008)
A.Aimar (notes), D.Barberis, O.Barring, L.Betev, I.Bird(chair), Ph.Charpentier, M.Ernst, I. Fisk, S.Foffano, F.Giacomini, J.Gordon, A.Heiss, M.Kasemann, D.Kelsey, U.Marconi, M.Lamanna, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, O.Smirnova, J.Templon
Next Meeting: Tuesday 2 September 2008, 16:00-17:00 – Phone Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting
About the minutes of the previous meeting, M.Ernst noted that in section 3.2:
- The trial change of the BNL site name was rolled back, but the change had already propagated without BNL knowing.
- Currently sites have to refresh their FTS cache manually; it would be desirable for this to be automated.
The minutes of the previous MB meeting were approved.
M.Lamanna sent the links for the ALICE, ATLAS and LHCb implementations of the SAM Dashboard.
2. Action List Review (List of actions)
Ongoing. It is installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should announce it.
J.Templon reported that the SCAS server seems to be ready and “certifiable” in a week. The client is
M.Schulz added that a reasonable frequency for a site like CERN is 10 requests/sec; currently it reaches about 4-5 requests/sec.
To be discussed at the MB.
The actions above should be discussed when both M.Schulz and J.Shiers are present.
- A document describing the shares wanted by ATLAS
- Selected sites should deploy it and someone should follow it up.
- Someone from the Operations team must be nominated to follow these deployments end-to-end.
D.Barberis mailed a status summary (by S.Campana and M.Lamanna) before the meeting.
J.Templon stated that NL-T1 is also willing to participate in these tests.
The MB agreed that this action will be verified again at the next MB meeting in two weeks.
2.1 New Actions Proposed
J.Templon added an action, proposed by the HEP Benchmarking group, about converting requirements and pledges to the new CPU unit.
J.Gordon replied that the issue was going to be discussed at the GDB.
Ph.Charpentier added that the list of packages and the upgrade procedures for the Worker Node should be discussed beforehand and then agreed at the GDB.
A mail from O.Keeble contained the link to the proposed procedures. https://twiki.cern.ch/twiki/bin/view/EGEE/ClientDistributionProposal
M.Barroso is collecting information from the site managers.
3. Weekly Report (Slides)
H.Renshall presented a summary of status and progress of the LCG Operations. This report covers the last two weeks.
The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
Not all sites participate in the meetings. There is systematic remote participation by BNL, RAL and PIC (and increasingly NL-T1). The other sites are encouraged to participate more as the holiday season ends and LHC start-up approaches.
3.1 Sites Reports
Following the migration of the VOMS registration service to new hardware on 4 August, new registrations were not synchronised properly until late on 5 August due to a configuration error (the synchronisation was pointing to a test instance). For two days no registrations were recorded; however, only one dteam user was affected.
The CASTOR upgrades to level 2.1.7-14 were completed for all the LHC experiments.
Installation of the Oracle July critical patch update has been completed in most production databases with only minor problems. The upgrade was transparent to the end users.
Many separate problems/activities have led to periods of non-participation in Experiment testing (migration from dCache to CASTOR, upgrade to CASTOR 2.1.7). Some delays were due to experts being on vacation (SRM). There were several disk-full problems (stager LSF and a backend database).
J.Gordon explained that the “disk full problem” was due to the ATLAS instance of CASTOR generating an unsustainable number of queries and an unforeseen, huge growth of the log files. It seems due to some retries within CASTOR and was reported to the developers.
The long standing problem of unsuccessful automatic failover of the primary network links from BNL to TRIUMF and PIC to the secondary route over the OPN is thought to be understood and resolved. CERN will participate in some further testing.
On 14 August it was found that user analysis jobs doing excessive metadata lookups caused a PNFS slowdown, which in turn caused SRM put/get timeouts and hence low transfer efficiencies. This could be a potential problem at many dCache sites.
On Friday 8 August PIC reported overload problems of their PBS master node, with timing-out errors affecting the stability of their batch system. The node had been replaced on the Tuesday, so they decided to revert to the previous master node (done Thursday evening), but this also showed the same behaviour. They suspect issues in the configuration of the CE that forwards jobs to PBS and scheduled a 2-hour downtime on the following Monday morning to investigate.
There was a problem with LFC replication to GridKa for LHCb from 31 July until 6 August. Replication stopped shortly after the upgrade to Oracle 10.2.0.4: propagation from CERN to GridKa would end with an error (TNS error: connection dropped). At the same time the GridKa DBAs reported several software crashes on cluster node 1, and since 6 August GridKa cluster node 2 has been used as the destination. Further investigations are being performed by the GridKa DBAs to see whether the network/communication issues with node 1 are still present.
LFC 1.6.11 - which fixes the periodic LFC crashes - has passed through the PPS and will be released to production today or tomorrow.
A partial fix was received for the VDT limitation of 10 proxy delegations; it makes WMS proxy renewal usable again by LHCb and should help ALICE, but it is not the final solution.
Ph.Charpentier asked whether the VDT patch has been deployed.
H.Renshall will report on the deployment of the patch outside the meeting.
F.Giacomini reported that the problem about the proxy mix-up has been understood but not fixed yet.
Ph.Charpentier noted that this is a real issue because users cannot submit jobs using different roles.
3.2 Experiments Reports
Nothing special to report.
Have published to EGEE the requirement on sites to support multi-user pilot jobs – basically to support a new role=pilot in the YAIM configuration file.
Several CERN VOBoxes suffered occasional (every few days) kernel panic crashes due to a known kernel issue, now fixed with this week’s kernel upgrades.
An issue has arisen (ongoing at NL-T1) over the use of pool accounts for the software group management (sgm) function. These require the VO software directory to be group-writeable. If the sgm pool accounts are in a different Unix group to the other LHCb accounts, the directories would also have to be world-readable. This currently affects some SAM tests.
J.Templon noted that the solution implies world-readable directories, and many site administrators have concerns about that. Some VOs could have the same concern about having their applications, configuration and data files visible world-wide.
Ph.Charpentier noted that the issue exists at many sites and that for LHCb it is not an issue. He reminded the MB that, while NL-T1 has added some scripts to fix the issue, it should instead be fixed in the general pool account solution.
J.Templon proposed that this issue be discussed at the next GDB meeting.
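The permissions trade-off under discussion can be sketched in shell terms. This is a minimal illustration only: the path is hypothetical, and the caller's own group stands in for the VO's sgm group (e.g. 'lhcbsgm' at a real site).

```shell
# Illustrative sketch of the sgm pool-account permissions issue
# (hypothetical path; uses the caller's own group in place of a
# dedicated sgm group such as 'lhcbsgm').
SW_AREA="$(mktemp -d)/lhcb"
mkdir -p "$SW_AREA"
chgrp "$(id -gn)" "$SW_AREA"   # at a site: chgrp lhcbsgm "$SW_AREA"
chmod 2775 "$SW_AREA"          # setgid + group-writeable + world-readable
stat -c '%a' "$SW_AREA"

# Ordinary VO accounts that sit in a *different* Unix group can only
# read the software area through the world-read bits above; tightening
# the mode to 2770 would hide the area from them and break the SAM
# tests that run under those accounts.
```

The setgid bit (the leading 2) keeps files created by any sgm pool account owned by the shared group; the world-read bits are what the site administrators object to.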
During the last two weeks the pattern of a 1.5-day global cosmics run from Wednesday to Thursday continued.
A failure of the network switch to building 613 at 17.00 on 12 August stopped the AFS-cron service, which stopped some SAM test submissions and the flow of monitoring information into SLS. It was not fixed until 10.00 the next day.
Have been preparing for their CRUZET4 cosmics run, which in fact is not at ZEro Tesla but with the magnet on. The run started on 18 August and will last a week. For this they requested a new CERN elog instance, which exposed a dependency of this service on a single staff member currently on holiday.
Preparing 21-hour computing support shifts coordinated across CERN and FNAL, with the remaining 3-hour period to be covered by ASGC.
Over the weekend of 9/10 August ran one 12-hour cosmics run with only one terminating luminosity block. This split into 16 streams, of which 5-6 were particularly big, resulting in unusually high rates to the receiving sites (handled successfully).
The AFS-cron service failure at 17.00 on 12 August stopped functional tests until 10.00 the next day, due to an unexpected dependency.
During the weekend of 16/17 August the embedded cosmics data type name in the raw data files was changed to magnet-on without warning the sites, resulting in some data going into the wrong directories.
Now performing 12-hour throughput tests (at full nominal rate) each Thursday from 10.00 to 22.00, with typically a few hours to ramp up to and down from the full rates. Outside this activity ATLAS is running functional tests at 10% of nominal rate, and cosmics at weekends and as/when scheduled by the detector teams.
ATLAS will start 24x7 computing support shifts at the end of August.
4. Approval of security policy documents (More Information) - D.Kelsey
D.Kelsey, representing the JSPG, asked for the approval of the four documents below:
1. Virtual Organisation Operations Policy (See Version 1.6) https://edms.cern.ch/document/853968
2. Grid Security Traceability and Logging Policy (See Version 1.9) https://edms.cern.ch/document/428037
3. Approval of Certification Authorities (See Version 2.8) https://edms.cern.ch/document/428038
4. Policy on Grid Multi-User Pilot Jobs (See Version 0.6) https://edms.cern.ch/document/855383
I.Bird noted that the only comments received were from OSG and he asked R.Pordes to present them to the MB.
R.Pordes explained that the current practices in OSG are consistent with the documents above, except for the traceability and logging policies.
For the other documents the OSG expects to be soon consistent with the JSPG proposals.
Concerning identity switching (for the multi-user pilot jobs), R.Pordes added that, as ATLAS and CMS approved it at the MB, once they request it from the OSG facilities representatives the sites will have to implement it.
D.Kelsey noted that the document does not require sites to run multi-user pilot jobs; it only specifies that, once VOs and sites agree on it, the only way to implement it is the one specified in document No. 4 above.
I.Fisk added that the policies are also correct for OSG: when implemented, they would follow the agreed guidelines. But OSG cannot implement the traceability and logging policy immediately.
D.Kelsey confirmed that the compliance does not have to be immediate but planned.
The MB approved the proposed policies. The sites not fully complying should provide a schedule by when they will achieve compliance.
5. End User Analysis - B.Panzer
B.Panzer reported that the working group on End User Analysis is reviewing the document he had distributed, which includes scenarios, architecture and boundary conditions at CERN.
About half of the comments were received and are being discussed. They are on a wiki page but will not be distributed until he is sure they represent the views of the Experiments (in particular the comments from CMS).
M.Kasemann confirmed that the wiki page can be distributed to the MB. No comments from the other Experiments.
B.Panzer will distribute the address of the wiki page with the proposal and comments received about End User Analysis.
I.Bird proposed that the working group report at the F2F meeting on 9 September 2008.
6.1 LHC Grid Fest
I.Bird presented the upcoming LHC Grid Fest event. The document above is the schedule of the event.
This event is outside the MB scope but MB members could be involved.
The Slides show the logo of the LHC Grid Fest (slide 1).
6.2 New WLCG Logo
This is also the only opportunity to change the WLCG logo, because it will then be included in all brochures and handouts given to the press and media.
Slide 2 contains several proposals for a new WLCG logo.
I.Bird asked the opinion of the MB about changing the logo or keeping the current one permanently.
The MB briefly discussed the implications of changing the logo and decided NOT to change the current WLCG logo.
6.3 GDB for Tier-2 Sites - J.Gordon
J.Gordon reported that many Tier-2 sites are asking for advice about hardware to purchase, but this may be outside the GDB mandate, though it could be a specific topic in the future. There are, however, obvious topics which differ in substance or scale between the Tier-2s and the Tier-1s, on which much discussion has focussed at the MB and GDB:
- Communication to many more sites/people
- User support for the bulk of users, not the experts.
- Middleware - which versions and deployment methods
- Monitoring to improve reliability: this has been covered a lot recently, but perhaps an example from a Tier-2 with remote monitoring and alarming could be presented.
Experiments could present their models for the usage and configuration of their Tier-2 sites and discuss possible issues.
- Do the experiments interfere with each other at sites which support more than one? Either in middleware versions or in support paths?
D.Barberis noted that 10 September is not a good date because of the beam start-up and the Grid Fest.
I.Bird and J.Gordon proposed to move the GDB meeting about Tier-2 sites to October.
7. Summary of New Actions
- B.Panzer will distribute the address of the wiki page with the proposal and comments received about End User Analysis.
- Converting Experiments’ requirements and Sites’ pledges to the new CPU units.
- Agree on the software distribution and update procedure at the Tier-1 Sites.