Minutes from the WLCG SCM Status Meeting
Instructions for the minutes taker.
- Update this page with the minutes, enclosing them in START/ENDLCGSCM tags as below.
- Visit LcgScmStatus, press the PDF button, and save the file using the same naming scheme as the existing files at the foot of that page.
- Attach the PDF, selecting the option to add a link to the page. This link will appear at the foot of LcgScmStatus. Tidy up the formatting of this automatic addition as required.
- In Indico, add a link to the PDF on the agenda as the minutes.
Oct 7th 2009
Present
SteveT (minutes), Gavin, Jamie (chair), Maria, Andrea, Harry, Sophie, Riccardo, Antonio, Olaf, Eva, Miguel, Maarten, JanI
Introduction
- There has not been a meeting for a while partly because things are running smoothly
- Most of the LHC is at 2 or 3 degrees - something will happen soon.
LCG Quarterly Report
WLCG Service Reports
See above for individual reports.
Production Services
Production release, small but mega important for CREAM - 3.1 update 56 fixes two high-risk vulnerabilities and some outstanding bugs in CREAM.
- One for accounting: records are now filled in when pbs_server and the CREAM CE are on split nodes.
- Other functional issues, mainly from CERN.
- New BLParser for LSF.
Future releases go to PPS tomorrow, with the corresponding release to production presumably next week. We are now moving to a new process, the staged rollout. First will be VOMS and the GLUE 2.0 enabled BDII. It is unclear whether the new VO-Box will go with the old or the new process.
- question
- What is the staged roll-out process?
- answer
- Releases go essentially to production first; a club of sites will be early adopters. Interested sites can take part.
- question
- Is the cream CE now as good as the lcg-ce?
- answer
- Certainly one item is fixed, but it needs testing on the submission side. This is the first release to be considered production rather than special. There are now SAM tests in place.
Overnight, 8 CREAM CEs were updated, with around 11 (that publish) still to go. See Current Status of cream CEs.
- question
- Will there be a WMS release?
- answer
- There is a minor release on the way.
- Maarten
- List of minimum versions will be revamped shortly.
- Olaf
- Following releases as they go is not always easy.
- Jamie
- Hopefully better in an SSC world.
- Olaf
- Yesterday's CREAM CE release may also contain config changes that slow deployment.
- Antonio
- The CREAM CE deployment is small and the current one is broken anyway. This release also includes the rpath RPM fixes, which allowed time for the changes to have extensive certification already.
- Jan
- A workaround for current deployments is always appreciated.
- Antonio
- Security fixes do in other cases block the addition of new features, e.g. the rpath fixes.
Data Management
See Data Management Services above.
Castor
- Miguel
- The main new CASTOR feature is throttling, particularly interesting to CMS and ATLAS.
- Harry
- What is the latest date for this change?
- Miguel
- Feature not released, due for release first week of November.
- Jamie
- That is late, so should it wait until the shutdown at Christmas?
- Miguel
- We want to install on the non-LHC instances and be ready for LHC VOs if they demand it.
- Jamie
- The gap in CASTOR version between Tier0 and the Tier1s is widening.
- Miguel
- Up to the Tier1s.
- Jamie
- Does it mean that the old version will no longer be supported?
FTS
The SRM interaction does change, hence the need for scale testing, which is now under way.
- Jamie: Does ATLAS need the new FTS with checksum support at all Tier1s?
- Gavin: ATLAS are certainly interested in having it.
- Jamie: It seems difficult to imagine Tier1s doing this before Christmas; the model was to run the new version here for a month anyway.
LFC
CMS are not using their production LFC, so it will be merged into the general LFC. CMS are happy, but this needs some wider advertisement.
Database
See report above.
- Migration is to RHEL4 despite validation on RHEL5, as problems were found there.
Workload Management
See report above.
- Two new aliases for slc4/5 lxplus. Encourage users to move in this direction.
- Harry
- Are there too many slc4 CEs over slc5 CEs?
- Riccardo
- Work in progress; new CE nodes are being deployed for slc5.
- Maarten
- Newer lcg-ces are better anyway from a load point of view.
- Harry
- How is the doubling of WN capacity going? Will it be there in October?
- Olaf
- Delivery is delayed at the moment, but FIO is ready to deploy when it arrives.
Authentication
- See voms report above
- MyProxy - trying to remove myproxy-fts from production service.
- On the horizon: the VOMS-enabled version of MyProxy.
AOB
How often do we meet? Every two weeks - usual clashes with GDB/F2F SSC/TMB.
4th November for sure, and possibly in two weeks' time.
May 27th 2009
Present
Nick (minutes), Jamie, Steve, Harry, Antonio, James, Maria D., Sophie, Gavin, Diana, Ricardo
Outstanding Issues & Actions
Other changes in the pipeline
LCG Service Review
Certification / Pre-production
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status).
- Jamie: Is there anything here that is needed for STEP 09?
Antonio: The ICE fix so that it works with the current production version of the CREAM CE would have been nice but we shouldn't rush the release of the gLite 3.2 WMS for that. The ICE fix should have been in a different patch, but it isn't.
Data Management
Castor/SRM
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Castor_SRM).
- Jamie: Do these updates really need to be applied before STEP 09?
Gavin: Not sure about that - but it's already in progress.
Jamie: We seem to be losing the period of stability before large-scale exercises! Experience tells us that this is needed.
- Jamie: There is a switch intervention on 2nd June which could possibly affect all IT/DES managed databases. The intervention should be transparent.
FTS
Nothing to report.
LFC
Nothing to report.
Databases.
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Database_Services).
Workload Management
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Workload_Management).
- Harry: For information: Have been asked to give 2000 SL5 job slots to beam groups for 3 months.
- Steve: We should tell all other sites how we do the disabling of SELinux heap exec check.
- Antonio: There are patches about to arrive in the PPS for GFAL and lcg-utils which are to fix a problem only seen at CERN by LHCb. We have been asked to treat them as "top priority", which involves a lot of extra work. Once this is available, would it be installed immediately at CERN? Is CERN aware of significant problems caused in production services by the issue fixed by this patch?
Ricardo: Not aware of any big problems.
Authentication and Authorisation
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Authentication_and_Authorisation).
Monitoring
Nothing to report.
A.O.B.
- Harry: There is a Linux upgrade due on 15th June. This should be delayed if possible.
- Sophie: I had scheduled to upgrade the AFS UI on 2nd June. Should I postpone this?
Nick: Is this version already available in production labelled as the "New" version (as opposed to the Default)?
Sophie: Yes.
Nick: Then this should be delayed.
April 29th 2009
Present
Nick (Minutes), Jamie, Steve, Harry, Andrea, Olof, Antonio, Eva, James, Maria, Ewan
Outstanding Issues & Actions
Other changes in the pipeline
- STEP 09: No very firm dates yet. Setup in May - execution in June. Still striving to get ATLAS and CMS to carry out the same activities at the same time. Preferably including significant user analysis.
ATLAS’ plans for STEP 09 can be found here: https://twiki.cern.ch/twiki/bin/view/Atlas/Step09
Deployment Status
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status).
- DPM will be released to production within 2 weeks. It contains a bug but this only affects 3 sites.
- Jamie: What is critical for STEP 09?
Antonio: Only DPM.
Data Management
FTS
Nothing to report.
Castor/SRM
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Castor_SRM).
- Jamie: So this will be in place for STEP 09?
Olof: Yes.
- Sophie: ATLAS want LFC with bulk methods deployed so they can test it, but we can't do this easily from the PPS repository.
Antonio: this will be available in a "preview" repository by the end of the week. Contents and documentation will look like a production release.
LFC
Nothing to report.
Databases.
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Database_Services).
- Steve: Automatic notifications for VOMRS problem have stopped - might already be fixed?
- Eva: Security updates will be rolled out in 2 weeks.
Workload Management
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Workload_Management).
- gLite update 44 broke ICE. There have been frequent interventions needed for the WMS. The 3.2 WMS is supposed to fix lots of these problems, but we wait and see. It is unlikely that the WMS will be in a steady and stable state before STEP 09.
Authentication and Authorisation
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Authentication_and_Authorisation).
Monitoring
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Monitoring_Logging_and_Reporting).
- Getting increased e-log usage from CMS so they are discussing doing this for themselves.
- Jamie: How many sites are using “Site View”?
Tier-1s have been asked to comment but don't know of any other sites using this. Probably needs promoting. [Who will do this?]
A.O.B.
- Jamie: What is the typical number of piquet call-outs for DM? Gavin said he thought it was about 1 per week (including all hours). Is there a log anywhere?
Olof: Yes. Can also see out-of-hours through EDH (compensation requests).
Jamie: Would be interesting to monitor this through May and June.
- Harry: There has been a formal request from ATLAS to provide better CERN twiki support. There has also been a request from LHCb for better CVS support.
Jamie: These were in the original lists of “Critical Services” for the experiments, but not with high priority (10 or 11).
Harry: The IT SRM (Service Review Meeting) will probably look after these.
- Nick will update the Baseline Services wiki page (linked on the agenda) for STEP 09.
Next meeting will be May 20th
November 5th 2008
Present
Nick (Minutes), Antonio, Sophie, Olof, Miguel, Harry, Steve, Ulrich, Jan
Deployment Status
See report.
- Harry: Any news on when PPS expects to see the WMS ICE interface?
Antonio: A patch has been built but not yet released to certification.
- Harry: What about the list of criteria for replacing LCG CE with CREAM CE?
Nick: This has been sent out to a wide audience. Waiting a few days for feedback, then will submit to the EGEE TMB for approval.
Data Management
See report.
FTS
See report.
Castor/SRM
See report.
LFC
See report.
Databases.
See report.
Workload Management
See report.
- Nick: Does LHCb still need the RB?
Harry/Sophie: Will check. [ACTION]
Authentication and Authorisation
See report.
Monitoring
See report.
A.O.B.
Next meeting will be 19 November
October 1st 2008
Present
NickT (Minutes), Jamie, Harry, Gavin, Steve, Louis, Miguel, David, Sophie, Antonio
Deployment Status
See report.
- STOP PRESS: Problem found last night in the SL4 FTS service running at CERN. log4cpp was seen to seg fault when an error message is very large. Restarting the daemon just causes it to try to process the error again, so it seg faults again. Gavin will keep the CERN service running, but this version should not be released generally. Paolo is working on a fix.
UPDATE: gLite will likely patch log4cpp and re-release it (since this bug will potentially affect any other gLite software that makes use of the library).
Data Management
See report.
FTS
See report.
Castor/SRM
See report.
LFC
See report.
Databases.
See report.
- Jamie: So CMS run the hardware.
Miguel: Yes, and we don't have a 24x7 contact at the pit to get someone to, for example, restart boxes there.
Harry: Do you have anyone for LHCb?
Miguel: Not at the pit.
Jamie: Will write something generic to request proper contact details.
Workload Management
See report.
Authentication and Authorisation
See report.
- NA48 are now fully sorted out.
- Will add updates to pilot service for US to be able to test.
- Strange problem seen (at a low level) where expiration of the UK CA is causing rejection of random, non-UK users. Under investigation. May be related to Oracle problem ORA-942.
Monitoring
See report.
A.O.B.
- No official update on LHC restart. Jamie has requested the MB to have an official communication channel.
- Very probably there will be a CCRC '09 - exact timing to be decided.
Next meeting will be in 2 weeks.
September 17th 2008
Present
NickT (Minutes), Jamie, Harry, Gavin, Olof, Jan, Steve, James, Miguel, Sophie, Antonio, Ulrich
Deployment Status
See report.
Data Management
See report.
FTS
- Nick
- Have we heard from BNL or Fermilab regarding them setting up an SL4 instance of FTS?
- Gavin
- Not yet. Will contact them again.
Castor/SRM
- Jamie
- When can back end move be done for Castor?
- Jan
- Next week.
- Jamie
- Could it be made this week?
- Jan
- Yes, we can schedule this.
LFC
- Sophie
- RAL reported to the developers that 1.6.11-3 is still crashing the LFC daemons. CERN will stay with LFC 1.6.8 until the problem is understood.
Databases.
See report.
- Jamie
- How do we get Oracle to react on this?
- Miguel
- Met with them yesterday, but we are having significant issues with Oracle support.
- Jamie
- What about the problem with load for ATLAS?
- Miguel/Harry
- This is a very significant problem. It's architectural so difficult to fix.
Workload Management
See report.
- Complaint from "a certain experiment" that very few jobs were running at CERN. This was checked and found to be a misconfiguration in the fair-share setup. Fixed and now being monitored.
Authentication and Authorisation
See report.
- Olof
- Is NA48 still running their own VOMS server?
- Steve
- Their VOMS server is now out of action and they are being hosted on the central VOMS service at CERN. (99.8% ready)
Monitoring
See report.
A.O.B.
- Jamie
- What is the timeline for the analysis instance of Castor?
- Olof
- Being discussed today. Would prefer to start with 2.1.7 (for safety).
- Jamie
- Shared instance for all experiments and xroot is the access protocol?
- Olof
- Yes. But need the xrootd backported to 2.1.7.
Next meeting will be in two weeks.
September 3rd 2008
Present
SteveTraylen (Minutes), Ulrich, Harry, Louis, Gavin, Olaf, Eva (?), DavidC, Jamie, Markus, Antonio, Roberto, Maria, Dianna, Judit, NickT
Deployment Status
Recent gLite release; the main thing is the fix for BUG:37563 concerning proxy renewal chain length.
- Jamie
- Will this be recommended for deployment?
- Markus and Antonio
- EGEE never specifies a policy on deploying releases. Up to sites with guidance from LCG.
- Jamie
- Follow up later.
Data Management
See report.
- Jamie
- Are we susceptible to these problems with the expected increase in load in the immediate future?
- Jamie
- Who is pushing for FTS SL4?
- Harry & Nick
- Brookhaven and Fermilab want SL4 for security purposes. Other tier1s mainly state that they do not want to run two instances.
- Gavin
- Would be better to hang on a few weeks.
- Jamie
- After next week we will have a better idea of where production is going. We should find out when there might be a week's gap for beam tuning, etc.
- Markus
- Why would it be more work? Surely it would be migrated via a parallel service anyway?
- NickT
- Agreement.
Databases.
See report.
Workload Management
See report.
Authentication and Authorisation
- Roberto
- Few jobs running?
- Ulrich
- Hit hard when larger Experiments wake up. Fair share policies.
- Jamie
- Is it easy to check what the shares are?
- Ulrich
- Can be checked on lsfweb. https://cern.ch/service-lsfweb
Monitoring
Nothing.
AOB
None.
Next meeting will be in two weeks.
August 20th 2008
Attendees
Harry, Nick, Steve, Gavin, Antonio, Ricardo
Outstanding Issues & Actions
Anti-clockwise beam test will be carried out this weekend.
LCG Service Review
Certification / Pre-production
- See the report.
- LHCb said that they will test glexec in the PPS.
Data Management
- Nothing to add to the report.
Databases
Workload Management
- Nothing to add to the report.
Authentication & authorisation
- Nothing to add to the report.
Monitoring, Logging & Reporting
AOB
- For information: Request from ATLAS to double the disk space and CPU at CERN.
August 6th 2008
Attendees
Harry, Nick, Steve, Sophie, Ulrich, Miguel, Olof
Outstanding Issues & Actions
The WLCG Management Board (MB) are concerned that there may be too many major middleware changes in the pipeline between now and first injection (end of August).
There may be restrictions imposed.
LCG Service Review
Certification / Pre-production
- See the report.
- Due to the imminent start-up of LHC, there is a significant chance that the CREAM CE will not make it into production in the immediate future.
Data Management
- See the report.
- When LFC 1.6.11 is released to production (should be not more than 2 weeks), Sophie will test it in the FIO pre-production area before installing it in production.
Databases
- Nothing to add to the report.
Workload Management
- See the report.
- When will LHCb be able to stop using LCG RBs and move to WMSs? The WMS bugs around limitation on proxy delegation and mix-up of proxies need to be fixed first. What about the readiness of Dirac3? Is it true that as long as Dirac2 is in production, the RBs will be needed?
Authentication & authorisation
- Nothing to add to the report.
Monitoring, Logging & Reporting
AOB
None.
July 9th 2008
DM
Nothing to add to report.
Workload
- The vo.sixt.voms.ch was enabled on batch systems.
- monb001 (R-GMA) was decommissioned and has now been switched off.
- Two service incidents related to movements of license servers.
- Stale IP address in LSF against license server.
- Expected to be transparent, 3 license servers moved 1 by 1. Should have been okay.
- Under Investigation for next time.
LHC News
- Caverns to be human-free by the 1st or 2nd week of August, but end of August may be more likely.
-- SteveTraylen - 09 Jul 2008