Minutes from the WLCG SCM Status Meeting

Instructions for the minutes taker.

  • Update this page with the minutes, encapsulating them in START/ENDLCGSCM tags as below.
  • Visit LcgScmStatus, press the PDF button and save the file using the same naming scheme as the existing files at the foot of that page.
  • Attach the PDF, selecting the option to add a link on the page. This link will appear at the foot of LcgScmStatus. Do any formatting required to tidy up this automatic addition.
  • Within Indico, add a link to the PDF on the agenda as the minutes.

Oct 7th 2009

Present

SteveT (minutes), Gavin, Jamie (chair), Maria, Andrea, Harry, Sophie, Riccardo, Antonio, Olaf, Eva, Miguel, Maarten, JanI

Introduction

  • There has not been a meeting for a while, partly because things are running smoothly.
  • Most of the LHC is at 2 or 3 degrees - something will happen soon.

LCG Quarterly Report

WLCG Service Reports

See above for individual reports.

Production Services

Production release: small but very important for CREAM. 3.1 update 56 fixes two high-risk vulnerabilities and some outstanding bugs in CREAM.
  • One for accounting: records are now filled in for the case where the pbs_server and CREAM CE are on split nodes.
  • Other functional issues, mainly from CERN.
  • New BLParser for LSF.

Future releases go to PPS tomorrow, with the corresponding release to production presumably next week. We are now moving to a new process, the staged rollout. First to go through it will be VOMS and the GLUE 2.0 enabled BDII. It is unclear whether the new VO-Box will go through the old or the new process.

question
What is the staged roll-out process?
answer
Releases essentially go to production first; a club of sites will be early adopters. Interested sites can take part.
question
Is the cream CE now as good as the lcg-ce?
answer
Certainly one item is fixed, but it needs testing on the submission side. This is the first release to be considered production rather than special. There are now SAM tests in place.

Overnight 8 CREAM CEs were updated, with around 11 (that publish) still to go. See "Current Status of cream CEs".

question
Will there be a WMS release?
answer
There is a minor release on the way.
Maarten
List of minimum versions will be revamped shortly.

Olaf
Following releases as they go is not always easy.
Jamie
Hopefully better in an SSC world.
Olaf
Yesterday's CREAM CE release may also contain config changes that slow deployment.
Antonio
The CREAM CE deployment is small and the current one is broken anyway. This release also includes the rpath RPM fixes, which allowed time for the changes to have had extensive certification already.
Jan
A workaround for current deployments is always appreciated.
Antonio
In other cases security fixes do block the addition of new features, e.g. the rpath fixes.

Data Management

See Data Management Services above.

Castor

Miguel
The main new Castor feature is throttling, which is particularly interesting to CMS and ATLAS.
Harry
What is the latest date for this change?
Miguel
Feature not released, due for release first week of November.
Jamie
That is late, so should it wait until the Christmas shutdown?
Miguel
We want to install on the non-LHC instances and be ready for LHC VOs if they demand it.
Jamie
The gap between the Castor versions at Tier0 and at the Tier1s is widening.
Miguel
That is up to the Tier1s.
Jamie
Does it mean that the old version will no longer be supported?

FTS

The SRM interaction does change, hence the need for scale testing, which is now underway.
  • Jamie: Does ATLAS need the new FTS with checksum support at all Tier1s?
  • Gavin: ATLAS are certainly interested in having it.
  • Jamie: It is difficult to imagine the Tier1s doing this before Christmas; the model was to run the new version here for a month anyway.

LFC

CMS are not using their production LFC, so it will be merged into the general LFC. CMS are happy with this, but it needs some wider advertisement.

Database

See report above.
  • Migration is to RHEL4: validation was done on RHEL5, but problems were found there.

Workload Management

See report above.
  • Two new aliases for slc4/5 lxplus. Encourage users to move in this direction.
Harry
Are there too many slc4 CEs over slc5 CEs?
Riccardo
Work in progress; new CE nodes are being deployed for slc5.
Maarten
Newer lcg-CEs are better anyway from a load point of view.
Harry
How is the doubling of WN capacity going? Will it be there in October?
Olaf
Delivery is delayed at the moment, but FIO is ready to deploy when it arrives.

Authentication

  • See voms report above
  • MyProxy - trying to remove myproxy-fts from production service.
  • On the horizon: the VOMS-enabled version of MyProxy.

AOB

How often do we meet? Every two weeks, with the usual clashes with GDB/F2F SSC/TMB. The next meeting is on 4th November for sure, and possibly one in two weeks' time.

May 27th 2009

Present

Nick (minutes), Jamie, Steve, Harry, Antonio, James, Maria D., Sophie, Gavin, Diana, Ricardo

Outstanding Issues & Actions

Other changes in the pipeline

  • Nothing to report.

LCG Service Review

Certification / Pre-production

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status).

  • Jamie: Is there anything here that is needed for STEP 09?
    Antonio: The ICE fix so that it works with the current production version of the CREAM CE would have been nice but we shouldn't rush the release of the gLite 3.2 WMS for that. The ICE fix should have been in a different patch, but it isn't.

Data Management

Castor/SRM

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Castor_SRM).
  • Jamie: Do these updates really need to be applied before STEP 09?
    Gavin: Not sure about that - but it's already in progress.
    Jamie: We seem to be losing the period of stability before large-scale exercises! Experience tells us that this is needed.

  • Jamie: There is a switch intervention on 2nd June which could possibly affect all IT/DES managed databases. The intervention should be transparent.

FTS

Nothing to report.

LFC

Nothing to report.

Databases.

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Database_Services).

Workload Management

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Workload_Management).

  • Harry: For information: Have been asked to give 2000 SL5 job slots to beam groups for 3 months.

  • Steve: We should tell all other sites how we do the disabling of SELinux heap exec check.

  • Antonio: There are patches about to arrive in the PPS for GFAL and lcg-utils which are to fix a problem only seen at CERN by LHCb. We have been asked to treat them as "top priority", which involves a lot of extra work. Once this is available, would it be installed immediately at CERN? Is CERN aware of significant problems caused in production services by the issue fixed by this patch?
    Ricardo: Not aware of any big problems.

Authentication and Authorisation

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Authentication_and_Authorisation).

Monitoring

Nothing to report.

A.O.B.

  • Harry: There is a Linux upgrade due on 15th June. This should be delayed if possible.

  • Sophie: I had scheduled to upgrade the AFS UI on 2nd June. Should I postpone this?
    Nick: Is this version already available in production labelled as the "New" version (as opposed to the Default)?
    Sophie: Yes.
    Nick: Then this should be delayed.




April 29th 2009

Present

Nick (Minutes), Jamie, Steve, Harry, Andrea, Olof, Antonio, Eva, James, Maria, Ewan

Outstanding Issues & Actions

Other changes in the pipeline
  • STEP 09: No very firm dates yet. Setup in May - execution in June. Still striving to get ATLAS and CMS to carry out the same activities at the same time. Preferably including significant user analysis.
    ATLAS’ plans for STEP 09 can be found here: https://twiki.cern.ch/twiki/bin/view/Atlas/Step09

Deployment Status

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Deployment_Status).

  • DPM will be released to production within 2 weeks. It contains a bug but this only affects 3 sites.

  • Jamie: What is critical for STEP 09?
    Antonio: Only DPM.

Data Management

FTS
Nothing to report.

Castor/SRM
See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Castor_SRM).

  • Jamie: So this will be in place for STEP 09?
    Olof: Yes.

  • Sophie: ATLAS want LFC with bulk methods deployed so they can test it, but we can't do this easily from the PPS repository.
    Antonio: this will be available in a "preview" repository by the end of the week. Contents and documentation will look like a production release.

LFC
Nothing to report.

Databases.

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Database_Services).

  • Steve: Automatic notifications for VOMRS problem have stopped - might already be fixed?

  • Eva: Security updates will be rolled out in 2 weeks.

Workload Management

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Workload_Management).

  • gLite update 44 broke ICE. There have been frequent interventions needed for the WMS. The 3.2 WMS is supposed to fix lots of these problems, but we will wait and see. It is unlikely that the WMS will be in a steady and stable state before STEP 09.

Authentication and Authorisation

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Authentication_and_Authorisation).

Monitoring

See report (https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatus#Monitoring_Logging_and_Reporting).

  • Getting increased e-log usage from CMS so they are discussing doing this for themselves.
  • Jamie: How many sites are using “Site View”?
    Tier-1s have been asked to comment but don't know of any other sites using this. Probably needs promoting. [Who will do this?]

A.O.B.

  • Jamie: What is the typical number of piquet call-outs for DM? Gavin said he thought it was about 1 per week (including all hours). Is there a log anywhere?
    Olof: Yes. Can also see out-of-hours through EDH (compensation requests).
    Jamie: Would be interesting to monitor this through May and June.

  • Harry: There has been a formal request from ATLAS to provide better CERN twiki support. There has also been a request from LHCb for better CVS support.
    Jamie: These were in the original lists of “Critical Services” for the experiments but not with high priority (10 or 11).
    Harry: The IT SRM (Service Review Meeting) will probably look after these.

  • Nick will update the Baseline Services wiki page (linked on the agenda) for STEP 09.

Next meeting will be May 20th




November 5th 2008

Present

Nick (Minutes), Antonio, Sophie, Olof, Miguel, Harry, Steve, Ulrich, Jan

Deployment Status

See report.
  • Harry: Any news on when PPS expects to see the WMS ICE interface?
    Antonio: A patch has been built but not yet released to certification.
  • Harry: What about the list of criteria for replacing LCG CE with CREAM CE?
    Nick: This has been sent out to a wide audience. Waiting a few days for feedback, then will submit to the EGEE TMB for approval.

Data Management

See report.

FTS
See report.

Castor/SRM
See report.

LFC
See report.

Databases.

See report.

Workload Management

See report.
  • Nick: Does LHCb still need the RB?
    Harry/Sophie: Will check. [ACTION]

Authentication and Authorisation

See report.

Monitoring

See report.

A.O.B.

Next meeting will be 19 November


October 1st 2008

Present

NickT (Minutes), Jamie, Harry, Gavin, Steve, Louis, Miguel, David, Sophie, Antonio

Deployment Status

See report.
  • STOP PRESS: Problem found last night in the SL4 FTS service running at CERN. log4cpp was seen to seg fault when an error message is very large. Restarting the daemon just causes it to try to process the error again, and so it seg faults again. Gavin will keep the CERN service running but this version should not be released generally. Paolo is working on a fix.
    UPDATE: gLite will likely patch log4cpp and re-release it (since this bug will potentially affect any other gLite software that makes use of the library).

Data Management

See report.

FTS
See report.

Castor/SRM
See report.

LFC
See report.

Databases.

See report.
  • Jamie: So CMS run the hardware?
    Miguel: Yes, and we don't have a 24x7 contact at the pit to get someone to, for example, restart boxes there.

Harry: Do you have anyone for LHCb?
Miguel: Not at the pit.
Jamie: Will write something generic to request proper contact details.

Workload Management

See report.

Authentication and Authorisation

See report.
  • NA48 are now fully sorted out.
  • Will add updates to pilot service for US to be able to test.
  • Strange problem seen (at a low level) where expiration of UK CA is causing rejection of random, non-UK users. Under investigation. May be related to Oracle problem ORA-942.

Monitoring

See report.

A.O.B.

  • No official update on LHC restart. Jamie has requested the MB to have an official communication channel.
  • Very probably there will be a CCRC '09 - exact timing to be decided.

Next meeting will be in 2 weeks.


September 17th 2008

Present

NickT (Minutes), Jamie, Harry, Gavin, Olof, Jan, Steve, James, Miguel, Sophie, Antonio, Ulrich

Deployment Status

See report.

Data Management

See report.

FTS
Nick
Have we heard from BNL or Fermilab regarding them setting up an SL4 instance of FTS?
Gavin
Not yet. Will contact them again.

Castor/SRM
Jamie
When can back end move be done for Castor?
Jan
Next week.
Jamie
Could it be made this week?
Jan
Yes, we can schedule this.

LFC
Sophie
RAL reported to the developers that 1.6.11-3 is still crashing the LFC daemons. CERN will stay with LFC 1.6.8 until the problem is understood.

Databases.

See report.
Jamie
How do we get Oracle to react on this?
Miguel
Met with them yesterday, but we are having significant issues with Oracle support.

Jamie
What about the problem with load for ATLAS?
Miguel/Harry
This is a very significant problem. It's architectural so difficult to fix.

Workload Management

See report.
  • Complaint from "a certain experiment" that very few jobs were running at CERN. This was checked and found to be a misconfiguration in the fair-share set up. Fixed and now being monitored.

Authentication and Authorisation

See report.
Olof
Is NA48 still running their own VOMS server?
Steve
Their VOMS server is now out of action and they are being hosted on the central VOMS service at CERN. (99.8% ready)

Monitoring

See report.

A.O.B.

Jamie
What is the timeline for the analysis instance of Castor?
Olof
Being discussed today. Would prefer to start with 2.1.7 (for safety).

Jamie
Shared instance for all experiments and xroot is the access protocol?
Olof
Yes. But need the xrootd backported to 2.1.7.

Next meeting will be in two weeks.

September 3rd 2008

Present

SteveTraylen (Minutes), Ulrich, Harry, Louis, Gavin, Olaf, Eva (?), DavidC, Jamie, Markus, Antonio, Roberto, Maria, Dianna, Judit, NickT

Deployment Status

Recent gLite release, main thing is fix for BUG:37563 concerning proxy renewal chain length.

Jamie
Will this be recommended for deployment?
Markus and Antonio
EGEE never specifies a policy on deploying releases. Up to sites with guidance from LCG.
Jamie
Follow up later.

Data Management

See report.
Jamie
Are we susceptible to these problems with the expected increase in load in the immediate future?
Jamie
Who is pushing for FTS SL4?
Harry & Nick
Brookhaven and Fermilab want SL4 for security purposes. Other tier1s mainly state that they do not want to run two instances.
Gavin
Would be better to hang on a few weeks.
Jamie
After next week we will have a better idea of where production is going. We should find out when there might be a week's gap for beam tuning, etc.
Markus
Why would it be more work? Surely it would be migrated via a parallel service anyway?
NickT
Agreement.

Databases.

See report.

Workload Management

See report.

Authentication and Authorisation

Roberto
Few jobs running?
Ulrich
Hit hard when larger Experiments wake up. Fair share policies.
Jamie
Is it easy to check what the shares are?
Ulrich
Can be checked on lsfweb. https://cern.ch/service-lsfweb

Monitoring

Nothing.

AOB

None.

Next meeting will be in two weeks.

August 20th 2008

Attendees

Harry, Nick, Steve, Gavin, Antonio, Ricardo

Outstanding Issues & Actions

Anti-clockwise beam test will be carried out this weekend.

LCG Service Review

Certification / Pre-production

  • See the report.
  • LHCb said that they will test glexec in the PPS.

Data Management

  • Nothing to add to the report.

Databases

  • No report.

Workload Management

  • Nothing to add to the report.

Authentication & authorisation

  • Nothing to add to the report.

Monitoring, Logging & Reporting

  • No report.

AOB

  • For information: Request from ATLAS to double the disk space and CPU at CERN.

August 6th 2008

Attendees

Harry, Nick, Steve, Sophie, Ulrich, Miguel, Olof

Outstanding Issues & Actions

The WLCG Management Board (MB) are concerned that there may be too many major middleware changes in the pipeline between now and first injection (end of August). There may be restrictions imposed.

LCG Service Review

Certification / Pre-production

  • See the report.
  • Due to the imminent start-up of LHC, there is a significant chance that the CREAM CE will not make it into production in the immediate future.

Data Management

  • See the report.
  • When LFC 1.6.11 is released to production (should be not more than 2 weeks), Sophie will test it in the FIO pre-production area before installing it in production.

Databases

  • Nothing to add to the report.

Workload Management

  • See the report.
  • When will LHCb be able to stop using LCG RBs and move to WMSs? The WMS bugs around limitation on proxy delegation and mix up of proxies need to be fixed first. What about the readiness of Dirac3? Is it true that as long as Dirac2 is in production, the RBs will be needed?

Authentication & authorisation

  • Nothing to add to the report.

Monitoring, Logging & Reporting

  • No report.

AOB

None.

July 9th 2008

DM

Nothing to add to report.

Workload

  • The vo.sixt.voms.ch was enabled on batch systems.
  • monb001 (R-GMA) was decommissioned and has now been switched off.
  • Two service incidents related to movements of license servers.
    • Stale IP address in LSF against license server.
    • Expected to be transparent: 3 license servers were moved one by one. Should have been okay.
    • Under Investigation for next time.

LHC News

  • Caverns to be free of people by the 1st or 2nd week of August, but end of August may be more likely.

-- SteveTraylen - 09 Jul 2008

Topic revision: r16 - 2009-10-07 - SteveTraylen