WLCG-OSG-EGEE Ops' Minutes Mon 12 Jan 2009

Summary

No summary yet.

Attendance

EGEE

  • Asia Pacific ROC: Absent
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Maite Barroso, Diana Bosio
  • French ROC: Pierre Girard
  • German/Swiss ROC: Angela Poschlad, Wen Mei
  • Italian ROC: Absent
  • Northern Europe ROC: Vera Hasper
  • Russian ROC: Absent
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer
  • UK/Ireland ROC: Derek Ross, Jeremy Coles
  • GGUS: Guenter Grein

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: Absent
  • BNL: Absent
  • CERN site: Ulrich
  • FNAL: Catalin Dumitrescu
  • FZK: Angela Poschlad
  • IN2P3: Pierre Girard
  • INFN: Absent
  • NDGF: Vera
  • PIC: Kai Neuffer
  • RAL: Gareth Smith
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

LHC Experiments

  • ATLAS: Alessandro di Girolamo
  • LHCb: Roberto Santinelli
  • CMS: Absent
  • ALICE: Patricia

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team    Secondary Team
From    ROC CERN        ROC France
To      ROC UKI         ROC Italy

  • Candidate sites for suspension:

* Site Name: SN-UCAD (ROC France); GGUS Ticket number(s): 44443, 44987, 42668
https://gus.fzk.de/ws/ticket_search.php?ticket=44443
https://gus.fzk.de/ws/ticket_search.php?ticket=44987
https://gus.fzk.de/ws/ticket_search.php?ticket=42668
Reason for escalation: no answer from the site for 1 month, and still in error on SRMv2-get-SURLs (GGUS #44443), sBDII-performance (GGUS #44987) and APEL-pub (GGUS #42668).
Answer from French ROC: As this site has never reached a sustainable production level since certification, the French ROC has decided, with the agreement of the site, to restart the whole certification process from the beginning. The site has therefore been put in "uncertified" status and is now out of production.

PPS Reports and Issues

  • Post-mortem of recent issues with releases: we would like to improve the roll-out of the BDII by adding a production site that deploys the BDII updates early and reports on the results BEFORE the release officially goes out.
  • Pilot service of SLC5 WN: in progress

gLite Release News

  • No release news as of this week

CREAM-CE (for ALICE)

Tier-1 sites in particular are encouraged to install 1 or more CREAM CEs.
CREAM CEs are currently available at only 3 sites, one of them a T1 (FZK). We would like to encourage other sites and T1s to follow, and we are providing support for this.
Question: what is the status at CERN? Answer: we are focusing on SLC5 first; once that is ready, and depending on manpower, we will start working on the CREAM CE. There is no clear estimate of when.
Question: what about the other T1s? No answer was given. The call stands for other sites to follow and start deploying CREAM CEs.
Antonio: there is also the option of deploying the PPS version of the CREAM CE. What is the difference? Functionally they are the same, but the version in PPS works well with ICE, while the version in production doesn't. More discussion on Wednesday at the GDB.

Sites supporting BioMed VO: Please update GFAL

Can all ROCs please contact their sites which support the BioMed VO and ask them to update their WNs with the latest version of GFAL. Discussion: the problem is that using versions of the GFAL API older than 1.10.6 kills the LFC, so we want sites to upgrade to newer versions. It was suggested that we develop a SAM test to detect this. Daniel Jouvenot (name supplied by John after the meeting) from the Biomed VO has already worked with SAM and should have experience with it. This is the long-term solution. In the short term the problem could be addressed on the LFC side by removing LFC list replica from all LFCs; the DM team will check the option of producing a version of the LFC without it, for Biomed only and not as an official release. To be discussed with Akos.
For the record, from the SAM team leader: detailed version testing should not be a SAM responsibility; it is more a matter of configuration management (being done by the new job wrappers, ongoing work). This is expected to be ready in a few months.
Long-term solution: VO-specific SAM test.
This GFAL version is from the end of 2007! Operations will send a broadcast on behalf of Biomed to all sites, or to the sites that support the Biomed VO, requesting the upgrade.
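
As an illustration of the version check discussed above, here is a minimal sketch in Python. It is not an existing SAM test or job-wrapper script. The function names, the site names in the usage note and the handling of a trailing package release suffix (such as "-1") are assumptions made for the example; only the 1.10.6 threshold comes from the discussion.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.10.6' into a tuple of ints.

    Assumption: a package release suffix after '-' (e.g. '1.10.6-1') may be
    present and is ignored for the comparison.
    """
    upstream = text.strip().split("-")[0]
    return tuple(int(part) for part in upstream.split("."))


# Threshold quoted in the minutes: versions older than this cause the problem.
MINIMUM_GFAL = parse_version("1.10.6")


def gfal_needs_upgrade(installed):
    """Return True when the installed GFAL version is older than 1.10.6."""
    return parse_version(installed) < MINIMUM_GFAL


def sites_to_contact(versions_by_site):
    """Return the sorted names of sites whose reported version is too old.

    versions_by_site is a hypothetical mapping of site name to the GFAL
    version string collected on its worker nodes.
    """
    return sorted(
        site
        for site, version in versions_by_site.items()
        if gfal_needs_upgrade(version)
    )
```

Comparing tuples of integers rather than raw strings matters here: as plain strings "1.9.12" would sort after "1.10.6", although it is the older version. For example, `sites_to_contact({"SITE-A": "1.9.0", "SITE-B": "1.11.0"})` would list only the hypothetical SITE-A.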

EGEE Items From ROC Reports

  • Central Europe: Two cases concerning the lack of procedures for how a site shall set its default SE:
    • The EGEE SLA allows a site to be CE-only (Section 8: a site must provide at least one CE OR SE). Not having an SE prevents the site from passing the RM SAM tests, since those tests use the closest (default) SE. Setting up a site in this situation is also not possible, because yaim requires an SE. Comment: maybe this is a problem with our interpretation of Section 8 of the SLA. Does this section say that a site can have "CE OR SE", or "CE with >=8 CPUs or SE with >=1 TB"? If the second, then Section 8 of the SLA can mislead.
    • When an SE is put in scheduled downtime, the site also has to put its CE into downtime (otherwise it will not pass the RM tests) or choose another SE at another site (there is no procedure for this).
ANSWER: Originally, for the availability reports, a site needed to have all site services (CE, SE, SRM and sBDII) - but this requirement was relaxed depending on the resources being provided. However, a “close SE” still needs to be defined for the CE tests to pass. This “close SE” does not have to be at the site. The ROC can be involved to help a site define a suitable SE.

  • DECH: GSI-LCG2 is down because of bugs in the 64-bit WN package, see GGUS Ticket-ID: 48013. How can this situation be de-escalated?
COD: it was escalated 2 weeks ago and is de-escalated now. It is not clear whether the problem is related to the 64-bit WN; the site seems to be working on it.

  • SouthWest Europe: SWE will have a new site, RedIRIS, which will only host central services (Top-BDII, WMS, LFC, MYPROXY etc.). This configuration will cause problems in GSTAT because some necessary variables will not be defined. Will this configuration be supported in the future? Is there a workaround for this type of site?
We hope that this configuration will be supported. We will check whether a similar case already exists. Action on the CERN-ROC to check. Kai: we'll go ahead, see what breaks and report on it.

WLCG Items

WLCG issues coming from ROC reports

Upcoming WLCG Service Interventions

  • Consult the links on the agenda page. The RSS feed is now working and people can subscribe.

WLCG Service Coordination

The second run of the ATLAS 5 million test starts on Wednesday, covering T0-T1 and T1-T1.

ATLAS Service

A problem with the SAM WMS was found during the Xmas vacation; it caused many other problems and the test framework was stuck. The problem is now understood and should be solved soon. The SAM WMS will authorize ATLAS to submit jobs through it. In addition, NIKHEF requested that tests through this WMS do not produce warnings but a different message/exit code: a simple NOTE instead of WARNING. This will be changed in the ATLAS tests. The change is already done in SAM and should be included in the next release, coming next week.

ALICE Service

Nothing to report

CMS Service

  • Tier-0 = The DataOps team kept the T0 resources mostly saturated throughout the winter break. This made it possible to repack and prompt-reco all of CRAFT completely twice, and CRUZET + BeamCommissioning 3 times (the 4th pass was running last week). Results were written to disk-only pools and promptly recycled as needed. Main issues: 1. some in the CMS T0 code(s) (--> FIXED); 2. CERN resources behaved well except for some LSF failures on the weekend of Jan 3-4 (--> FIXED); 3. some lessons learned in data handling over vast datasets at T0 (CMS-specific lessons) (--> being addressed).
  • MC production = Summer08 phase: physics requests account for 253 M events produced (GEN-SIM-RAW, CMSSW_2_1_7) and 208 M events reconstructed (CMSSW_2_1_8). --- Fall08 phase: MadGraph requests with CMSSW_2_1_17: 15.6 M evts produced + reco'ed, plus 1 RAW workflow and 1 RECO workflow still running (the only problems are with one workflow, which does not yet work even with a patched version of ProdAgent, PA_0.12.9). --- Winter09 phase: FastSim requests with CMSSW_2_2_3. 45 requests were assigned to be run during the Xmas break; 44/45 are DONE and the remaining 1 was just skipped by DataOps. Total: 342 M evts produced. --- Summary of issues (breakdown with site issues only): just a couple of T2 sites had temporary issues, all fixed/bypassed.
  • Reprocessing at T1 sites = 1) CRAFT activities: CRAFT data AlCaRECO and skims ran at IN2P3, FZK and PIC, with about 50k jobs/workflow. IN2P3 had storage/SRM-related issues over the Xmas break. There were many issues with the glideins: some were solved by the DataOps submitters and some jobs ran, but at a somewhat limited rate; it was not easy. In addition, regarding skims: 5 workflows had problems with the RECO-RAW output and needed a fix to DBS to get sorted out. 2) re-digi and re-reco: a move from glideins to gLite was also tried and also had problems; jobs ran for a while and then turned out to give errors. In this case, AFAICT, they were mostly identified as a site issue at PIC. Unfortunately, no tickets were opened by the operators (much to improve here).
  • Transfer system = No major issues with the transfer system. A total of 175.11 TB was transferred over the winter holidays. Just one event: the PhEDEx Castor-related export agents at CERN were not responsive on a Friday morning (I recall it to be Jan 2nd); the problem resolved itself.

LHCb Service

Up to 15000 jobs ran concurrently on the grid, impressive! Sites are complaining about the usage pattern of SQLite files stored in the NFS-mounted shared area, which causes all processes to hang. People are working on accessing the files locally instead of through the shared area on NFS.

OSG Items

Discussion of open tickets for OSG:

  1. https://gus.fzk.de/ws/ticket_info.php?ticket=44104: "A Nebraska site publishes the GlueSite object twice with 2 different base DNs". They are trying to figure out how to fix it within the OSG architecture; there is no short-term resolution.
  2. https://gus.fzk.de/ws/ticket_info.php?ticket=44140: "The site BU_ATLAS_Tier2 publishes information which are not Glue v1.3 compliant"
  3. https://gus.fzk.de/ws/ticket_info.php?ticket=44837: "'lsm-get failed' error occurred at site 'HarvardU' under BNL"

Tickets 2 and 3 are being looked at.

Newly Created Action Items

Assigned to: Main.OCC
Due date: 2009-01-31
Description: OCC to send broadcast to sites requesting to upgrade the GFAL version so it is higher than 1.10.6
More details about the issue can be found here: https://gus.fzk.de/ws/ticket_info.php?ticket=43994

Update 19/1/2009: Biomed GFAL version problem, Maite will send broadcast after the meeting (seems some sites are still on SL3 and need to upgrade the O/S as well as GFAL!)

Update 26/1/2009: no broadcast seen, OCC to follow up.

Update 2/2/2009: broadcast not sent, problem being followed up with sites through GGUS. Agreement to close item.

2009-02-03

Assigned to: Main.Akos
Due date: 2009-01-31
Description: The Data Management team (Akos) to provide a version of the LFC without list replica (related to the old GFAL version problem reported by Biomed)

Update 19/1/2009: (mail from Akos):
We have examined the issue and it does not look like a security problem, but a resource limitation: the number of threads in an LFC instance limits the number of clients that can connect concurrently, and the Biomed usage pattern exceeds that limit.
Once the clients finish their work, the LFC would be responsive again.

The same problem would occur with other iterator-like operations, like opendir/readdir/closedir.

Removing these operations would cause old clients to fail; however, it would not solve the problem, so in my opinion the upgrade of lcg_utils is the right solution.

Unfortunately nobody has contacted us from the Biomed community regarding the possibility and context of a special build, so we did not progress on that side.

Update 26/1/2009: Can be closed.

2009-01-27

Assigned to: Main.Biomed
Due date: 2009-02-28
Description: Long term solution to the old GFAL version problem reported by Biomed: develop VO specific SAM test to detect this, and then exclude the sites with the wrong version

Update 19/1/2009: Long-term solution could be SAM tests, or adding GFAL version collection to job-wrapper scripts.

2009-01-27

Assigned to: Main.SAM
Due date: 2009-01-31
Description: SAM and ATLAS (Alessandro) to get together to understand how SAM-ATLAS deals with sites with no close SE defined and see if this can be used in SAM-operations

Update: 19/1/2009:
The outcome of the get-together was:

>> Not having SE affects on passing by site RM SAM tests - those tests take closest SE (default).
This is incorrect – the defined SE doesn’t have to be at the site!

>> Also setting up site in such situation is not possible because yaim require SE.
Correct, but again the SE doesn’t have to be local to the site.

>> In case of putting SE in Scheduled downtime, site have to put also CE into downtime (otherwise will not pass RM tests) or chose (lack in procedures) other SE (from other site).

This is correct, and the only real issue. ATLAS doesn’t use the Replica Management tests, but believes that they should be part of the ops infrastructure tests (which are more extensive). There may be a case for making the replica management tests non-critical, but they’ve been critical for two years now and most people seem happy with this.

The way for a site to change the defined SE is to modify the variable VO_OPS_DEFAULT_SE in the WNs’ site-info.def files.
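
As a rough illustration of that change, the Python sketch below rewrites the VO_OPS_DEFAULT_SE assignment in the text of a site-info.def file. It is only a sketch under assumptions: the helper name, the quoted KEY="value" line format and the host names are made up for the example, and it does not run yaim or touch any file on disk.

```python
import re


def set_default_se(site_info_text, se_host, variable="VO_OPS_DEFAULT_SE"):
    """Return site-info.def text with VARIABLE pointing at se_host.

    Assumption: the file holds one KEY="value" assignment per line.
    An existing assignment is replaced; otherwise one is appended.
    """
    new_line = f'{variable}="{se_host}"'
    pattern = re.compile(rf"^{re.escape(variable)}=.*$", re.MULTILINE)
    if pattern.search(site_info_text):
        # A function replacement avoids backslash handling in new_line.
        return pattern.sub(lambda _match: new_line, site_info_text)
    # Keep the new assignment on its own line when appending.
    separator = "" if not site_info_text or site_info_text.endswith("\n") else "\n"
    return site_info_text + separator + new_line + "\n"
```

For instance, calling `set_default_se(text, "se.other-site.example.org")` on a file that already defines the variable swaps in the hypothetical remote SE and leaves every other line unchanged.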

2009-01-27

Assigned to: Main.CERN-ROC
Due date: 2009-01-31
Description: Check of existing cases of sites only hosting core services, without site services. This is to support a new site RedIRIS in SWE ROC

Update 19/1/2009: CERN ROC to check sites with only core services – no progress.

Update 2/2/2009: New SWE site RedIRIS will only host core services (BDII, WMS, etc.)

Problems until now:

1) GIIS performance error due to: GIIS Old Entries Found: 6 - ERROR
- This will make the SAM test gperf fail.

2) No Grid Version published: GridVersion: *NOTE* could not find valid LCG version
- This is just a warning in GSTAT at this moment

The other tests seem to work; only the gperf error is critical.

Update 12th February - Steve will take a look to understand what this is about.

Update 19th February - Steve - Confused: there is no RedIRIS site in gstat? http://gstat.gridops.org/gstat//SouthWesternEurope.html

Update at the meeting - Kai will check.


Review of Open Action Items

Open Action Items

Id   Submitter   Description   Creation   Due   Assigned To

Actions Closed in Last 20 Days

Id   Submitter   Description   Creation   Due   Assigned To   Closed

AOB

None

Next Meeting

The next meeting will be Monday, 19 Jan 2009 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r14 - 2009-02-27 - NickThackray
 