WLCG-OSG-EGEE Ops' Minutes Mon 10 Mar 2008

Attendance

EGEE

  • Asia Pacific ROC: Min
  • Central Europe ROC: Marcin
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Maite
  • French ROC: Gilles, Osman
  • German/Swiss ROC: Sven, Clemens
  • Italian ROC: Daniele, Alessandro
  • Northern Europe ROC: Jules
  • Russian ROC: Victor
  • South East Europe ROC:
  • South West Europe ROC: Kai
  • UK/Ireland ROC: Derek
  • GGUS: Helmut
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry

WLCG Tier 1 Sites

  • ASGC: Min
  • BNL: Absent
  • CERN site: Ignacio Reguero
  • FNAL: Absent
  • FZK: Sven
  • IN2P3: Pierre
  • INFN: Alessandro
  • NDGF: Mattias
  • PIC: Gonzalo
  • RAL: Derek, Matt, Catalin
  • SARA/NIKHEF: Jules
  • TRIUMF: Absent

VOs

  • LHCb: Roberto Santinelli
  • CMS: Andrea Sciaba
  • ATLAS: Alessandro Di Girolamo

Reports Not Received

  • VOs: ALICE, ATLAS, LHCb.
  • EGEE ROCs (Prod Sites): UKI

Feedback on Last Week's Minutes

None were given.

EGEE Items

Grid Operator Hand Over on Duty

          Primary Team   Secondary Team
  From    Italy          SWE
  To      Russia         DECH

  • Comment: the COD portal was unavailable on Thursday, 6th of March.
Backup Team (South West Europe):
  • 1st mail: 18
  • 2nd mail: 18
  • Site OK: 31
  • Quarantine: 5

PPS Reports and Release News.

gLite 3.1.0 PPS Update 20 has finished the pre-deployment phase and is now available in the public PPS repository. In particular, this update contains:
  • glite-MON for gLite 3.1 / SL4

Release notes in: PPSReleaseNotes_310_PPS_Update20
Pre-deployment test reports in: http://www.cern.ch/pps/index.php?dir=./release/testreports/gLite3.1.0/ and PPSReleaseNotes_310_PPS_Update20.

  • Question: What about WMS for SL4 status?
  • Answer: Antonio says there is not far to go with this.

EGEE Items From ROC Reports

  • (ROC CE): Two questions about availability calculation.
    1. Could we present what fraction of unavailability periods is considered by sites as non-relevant? Site admins fill in weekly reports and record this information for each individual SAM test failure, so the data is there. In our view, this information would help to identify areas where availability can be improved.
    2. Would it be possible to implement mechanisms for automatic removal of periods in which sites failed due to monitoring-related problems like this one: SAM Result Link?
      • It is hard to retro-fit gridview data and correct it.
      • Future transport based on Active MQ (with buffering) will help greatly.
      • Osman and Marcin will contact one another to try and resolve this.
      • Next week Marcin will provide some clear examples of the kind of failure periods that should be removed. Action item added.
        • Sven seconds this motion; it reflects the feeling in DECH as well.

gLite Release News.

gLite 3.1 Update 16 was released to production today. The update contains:

  • A new index on the attribute GlueServiceEndpoint, used by lcg-utils
  • UI: Bug fixes to jdl API (bulk submission) and gfal clients
  • dcache SE: Glue 1.3 clean ups and bug fixes
  • DPM SE: version 1.6.7 (32-bit and 64-bit) fixing various configuration bugs; introducing new front-ends for Xroot and HTTP/HTTPS; upgrading gSOAP from 2.6.2 to 2.7.6b
  • GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
  • lcgCE: bug fixing

Release notes: http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

WLCG Items

WLCG issues coming from ROC reports

  • (France ROC): A lesson learnt from CCRC08 is that some VOs do not check the status published by a CE queue, so they can wrongly submit to queues with a non-Production status. At IN2P3-CC, for an ATLAS-CMS combined test, we had set up 2 queues with the status "TEST" in order to restrict access to jobs that had explicitly requested this status, but after a while we noticed plenty of regular ("production") jobs on those queues. Please check the queue status before submitting: it must be set to "Production".
    • Pierre: This was seen with both CMS and ATLAS jobs.
    • CMS: Can you produce a list of users?
    • Pierre will submit GGUS tickets to the VO if it happens again.
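
The check requested by the French ROC can be sketched as follows. This is only a minimal illustration, assuming the queue information has already been read from the information system into plain dictionaries. The function name `production_queues` and the sample queue identifiers are hypothetical; `GlueCEStateStatus` is the Glue schema attribute that carries the published queue status.

```python
def production_queues(queues):
    """Keep only queues whose published status is exactly 'Production'."""
    return [q for q in queues if q.get("GlueCEStateStatus") == "Production"]


# Hypothetical sample data: one regular queue and one restricted test queue.
queues = [
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-batch-long",
     "GlueCEStateStatus": "Production"},
    {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-batch-test",
     "GlueCEStateStatus": "TEST"},
]

selected = production_queues(queues)
for queue in selected:
    print(queue["GlueCEUniqueID"])
```

A queue that publishes no status at all is also skipped here, which is the conservative choice for a submission tool.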

Upcoming WLCG Service Interventions

  • PIC has a long downtime next week, starting Monday 17th and lasting 4 days.
  • FZK is at risk on Saturday for updates to the database hosting the 3D service.

ATLAS Service

  • Request to ATLAS sites to upgrade WNs to SL4
  • List of sites having CEs that ATLAS can use, by OS: atlas-sites-by-os.txt
  • Finished M6 cosmic detector run.
    • Took 40 GBytes of data.
    • Problems with CERN on Friday but all resolved promptly.
  • This week, T1-T1 functional tests: all T1 vs PIC and all T1 vs Lyon.
  • Next week performance tests: throughput test, T0-T1-T2

CMS Service

Data certification, T0 status and reprocessing
All activities suffered from the LSF incident (full log by CMS at CMS.FacOps-IncidentCERNLSF-Feb28Mar07, discussed with Bernd/Ulrich at the FacOps meeting; see bottom of http://indico.cern.ch/conferenceDisplay.py?confId=30054). It was also a hard week for RelVal at CERN (the LSF issue left CMS behind in release validations). FastSim production was proceeding fast before the problems (6k/15k proc jobs complete), and recovered soon after. --- Good progress on the StorageManager side: identified and configured the nodes to be used in the Global Run in March.
Re-processing
On CSA07 signal workflows, ~6M GEN-SIM input events have just arrived at the T1s; ~17M events were processed last week. Processing is running at FNAL also. FastSim production finalized with CMSSW_1.6.9 (+ 2 additional tags for the config files CMSSW_1.6.10): ~100M PDAllEvents from the 3 soups (RelVal samples). No site issues at ASGC, CNAF, FZK, PIC, RAL; at FNAL, jobs take too long due to a dCache issue, which is being investigated; at IN2P3, there were problems in the pool area and several days without being able to merge jobs, now solved, and production is already back on schedule. --- Ran some post-CCRC reprocessing jobs with ATLAS: some lessons learned at IN2P3 and PIC (too long to report here).
MC production
~85M requested CSA07 Signal events are done and now available for reco. 56 workflows for ~3M requested events are still to be done. Two types of problems (all CMSSW-related, so not worth mentioning here). 4 finished datasets (4M events, 1.45TB) are subscribed but not yet transferred to any T1 MSS. --- 1 DPG workflow (2 Mevts): GEN-SIM is done. Transferring. --- HLT: running (it's CMSSW_1_7_4, GEN-SIM-DIGI-RAW), 1 big workflow (10 Mevts) in production now, ~2 Mevts are done. --- Detailed summary of current production activities at http://khomich.web.cern.ch/khomich/csa07Signal.html.
Data Transfers and Integrity, DDT-2/LT status
/Prod transfers: proceeding, 16 TB/week this week, no major problems. /Debug transfers: since February 11th, new links are being commissioned exclusively with the new DDT-2 metric. Link exercising is proceeding, generally very successfully: 78% of the previously commissioned links had already PASSED the new metric as of March 6th. We have 285 commissioned links (as of March 6th). The breakdown is: 55/56 T[01]-T1 crosslinks (only ASGC->RAL is missing); 142 T1-T2 downlinks and 83 T2-T1 uplinks, where 38 T2s have at least 1 downlink and 37 T2s have at least 1 uplink, and the intersection is 35 T2s that have both; 5 T2-T2 links. The first round of testing is almost complete. Sites can take advantage of the gap before the second round to commission new links or recommission failed links. Real problems were found and fixed during exercising, and the first "success stories" in troubleshooting are being documented. --- Full details at CMSDDTLinkExercising.
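
The link breakdown quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers given in the CMS report; the variable names are illustrative, and the count of T2s with at least one link in either direction is derived here by inclusion-exclusion and does not appear in the report itself.

```python
# Commissioned links per category, as quoted in the report (6 March).
links = {
    "T[01]-T1 crosslinks": 55,
    "T1-T2 downlinks": 142,
    "T2-T1 uplinks": 83,
    "T2-T2 links": 5,
}
total_links = sum(links.values())

# T2 sites with at least one downlink, at least one uplink, and both.
t2_with_downlink = 38
t2_with_uplink = 37
t2_with_both = 35

# Inclusion-exclusion: sites with a link in at least one direction.
t2_with_any_link = t2_with_downlink + t2_with_uplink - t2_with_both

print("total commissioned links:", total_links)
print("T2s with at least one link:", t2_with_any_link)
```

The category counts add up to the 285 commissioned links stated in the report.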

SRM Version 1 Services Dropped from SAM

CMS also reported that many SRMs had vanished from SAM testing. In brief, this was attributed to some SRM version 1 services publishing:
GlueServiceType: srm
GlueServiceVersion: 1.N.0

While this is perfectly correct, the SAM2BDII script fails to recognise it.

Action item created on the SAM team to resolve this promptly.
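
The kind of matching involved can be illustrated with a short sketch. This is not the logic of the SAM script, which is not shown in these minutes. It is a hypothetical example that reads the two published attributes and extracts the SRM major version. The function names, the case-insensitive comparison of the service type, and the sample value 1.1.0 (one instance of the 1.N.0 pattern) are all assumptions.

```python
import re


def parse_attributes(text):
    """Parse 'Name: value' lines into a dictionary."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs


def srm_major_version(attrs):
    """Return the SRM major version as an int, or None if not an SRM entry."""
    # Assumption: the service type is compared case-insensitively.
    if attrs.get("GlueServiceType", "").lower() != "srm":
        return None
    # Accept any version that starts with digits, e.g. '1', '1.1' or '1.1.0'.
    match = re.match(r"(\d+)(\.|$)", attrs.get("GlueServiceVersion", ""))
    return int(match.group(1)) if match else None


published = "GlueServiceType: srm\nGlueServiceVersion: 1.1.0"
attrs = parse_attributes(published)
print(srm_major_version(attrs))
```

Matching on the leading number rather than on one exact version string means that any 1.N.0 value is treated as SRM version 1.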

LHCb Service

  • Working to get DIRAC3 ready for May running and also stripping for this CCRC phase
  • Implementing SAM tests for SRMv2 (we envisage some transfers from X to Y)
  • Question: What about the PIC and NIKHEF implementation of the read-only LHCb LFC system?
    • Gonzalo - We hope to have it at PIC by May (hopefully in April). After Easter.

OSG Items

Note the comment in ticket GGUS:31037 from:
  • dimou 2008-03-10 16:14 in Public Diary, saying:

As nobody joined the session on OSG-GGUS issues at the Operations' meeting of 2008-03-10, can the OSG ticketing system experts and the GGUS developers debug this please?

How does one explain that it remains open despite the comments in the public diary above?

Action Items

Newly Created Action Items

  • Main.Marcin (due 2007-03-19, closed 2008-03-21): Marcin to produce a list of examples where a site failure is attributed to a central service failure.
    • Update 19th March: Marcin supplied some examples. The problem is well understood; the solution is less obvious. John to work with the SAM & GridView team.
  • GridView (due 2007-04-27, closed 2008-03-21): Please look into GGUS:33850 concerning transparent downtimes affecting site availability.
    • Update 20/3/08: GridView team has fixed the bug (CVS tag gridview-synchronizer-20080318).
    • Update 7/4/09: Ticket and action re-opened because gstat also needs a fix.
  • Main.SAM (due 2007-03-19, closed 2008-04-02): SAM team to promptly investigate the BDII2SAM script so that it recognises GlueServiceType/Version SRM/1.10 correctly. GGUS:33726, BUG:31940
    • Update 13th March 2008: BDII2SAM script now fixed; action should be closed following the next meeting.
    • Update 31 March: Script is fixed. Close.

Review of Open Action Items

  • 136: ROCs please check list and produce. What about running with SL3 and SL4? Steve to create another, finer report.
  • 137: 100 GigaBytes. ATLAS cannot check; at the end of the month the action will be closed.
  • 139: Received feedback from Italy, Southwest, and a few others. Close action.
  • 141: No change. Pending on WMS for SL4.
  • 142: No change.

Open Action Items

Id   Submitter   Description   Creation   Due   Assigned To

Actions Closed in Last 20 Days

Id   Submitter   Description   Creation   Due   Assigned To   Closed

AOB.

  • Is the CIC portal currently broken? Check with Osman.
  • Sven: GGUS:33850. OCC, please have a look.
  • LHCb reported that GridView was showing very much the wrong number of running jobs; R-GMA is the transport.

Next Meeting

The next meeting will be Monday, dd mmm 2007 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r12 - 2008-05-06 - SteveTraylen