Week of 100614

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Harry(chair), Jaroslava, Laurence, Ulrich, Eva, Lola, PeterK, Andrea, JanI, Jamie, MariaG, MariaDZ, Alessandro, Jean-Philippe, Gavin, Dirk, Roberto, Simone); remote(Jon+Catalin(FNAL), Joel(LHCb), Gonzalo(PIC), Michael(BNL), Gang(ASGC), Angela(KIT), Gareth(RAL), Rob(OSG), Rolf(IN2P3), Onno(NL-T1)).

Experiments round table:

  • ATLAS reports - CERN-PROD_SCRATCHDISK: the issue was solved on Saturday morning, thanks! (https://gus.fzk.de/ws/ticket_info.php?ticket=58904)

Overloaded BDIIs at several sites over the weekend resulted in failing jobs with problems retrieving inputs: Australia-Atlas, JINR, Nikhef, INFN-T1. This has become a recurring issue over the past two months. What is the underlying problem? Laurence Field asked for more details - which sites and which clients. Alessandro reported CERN, RAL and INFN, and that worker nodes querying their local BDII were timing out. Peter Kreuzer thought CMS is also seeing this in some SAM tests at INFN and some Tier 2s. Laurence thought the client timeout was hard-wired at 15 seconds but he will follow this up together with Alessandro.
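
For illustration (not discussed in the meeting), the kind of query a worker node makes against its site BDII can be reproduced by hand with a client-side time limit matching the ~15 s timeout mentioned above; the hostname below is a placeholder:

    # Query a site BDII (placeholder hostname) with an explicit 15-second
    # client-side time limit, mimicking the hard-wired client timeout.
    ldapsearch -x -LLL -l 15 \
        -H ldap://site-bdii.example.org:2170 \
        -b o=grid \
        '(objectClass=GlueCE)' GlueCEUniqueID
    # If the server is overloaded this returns "Timed out", reproducing
    # the "problems retrieving inputs" symptom seen by the failing jobs.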

Dear T1s, please take the opportunity this week to perform the FTS upgrade! The LHC is commissioning this week, therefore there is not much data to take and no data to export from ATLAS to the T1s (except functional tests).
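
As a hedged aside (endpoint and channel names below are placeholders, assuming the gLite FTS 2.x command-line tools), a quick way for a shifter to check that an upgraded FTS answers and still has its channels configured:

    # Placeholder endpoint; substitute the T1's real FTS service URL.
    FTS=https://fts.example-t1.org:8443/glite-data-transfer-fts/services/FileTransfer
    # List the configured channels to confirm the upgraded service responds.
    glite-transfer-channel-list -s $FTS
    # Show the settings of a single channel (channel name is illustrative).
    glite-transfer-channel-list -s $FTS CERN-EXAMPLE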

  • CMS reports - T0 Highlights: 1) cosmics. 2) data taking during machine development. 3) CMS DB issue from last week: the T0 database service (TOAST) is OK now after restarting DBS and moving it away from the DB node that TOAST is using. 4) Today: intervention on the storage manager DB at Point 5. Afterwards it affected the CMS T0 (SM host down?); the T0 team changed settings and it is working again.

T1 Highlights: 1) 2 GGUS Team tickets opened to INFN-T1 (https://gus.fzk.de/ws/ticket_info.php?ticket=58920 and https://gus.fzk.de/ws/ticket_info.php?ticket=58987; they may or may not be the same issue), both related to SW configuration: the admins at INFN-T1 made a general fix and asked CMS to confirm things are fine now; to be followed up. 2) "BDII not visible" issue at INFN-T1 on Saturday 12 June, covered in the same ticket as above (https://gus.fzk.de/ws/ticket_info.php?ticket=58987).

T2 Highlights: 1) MC production as usual. 2) "BDII not visible" issues at the 2 CMS Lisbon T2s: see the output from the Dashboard Site Status Board: http://dashb-ssb.cern.ch/templates/cache/bdii_log.html#T2_PT_NCG_Lisbon and http://dashb-ssb.cern.ch/templates/cache/bdii_log.html#T2_PT_LIP_Lisbon

  • ALICE reports - GENERAL INFORMATION: Pass 1 reconstruction activities are ongoing together with two analysis trains. No MC production activities during the weekend. In terms of raw data transfers, very low activity for the moment.

T1 sites: CNAF: on Sunday morning the experts detected a problem at the local ce07 (CREAM). Connections to the service were being refused at submission time. The site admin was immediately informed and took action within a few hours (therefore no GGUS ticket was opened).
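
As an illustration only (the full ce07 hostname is not given in the minutes, so a placeholder is used, and this assumes the standard CREAM CLI), a refused-connection symptom like this can usually be reproduced from a UI before or instead of ticketing:

    # Ask the CREAM CE whether it currently accepts job submissions;
    # the hostname is a placeholder for the CNAF ce07 endpoint.
    glite-ce-allowed-submission cream-ce.example.org:8443
    # A connection-refused error here confirms the problem is at the
    # service itself rather than in the submitting framework.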

FZK: also on Sunday morning a restart of all the local services at both VOBOXes was necessary. ALICE experts are investigating why this is needed from time to time at several VOBOXes.

T2 sites: Clermont: On Saturday night experts detected incorrect information reported by the local information system of the CREAM-CE: a large number of ALICE jobs appeared in status running although the experiment had stopped submitting new agents almost 24h earlier. The issue was reported to the ALICE expert at the site during the weekend and this expert confirmed this morning that the problem is gone. The site is back in production.

Cagliari and CyberSar-Cagliari: Both sites are out of production. The local AliEn user proxy expired in both local VOBOXes. The responsible person has been informed; waiting for his actions.
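
A hedged sketch (the file path convention and the 24-hour threshold are illustrative) of the kind of check that would catch an expiring VOBOX proxy before it takes a site out of production:

    # Remaining lifetime, in seconds, of the proxy on the VOBOX;
    # /tmp/x509up_u<uid> is the conventional proxy location.
    voms-proxy-info -file /tmp/x509up_u$(id -u) -timeleft
    # Warn when less than 24 hours (86400 s) remain.
    if [ "$(voms-proxy-info -file /tmp/x509up_u$(id -u) -timeleft)" -lt 86400 ]; then
        echo "WARNING: VOBOX proxy expires in less than 24 hours"
    fi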

  • LHCb reports - Experiment activities: Running several MC productions at a low profile. Merging production. GGUS (or RT) tickets:

T1 site issues: CNAF: no shared area variable defined (GGUS:58985). IN2P3: SRM endpoint not available on Saturday; SAM tests confirm this outage (GGUS:58994).

T2 site issues: Shared area issues at BG05-SUGrid (GGUS:59015) and IL-TAU-HEP (GGUS:59007).

Sites / Services round table:

FNAL: Up to date with FTS.

ASGC: Will migrate to FTS 2.2.4 tomorrow.

KIT: Migrated FTS last week. Now planning to go to FTS under slc5. Had a hardware problem with their CMS dcache headnode during the weekend - 4 hour downtime.

RAL: Had BDII issues over the weekend (they had missed an upgrade) and were failing ops SAM tests. Ready to perform the FTS upgrade and proposing Wednesday. Installation problems with new disk servers last week for ATLAS and CMS - fixed by Friday. Gareth queried whether the SAM to Nagios switchover is still scheduled for 15 June. Jamie replied this date had been chosen to mesh with an MB meeting that has now been postponed. There are differences in the test results (there are different algorithms, of course) that MB members would like to fully understand, so there is no new date yet. Gareth also queried whether access to the results database will change - Harry to follow up.

OSG: Problems in the OSG-GGUS interface on the OSG side over the weekend but no updates were lost.

IN2P3: The LHCb SRM crash outage was due to a known bug that is mitigated by an automatic restart, and the service was back before the ticket was created.

NL-T1: Will upgrade their FTS tomorrow.

CERN CEs: Planning to migrate four lcg-CEs that submit to slc4 over to slc5 as soon as possible, and to update the CREAM-CE to release 3.2.6. This will take some time as the CEs need to be drained of jobs.

CERN CASTOR: Adding 128 TB to the ATLAS Tier 3 disk pool as requested by B. Panzer.

AOB: Simone reported that Friday's high OPN traffic between CERN and RAL was caused by an ATLAS user of the CERN LSF batch reading data from RAL. He was surprised that CERN worker nodes are on the OPN as he did not think other Tier 1s were configured that way. Gareth pointed out that if they were not on the OPN, the GPN would have been overloaded.

Tuesday:

Attendance: local(Harry(chair), Alessandro, Lola, PeterK, JanI, Ulrich, Roberto, Maarten, Gavin, Eva);remote( Angela(KIT), Michael(BNL), Joel(LHCb), Jeremy(gridpp), Catalin(FNAL), Gang(ASGC), Ronald(NL-T1), Rolf(IN2P3), Tiju(RAL), Rob(OSG)).

Experiments round table:

  • ATLAS reports - CERN-PROD_SCRATCHDISK: the locality of some files in CERN-PROD_SCRATCHDISK is LOST (GGUS:59035). Jan Iven reported this is due to the disk being drained. The system knows the files are really NEARLINE (i.e. on tape) but cannot access them with minimal latency, so it flags them as LOST. Maarten thought that srm-ls should not, however, report them as LOST, which it apparently does.
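
For reference (the SURL below is a placeholder, not a real ATLAS file), the locality that SRM reports can be inspected from a UI with the LCG data-management tools; lcg-ls -l prints the SRM file locality (ONLINE, NEARLINE, LOST, ...) alongside the size:

    # Query the SRM file locality directly against the CERN ATLAS endpoint;
    # the path is illustrative only.
    lcg-ls -l -b -D srmv2 \
      "srm://srm-atlas.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/grid/atlas/atlasscratchdisk/example.file"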

Today's downtimes: 1) SARA-MATRIX FTS upgrade 9-12 CET. 2) TAIWAN FTS upgrade 10-11 CET. 3) BNL FTS upgrade 17-18 CET.

Last Friday and Monday the ATLAS disk space availability monitor in the CERN SLS was showing grey for dCache sites because their CRLs had not been updated. Maarten reported two problems: the master CRLs at CERN receive many outside accesses and are on busy AFS volumes (this is being worked on), and the dCache client insists that all lines of a CRL are parseable; recently a French site had a zero-length record which killed the client.
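
A hedged sketch (assuming PEM-encoded .r0 files as installed by fetch-crl in the standard location) of how a malformed CRL of the kind described above can be spotted on a node:

    # Ask OpenSSL to parse every installed CRL; any file it cannot parse
    # (e.g. one containing a corrupt or zero-length record) is reported.
    for crl in /etc/grid-security/certificates/*.r0; do
        openssl crl -in "$crl" -noout >/dev/null 2>&1 \
            || echo "Unparseable CRL: $crl"
    done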

  • CMS reports - T0 Highlights: 1) cosmics. 2) data taking during machine development.

T1 Highlights: Many re-reco and skimming jobs stuck at various T1s due to a central WMS issue, see http://savannah.cern.ch/support/?115119, identified by the WMS admins as a problem with Condor on wms012 at CNAF. It is stuck with 40K jobs; the WMS is now drained and no longer accepting jobs. The admins are debugging the issue. Jobs will be aborted manually and need to be re-submitted urgently.
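
For reference, a hedged sketch (the job-ID file names are placeholders) of the standard gLite WMS commands a submitter would use to check, cancel and re-submit jobs stuck on a drained WMS such as wms012:

    # Check the state of previously submitted jobs whose IDs are in jobs.txt.
    glite-wms-job-status -i jobs.txt
    # Cancel the stuck jobs without interactive confirmation.
    glite-wms-job-cancel --noint -i jobs.txt
    # Re-submit the corresponding JDL against a healthy WMS
    # (automatic proxy delegation; file names are placeholders).
    glite-wms-job-submit -a -o jobs_new.txt job.jdl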

T2 Highlights: MC production as usual.

Weekly-scope Operations plan

[Data Ops]:

Tier-0: data taking if machine provides stable beam, otherwise modest testing

Tier-1: re-reconstruction passes and MC re-digitization/re-reconstruction

Tier-2: Plan on running full-scale MC production at all T2s

[Facilities Ops]:

Fixing various bugs on the CMS WebTools front (SiteDB, PhEDEx Web), which had to be restarted or operated manually frequently over the last 10 days.

Finalizing Vidyo accounts for remote CMS Centers participating in Computing Shifts.

Continue to test and integrate Critical Service recovery procedures for Computing Run Coordinator (CRC)

  • ALICE reports - GENERAL INFORMATION: Four analysis trains ongoing today with no MC production activity expected for the moment.

T0 site: No issues to report

T1 sites: 1) CNAF: The site admin is still working on the local CREAM-CE issue reported yesterday. The CREAM-CE developers have suggested migrating the current service to CREAM 1.6/SL5. 2) Minimal activity today at the T1 sites.

T2 sites:

Cagliari and CyberSar-Cagliari: The issue reported yesterday (expired AliEn user proxy) has been solved. Both sites are back in production.

RRC-KI: The ALICE responsible person has announced a cooling problem at the site. Services are out of production.

Kosice: CREAM-CE out of production. The issue has been announced to the responsible person at the site.

Hiroshima: The local SE has been taken out of production for xrootd update

  • LHCb reports - Experiment activities: 1) Running several MC productions at low profile. 2) Merging production.

T0 site issues: FTS transfers out of CERN are failing (GGUS:59037). Jan Iven reported there are 2 overloaded disk servers in the default service class, which is the first one looked at by SRM. This order can be changed, so LHCb should consider this.
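
As background only (this is not the fix being discussed, which concerns the server-side order in which SRM searches service classes), CASTOR clients can pin a service class explicitly instead of relying on the default one; the service class name and path below are illustrative:

    # Pin a CASTOR service class for direct stager/rfio access, bypassing
    # the default service class (names are placeholders).
    export STAGE_SVCCLASS=lhcbmdst
    stager_qry -M /castor/cern.ch/grid/lhcb/example.dst
    rfcp /castor/cern.ch/grid/lhcb/example.dst /tmp/example.dst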

T1 site issues: CNAF : Problems transferring to CNAF with FTS (GGUS:59038)

Sites / Services round table:

  • BNL: Performing the FTS upgrade - a downtime of less than 1 hour is expected.

  • ASGC: The FTS upgrade was completed this morning.

  • NL-T1: The FTS upgrade was completed this morning.

  • RAL: The FTS upgrade will be done tomorrow morning.

  • CERN CE: The four CEs ce124 to ce127 will be drained tonight and converted to submit to slc5 worker nodes. This will still leave 8 submitting to slc4.

  • CERN CREAM-CE: Planning to upgrade one CREAM-CE to release 1.6 on Thursday. Would like ALICE to confirm ahead.

  • CERN CASTOR: There was a 20 minute glitch on castorlhcb this morning while moving machines.

  • CERN VOMS: The host certificate of voms.cern.ch will be updated on Wednesday 16th June 10:00 CEST. If you have the lcg-vomscerts package installed on your service then you must have updated to version 5.9.0-1 of this package by this time.
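
A hedged one-liner for service managers to verify the package version before the certificate switch (the expected output shown in the comment is illustrative):

    # Confirm that lcg-vomscerts 5.9.0-1 or later is installed before the
    # voms.cern.ch host certificate is changed.
    rpm -q lcg-vomscerts
    # Expected output (illustrative): lcg-vomscerts-5.9.0-1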

AOB: Laurence Field has looked into the ATLAS BDII client timeouts reported yesterday and concluded they were mostly individual glitches. The RAL incident happened because the CERN BDII exceeded 5MB with the addition of some software tags, and RAL had not yet applied the increase to 10MB that was in the last release of the BDII configuration. Laurence/Maarten have suggested that the GlueLocation object, which takes 1.5MB, may not be necessary as its information is duplicated in SoftwareRunTimeEnvironment, so would experiments please check whether they use this object and let us know (wlcg-scod@cern.ch).
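
A hedged sketch (the BDII hostname is a placeholder) of how an experiment can check what it would lose if GlueLocation were dropped, by comparing the published GlueLocation entries with the software tags in the subcluster objects:

    # List GlueLocation entries published by a site (placeholder hostname).
    ldapsearch -x -LLL -H ldap://site-bdii.example.org:2170 -b o=grid \
        '(objectClass=GlueLocation)' GlueLocationName GlueLocationPath
    # Compare with the software tags published per subcluster.
    ldapsearch -x -LLL -H ldap://site-bdii.example.org:2170 -b o=grid \
        '(objectClass=GlueSubCluster)' GlueHostApplicationSoftwareRunTimeEnvironment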

There will be a reduced attendance for the next 3 days due to the data management jamboree in Amsterdam.

Wednesday

Attendance: local(Harry(chair), Lola, Ulrich, JanI, David, MariaD, Eduardo, Oliver, Alessandro);remote(Gonzalo(PIC), Cristina(CNAF), Gang(ASGC), Xavier(KIT), Rolf(IN2P3), Joel(LHCb), Tiju(RAL), Rob(OSG), Catalin(FNAL)).

Experiments round table:

  • ATLAS reports - Today's downtime: RAL FTS upgrade 9-13 CET.

  • CMS reports - Tier 0 : following the LHC schedule and testing the Tier 0 processing infrastructure.

Tier 1: CNAF have patched their WMS that caused many stuck jobs yesterday and the jobs are running again, but merge jobs cannot be dispatched (they need the processing jobs to report back) and Tier 1 merge areas are filling up. The 175 TB merge area at FNAL is already full and processing has been stopped there. Would other sites please tell CMS if their merge areas are getting full. CMS estimate they have lost 4 days' worth of processing (the ticket was raised on a Saturday but not acted on until Monday) and will be thinking again about how to handle priority problems. Over the same weekend there was a record of 16000 concurrent jobs at Tier 2s.

  • ALICE reports - T0 site: Concerning the migration of the current CREAM services at CERN to CREAM 1.6, ALICE gives the green light. Please inform the experiment of the date and time when to stop sending jobs to these systems.

T1 sites: Raw data transfers yesterday to FZK and CNAF with no incidents to report. All T1 sites are in production today with minimal activity.

T2 sites: KFKI: the last site without a CREAM system. The system has been fully configured and announced to ALICE; it is currently in the testing phase before being put into production.

  • LHCb reports - 1) So far no replies on the CERN and CNAF tickets from yesterday. For CERN, Jan Iven clarified that LHCb is best placed to decide on the order in which SRM searches CASTOR service classes. Joel would, however, like to know if there are any side effects, so Jan will reply to that. For the CNAF FTS problems, Cristina reported they have a disk problem that is being worked on, after which the ticket will be updated. 2) OK for CERN to go ahead with the CREAM-CE upgrade on ce201. 3) At CERN, bjobs -N does not seem to return the correct normalised CPU times used by a job - Ulrich will take this offline. 4) Will reply to PIC about the failing LHCb jobs on Monday. 5) A CERN ticket on failure to receive an alarm SMS had been closed without comment, but the SMS now seems to be working as it should. Harry will follow up.

Sites / Services round table:

  • PIC: FTS 2.2.4 was installed last week.

  • CNAF: Finishing FTS 2.2.4 testing - hoping to make the upgrade tomorrow.

  • BNL (by email): FTS was successfully upgraded at BNL to v 2.2.4.

  • NL-T1 (by email): NTR.

  • RAL: FTS upgrade completed this morning.

  • FNAL: The CMS merge area is full and a few SAM tests failed.

  • CERN CEs: The 4 lcg-CEs submitting to slc4 are now being drained. Will try to migrate them tomorrow in the same time slot as the CREAM-CE.

  • CERN Databases: Addressing what seems to be an Oracle bug affecting the LHCb dashboard.

  • CERN VOMS: Host certificate for voms.cern.ch was updated at 10:45 this morning.

  • CERN CASTOR: ATLAS SCRATCHDISK draining has finished and all files are accessible again. Some hardware in the LHCb MDST space token is being replaced, which will provoke some disk-to-disk copies.

AOB: Alessandro reported that ATLAS had closed the GGUS ticket as solved but left the Savannah report open. MariaD reminded that the normal policy is to close a GGUS ticket that led to a Savannah bug report only when the bug is closed in Savannah, but agreed with Alessandro that this ticket was different in that the issue was inaccessible files which are now accessible, so it makes sense to close this GGUS ticket. Should a similar incident happen, a fresh GGUS ticket will be raised.

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 08-Jun-2010
