Week of 130204

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE, ATLAS, CMS, LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance: local(AndreaS, Alexandre, Rod, Mark, MariaD, Jerome, Ian, Maarten/ALICE, Jan, Ueda);remote(Michael/BNL, Saverio/CNAF, Onno/NL-T1, Wei-Jen/ASGC, Pepe/PIC, Lisa/FNAL, Roger/NDGF, Rolf/IN2P3, John/RAL, Dimitri/KIT, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • IN2P3-CC lcg-cp problem for big files
      • Fixed in latest EMI update but some issue with tarball on AFS. Critical for merge type jobs.
        • Is this the semi-official (UK) tarball? Is it updated? [Maarten will alert the maintainer to urgently update the tarball with the patch. Rolf and Maarten urge ATLAS to create a ticket, even if the ATLAS contact is already aware of the problem. Maarten will point Rolf to a page explaining how to set up an EMI-2 tarball WN.]
      • Reminder to upgrade. Big output means long job means big waste if it cannot store output.
    • QMUL to RAL transfer backlog due to FTS returning error on poll for a particular job. FTS sees a special character in SRM output.
      • Caused by StoRM but FTS and DDM handle the consequences badly. [Maarten finds it strange that FTS2 is seeing this problem now; maybe it is a consequence of a fix made at QMUL to make FTS3 work?]
    • Panda holding jobs increases on single panda node. voatlas250 seems to have been preferred by pandaserver.cern.ch DNS round robin.
    • Many voatlas VMs were hard rebooted over the weekend. SS services did not come back cleanly. All recovered now.
    • Friday afternoon IT change on lxbatch led to xrdcp failure for grid prod and analy jobs. It was fixed by ATLAS changing setup.

  • CMS reports -
    • LHC / CMS
      • Starting the final week
    • CERN / central services and T0
      • Problem reported on Castor Xrd reads (GGUS:91141). This is also the error we saw on the HLT Cloud testing. [Jan: this is being investigated.]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: ran OK with ~4500 concurrent ALICE analysis jobs during the weekend (for production jobs the limit would be 6k).

  • LHCb reports -
    • Activity as last week: reprocessing, prompt-processing, MC and user jobs.
    • T0:
      • An issue with a NAGIOS test not running in the last few days (GGUS:91126)
    • T1:
      • RAL: Some FTS transfers are failing due to strange timeout during transfer. Only on some files. Experts are investigating. [John informs us that the ticket has just been updated.]

Sites / Services round table:

  • ASGC (Wei-Jen): Taiwan will have a 9-day holiday for Chinese New Year, from Feb. 09 to Feb. 17. During the holiday all services in Taiwan will remain online and in production. On-site staff will cover service operation in daytime (0200 - 1100 UTC) during the holiday. We will also check tickets assigned from GGUS and Savannah as usual and try to solve problems as soon as possible.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: during the weekend there was a problem with a switch at the Finnish site, causing connection problems. It was replaced today, everything seems OK. For tomorrow there is a scheduled downtime for some ATLAS and ALICE disk pools; it should last for about one hour.
  • NL-T1: ntr
  • PIC: the period at reduced capacity to save on electricity costs has ended and PIC has been back at 100% capacity for a few hours.
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: last Saturday there was a problem with the batch service: jobs would not start running. It was understood and solved in a few hours.
  • CERN storage services:
    • Last Saturday the DN-to-user mapping in EOS was broken due to the unavailability of a mapping file on AFS, maintained by CMS. The mapping is normally updated only once per day and only on week days, for two reasons: in the past it was suspected that changes in it might cause BestMan instabilities, and frequent updates were avoided so as not to overload the VOMS servers. Maarten says that it should not be a problem to contact VOMS even once per hour.
    • Finished a transparent upgrade to EOSCMS.
    • Deployed a new xroot client for ATLAS which turned out to be broken; the problem lasted from Friday afternoon until Saturday, when ATLAS fixed it. A SIR will be prepared.
    • Yesterday the VM reboot killed all xrootd ATLAS redirectors. The outage lasted for two hours. We were surprised not to see any ticket from ATLAS.
  • Dashboards: a few Dashboard machines were impacted by the VM reboot problem, but everything restarted without problems and no data was lost, also thanks to the load balancing alias for the message broker.

  • GGUS:
    • There was a system inaccessibility around 10am CET today. It was due to an issue with a backbone router. [Maria asks Dimitri to check with the KIT network people about the exact cause.]
    • Reminder: There will be a meeting tomorrow 2013/02/05 at 2:30pm CET about authentication methods alternative to personal certificates for GGUS and possibly for GOCDB. WLCG position summarised in Savannah:132872#comment6 . Agenda with connection details on https://indico.cern.ch/conferenceDisplay.py?confId=229577 . Room is 28-1-025 (~12 people only but there is audioconf). This was announced in this meeting last Tuesday and in email.
    • WLCG Ops Coord Meeting this Thursday. Please submit GGUS ticket numbers to wlcg-operations@cernNOSPAMPLEASE.ch if support is not satisfactory. MariaD will present them, if no action was taken between now and then.

AOB:

Tuesday

Attendance: local(AndreaS, Rod/ATLAS, Jan, Alexandre, Mark/LHCb, Jerome, Peter/CMS);remote(Michael/BNL, Saverio/CNAF, Xavier/KIT, Wei-Jen/ASGC, Lisa/FNAL, Pepe/PIC, Ronald/NL-T1, Tiju/RAL, Roger/NDGF, Rob/OSG, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports -
    • Load balancing on pandaserver.cern.ch appears to favour certain machines (from the 8)
      • As I understand it, the load balancing only removes the busiest machine from the alias, leaving 7. Under investigation.
    • xrdcp patch will be left in place as we investigate moving to cvmfs resident client
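
As an illustration of the load-balancing rule as ATLAS describes it above (drop the busiest machine, keep the other 7 behind the alias), here is a minimal Python sketch. The host names, the load values and the function name select_alias_members are hypothetical; this is not the actual CERN DNS load-balancing code, only the selection rule as reported.

```python
from typing import Dict, List


def select_alias_members(load_by_host: Dict[str, float], drop: int = 1) -> List[str]:
    """Return the hosts kept behind the alias after removing the `drop` busiest.

    Assumption: a single numeric load value per host, higher meaning busier.
    """
    if drop < 0 or drop >= len(load_by_host):
        raise ValueError("drop must be >= 0 and smaller than the number of hosts")
    # Rank hosts from least to most loaded; break ties by name for determinism.
    ranked = sorted(load_by_host, key=lambda host: (load_by_host[host], host))
    keep = ranked[: len(ranked) - drop]
    return sorted(keep)


# Hypothetical example: 8 machines, the busiest one is removed and 7 remain.
loads = {f"node{i}": float(i) for i in range(1, 9)}
members = select_alias_members(loads)
```

The sketch only covers which machines stay in the alias; it says nothing about how clients cache or order the addresses they receive.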

  • CMS reports -
    • LHC / CMS
      • Starting the final week.
      • p-Nucleus PH run going on smoothly (2 fills yesterday)
    • CERN / central services and T0
      • CMS Tier-0 running ok with partially new WMAgent component (RECOing) and partially old ProdAgent component (for REPACKing)
      • CMS finally moving toward blocking user-reading from CASTOR: warning to be sent to all CMS collaborators this week and action to be finalized by March 1st, 2013. [Jan asks if read access will be blocked only for data on tape or for all CASTOR data: it is for all data. Will contact Stephen Gowdy to decide how to implement the actual blocking.]
      • 2:00 PM today : Instabilities with the CMS Twiki Service. The CERN/IT responsible has been notified and acknowledged an issue.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
    • AOB
      • Peter Kreuzer CRC-on-Duty until the end of the Run

  • ALICE reports -
    • CNAF: jobs steadily drained away since yesterday morning, experts looking into it.
    • KIT: now running at full capacity, ~6800 concurrent analysis jobs, so far without problems; it is not yet clear what change allowed this to happen.

  • LHCb reports -
    • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
    • T0:
      • NAGIOS test for swdir at IN2P3 still not running as frequently as others (once in the last 24 hours) (GGUS:91126)
    • T1:
      • RAL: FTS transfers now going through without issue. Not sure what solved it but some tests still to be run.
      • RAL: Seen an increase of SetupProject errors - not a major problem, but any possible reason for this (e.g. AFS decommissioning)? [Tiju is not aware of any CVMFS problem and the AFS decommissioning has not yet started, so it must be something else.]
      • IN2P3: Yesterday and last night a number of 'Bus errors' were reported across many WNs. The problem has gone away now, but we were wondering if there was a possible CVMFS glitch? [Rolf suggests contacting the LHCb contact; it has not yet been done because the problem was observed only this morning.]

Sites / Services round table:

  • ASGC: ATLAS opened a ticket (GGUS:91160) because of job failures in staging output files to CASTOR via lcg-cp. Still under investigation.
  • BNL: ntr
  • CNAF: there is an intervention on GPFS for ATLAS ongoing, as scheduled in GOCDB. It should finish this Friday. [Rod adds that there is another problem at CNAF, on a recently opened long queue which sees the wrong software area. A ticket was sent to the Italian grid support.]
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Dashboards: ntr

AOB:

Wednesday

Attendance: local(AndreaS, Ueda, Peter/CMS, Alexandre, Rod/ATLAS, Jerome, Mark/LHCb, Jan, Maarten/ALICE, Luca);remote(Wei-Jen/ASGC, Matteo/CNAF, John/RAL, Roger/NDGF, Pepe/PIC, Pavel/KIT, Alexander/NL-T1, Rolf/IN2P3, Kyle/OSG, Lisa/FNAL, MariaD).

Experiments round table:

  • ATLAS reports -
    • 3 CERN pilot factories had jobs stuck in submit state to multiple sites, inc. RAL, KIT.
      • Suspect network glitch around 02:30 UTC, but no evidence for that.
    • QMUL-RAL FTS problem back
      • QMUL FINAL:DESTINATION: User belonging to VO pDɪ [John: something similar had happened in April; Andrew seems to remember that the problem was on the RAL side and Shaun will look into this as soon as he is back, next week.]
    • lcg-cp store to Castor at TW fails since pilot update.
      • Update included the recommended --srm-timeout=3600
      • Good work from Felix, summarized on GGUS:91160, shows:
        • if srm preparetoput does not respond within 2s, then the client waits a full hour (the srm-timeout value) before asking again
        • with lcg-utils 1.14 and no srm-timeout argument, the behaviour is ok
      • In the 3rd hack, for this WLCG SW problem, ATLAS sends a pilot without the --srm-timeout option to TW [Maarten will make sure that the developers look into this problem as soon as possible.]
      • CERN Castor stores unaffected (uses xrdcp). RAL also unaffected, but it is not clear why - perhaps it always responds in under 2s.
    • Nagios test of WN version required /etc/emi-release - not present on tarball installed sites
      • No reason to look there, and at least LRZ cannot create this file
      • Various tickets open for nagios fix, e.g. use env as base, use emi-version executable in path
        • trivial to fix, so I expect it will take ages
        • 90974, 91211, 90768 [Maarten: EGI is aware of this problem and no site will be suspended for not upgrading to the EMI-WN tarball until this and other issues have been fixed.]
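
The lcg-cp behaviour summarized from GGUS:91160 above can be put into a toy timing model. The Python sketch below is not lcg-utils code; the function name and parameters are hypothetical, and it assumes that the SRM request is ready by the time the client asks again.

```python
def seconds_until_client_proceeds(srm_response_delay_s: float,
                                  srm_timeout_s: float,
                                  initial_window_s: float = 2.0) -> float:
    """Toy model of the reported lcg-cp wait when --srm-timeout is given.

    Assumption: if the SRM answers within the initial window (2s in the
    report), the client proceeds as soon as the answer arrives; otherwise it
    asks again only after the full srm-timeout, by which time the SRM is
    assumed to be ready.
    """
    if srm_response_delay_s < 0 or srm_timeout_s <= 0:
        raise ValueError("delays must be non-negative and the timeout positive")
    if srm_response_delay_s <= initial_window_s:
        return float(srm_response_delay_s)
    return float(srm_timeout_s)


# With the recommended --srm-timeout=3600:
fast_site = seconds_until_client_proceeds(1.5, 3600)   # answer inside the 2s window
slow_site = seconds_until_client_proceeds(2.5, 3600)   # answer just outside it
```

In this model a site that answers just after the 2s window costs the job the full hour, which matches the report that stores to TW failed while sites answering faster were unaffected.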

  • CMS reports -
    • LHC / CMS
      • p-Nucleus PH run ongoing, only 2 short fills in the last 24h
    • CERN / central services and T0
      • Savannah-GGUS-CMS (see Savannah:131565) : CMS will provide feedback in tomorrow's WLCG Ops Coord meeting (or via the ticket), on the proposed development actions from the GGUS side, in view of abandoning Savannah
    • Tier-1:
    • Tier-2:
      • NTR
    • AOB
      • Peter Kreuzer CRC-on-Duty until the end of the Run

  • ALICE reports -
    • CNAF: still no jobs running, being investigated.

  • LHCb reports -
    • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
    • T0:
      • NAGIOS ticket ongoing (GGUS:91126) [Rolf: is it possible that this is our fault? Mark: not likely, the other tests run fine.]
    • T1:
      • NTR

Sites / Services round table:

  • ASGC:
    • This morning the CASTOR nameserver crashed for two hours, due to a disk failure.
    • The problem reported today by CMS is now fixed.
  • CNAF: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: two dCache file servers went down last night due to a kernel panic.
  • PIC: we have received new disk servers. We will prepare a plan for the data migration and we should have the 2013 pledges installed by April. As there was some overpledging, it is possible that there will be a reduction in the disk space.
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage services:
    • this morning we lost one LHCb disk server due to a RAID controller failure; a few files had not yet been migrated to tape. It should be fixed by tomorrow. Mark: those files can be retransferred from the pit, don't bother if recovering them from the disk is too time-consuming.
    • A new ATLAS regional redirector for the Spanish cloud has been set up.
    • We will stop the tests against the CMS global redirector because they cause authentication errors at some sites (e.g. Nebraska).
  • Dashboards: ntr
  • Databases: ntr

AOB:

  • MariaD advises against reporting at tomorrow's WLCG operations coordination meeting on the discussions between CMS and GGUS for the Savannah to GGUS migration: this is because these discussions are far from concluded. It is better to wait for the next meeting.
  • MariaD asks if CMS wants the ticket GGUS:91055 to be escalated at the coordination meeting tomorrow or escalated via GGUS so that the CERN service managers can take proper action. Peter will look at the ticket and decide.

Thursday

Attendance: local(AndreaS, Jerome, Peter/CMS, Alexandre, Rod/ATLAS, Mark/LHCb, Jan, Maarten/ALICE, Ueda);remote(Wei-Jen/ASGC, Matteo/CNAF, Ronald/NL-T1, Lisa/FNAL, Jeremy/GridPP, Gareth/RAL, Roger/NDGF, Woo-Jin/KIT, Pepe/PIC).

Experiments round table:

  • ATLAS reports -
    • Production backlog in NL cloud
      • Reco of SARA resident data
      • Get 1000 slots at SARA. Also using T2 but constrained by input data transfer
      • 2000 slots at NIKHEF (technically T1) also constrained by ddm/fts
        • Can read directly from SARA, but stopped doing this in the past (overloaded gbit connection)
      • Question: what is the network pipe nikhef-sara? Ronald: 20 Gbps

  • CMS reports -
    • LHC / CMS
      • LHC run was extended until Thursday Feb 14, 6AM, in particular a low energy pp run (specially requested by CMS) between Sunday morning and Thursday
    • CERN / central services and T0
      • CMS Tier-0 is ready for the pp run
      • IMPORTANT: CMS and CERN/IT have scheduled an update of EOSCMS on Thu Feb 14th. Given the run extension, may we postpone this update to the following week? Jan: no problem.
      • Same question as above for the rolling Oracle intervention scheduled for the 12th (https://hypernews.cern.ch/HyperNews/CMS/get/database/1128/1/1.html). Could CMS also ask for this intervention to be delayed to Feb 16th at the earliest?
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
    • AOB
      • Peter Kreuzer CRC-on-Duty until the end of the Run

  • ALICE reports -
    • CNAF: ALICE VOBOX daemons were restarted and jobs started appearing again, but the issue was not understood.

  • LHCb reports -
    • Ongoing activity as before: reprocessing, prompt-processing, MC and user jobs.
    • T0:
      • NAGIOS problem seems to be the result of malformed output from this test. Switched to IN2P3. (GGUS:91126)
    • T1:
      • RAL: Since yesterday, no jobs have been run (GGUS:91251) [Gareth: for some still unknown reason the batch scheduler was starting jobs at a very low rate and the LHCb jobs were not being picked up to run, but in the last few hours it looks OK.]

Sites / Services round table:

  • ASGC: about the lcg-cp problem, we have opened a ticket to the developers (GGUS:91223).
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: about the CMS ticket on authorisation problems for the t1production role: the problem has been solved, but late because of an issue in the GGUS-SNOW bridge: SNOW thought the ticket was "waiting for user" but GGUS did not.
  • CERN storage services: in the last few days about 50% of the CMS files on EOSCMS disappeared: is that just an intended file deletion? Peter: yes, to recover some space for new data. It is agreed that it is desirable to warn in advance about massive deletion campaigns in order to distinguish them from "unintended" deletions (like the incident of a few months ago).
  • Dashboards: ntr

AOB:

  • Today at 15:30 CET there is a WLCG operations coordination meeting (agenda). Connection is via Vidyo, details in the agenda page.

Friday

Attendance: local(AndreaS, Kate, Luc/ATLAS, Peter/CMS, Mark/LHCb, Jerome, David, Jan, Maarten/ALICE);remote(Wei-Jen/ASGC, Matteo/CNAF, Xavier/KIT, Onno/NL-T1, Michael/BNL, Pepe/PIC, Christian/NDGF, Gareth/RAL, Rob/OSG, Lisa/FNAL).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • NIKHEF-ELPROD_DATADISK full. Taken out of T0 export & deletion ongoing
      • PIC SRM failures due to large deletion at IFAE. GGUS:91299. Fixed
    • Tier2s
      • NTR

  • CMS reports -
    • LHC / CMS
      • CMS ready to switch to pp running mode on upcoming Sunday morning
    • CERN / central services and T0
      • problems with CMS Tier0 db instance, see INC:236206 : effect is that a few queries are crashing some CMS Tier0 software components, always with the same error : DatabaseError:(DatabaseError) ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS2' . Investigations are on-going. [Kate: trying to understand the cause, but would like to point out that CMS is using an integration database for production!]
    • Tier-1:
      • T1_ES_PIC : SUM SRM SAM tests failing, see GGUS:91296 . Local admins working on it, apparently due to another VO making massive use of storage resources. [Peter adds that the PIC contact mistakenly closed the Savannah ticket instead of the bridged GGUS ticket]
      • T1_TW_ASGC : 1 CE (cream05) has not been submitting CMS jobs for 2 days, see Savannah:135734. It was originally thought to be a hardware issue, which has been addressed; however, since then a new problem has appeared: when sending pilot jobs the mapping of DNs to local users in the worker nodes fails. Could it be a local proxy setting issue that was changed during the HW check? [Wei-Jen: experts are investigating.]
    • Tier-2:
      • NTR
    • AOB
      • Peter Kreuzer CRC-on-Duty until the end of the Run

  • ALICE reports -
    • NIKHEF: freshly installed EMI-2 CREAM CEs turned out to be again affected by the proxy delegation bug causing all normal ALICE jobs to abort. NIKHEF quickly applied the documented workaround (see GGUS:91277), thanks! GGUS:91279 was opened to alert the CREAM developers and get the documentation fixed.

  • LHCb reports -
    • T0:
      • NTR
    • T1:
      • IN2P3: NAGIOS problem still being investigated (GGUS:91126)
      • RAL: Problem with scheduler resurfaced overnight (GGUS:91251) [Gareth: the problem is not yet understood and it affects all experiments. For now we can just restart the job scheduling by hand.]
      • PIC: Had problems with SRM timing out. Identified as a single problematic user which was then banned.

Sites / Services round table:

  • ASGC: Sunday is the start of the Chinese New Year celebrations, support will be available only during daytime until Mon Feb 18.
  • BNL: a superstorm will hit the East Coast later today; we will not shut down any service but interruptions in the electrical power are possible. If BNL services become unreachable, this will be the likely cause.
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: scheduled downtime tomorrow to move the Finnish pools back to the LHC OPN. It should not last more than 1-2 hours.
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • CERN batch and grid services: ntr
  • CERN storage services: an "at risk" intervention on the CASTOR database, in theory transparent, is scheduled for February 21.
  • Dashboards: ntr
  • Databases: the intervention on the CMS integration database has been rescheduled for February 18, after the end of the run.

AOB:

  • Next Monday and Tuesday the Alcatel phone conference system will be unavailable. To connect to the phone conference, call +41-22-7677000 and ask the operator for the "WLCG operations" conference call, owned by Andrea Sciaba'. The conference is booked from 14:50 to 15:30. Unfortunately callback from CERN is not possible.
Topic revision: r19 - 2013-02-08 - MaartenLitmaath