Week of 091019

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

from OSG related to this ticket yet.

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jamie, Harry, Jean-Philippe, Patricia, Roberto, Andrew, Simone, David, Gang, Olof, Dirk, MariaD, MariaG, Alessandro, Julia);remote(Gonzalo, Michael, Ron, Angela, Gareth).

Experiments round table:

  • ATLAS (Simone)- Files were lost at 2 sites: NLT1 (180 files lost after a drive destroyed a tape) and RAL. They provided a list of 200k files that need to be cleaned from all catalogues. JPB will provide an SQL query to get a list of lost files from the LFC as a crosscheck. The Lyon "hot disk" endpoint is failing continuously. A ticket has been opened because this space contains condition data and DB releases.

  • ALICE (Patricia)- Beginning the migration of the VO boxes to SLC5 and gLite 3.2. At CERN, ALICE is mainly using SLC5. ce130 and ce131 are OK (2000 jobs now). cream202 had problems this morning.

  • LHCb reports (Roberto)- Very successful Monte Carlo run during the weekend. From Friday around 19:00 until Sunday afternoon there was a plateau. For the disk server move to SLC5, LHCb prefers to have a single slot for all disk servers in a pool. Agreed to have an 8-hour intervention for that on 29th October. There is most probably a problem in the ranking expression used in DIRAC, as CNAF reports too little activity and KIT too much. 60 files are not accessible at CNAF. Space token full at KIT and PIC. KIT will add 10TB this afternoon. Nothing will be added at PIC, as PIC already provides more resources than pledged.

Sites / Services round table:

  • Gonzalo (PIC): NTR
  • Michael (BNL): NTR
  • Ron (NLT1): Tomorrow, Oracle RAC will be moved to a new infrastructure and Sara will move to SL5.
  • Angela (KIT): Too many jobs were sent by LHCb over the weekend (20k), and the system needed babysitting. Better to do this outside of the weekend. LHCb will work on improving the ranking algorithm.
  • Gareth (RAL): nothing more to report today
  • Gang (ASGC): DB supposed to be resynced this week but Maria says that the DB is still not in good shape. Maria will send a mail to Jason with the error message.

  • Olof: Sophie wants to apply a patch from Massimo to the CREAM CE.
  • MariaG: migration of the ATLAS DB to new hardware is being done. The LHCb DB was done in the morning. The CMS DB will be done tomorrow and WLCG on Wednesday. Saturday's problem was caused by an issue at the pit.
  • MariaD: will send a mail to OSG: one team ticket (GGUS 52283) was not picked up by OSG.
  • Julia: the Dashboard DB for LHCb will be moved tomorrow.

AOB:

Tuesday:

Attendance: local(Daniele, Gang, Wei, Olof, Sophie, Jean-Philippe, Harry, Jan, Gavin, Jamie, MariaG, Andrew, Julia, Ricardo, Simone, Flavia, MariaD, Patricia, Roberto);remote(Gonzalo, Ronald, Michael, Angela, Gareth, Jeremy, Fabio).

Experiments round table:

  • ATLAS (Simone)- 2 T1s, PIC and NDGF, are down for less than 12 hours, so they were not taken out of production; this inevitably produces failures. UAT (User Analysis) has been postponed to 28-30th October as not enough users were available this week. Agreed to have the SRM upgrade at CERN tomorrow.

  • CMS reports (Daniele)- The CMS Analysis "October Exercise" was officially declared over on Oct 19th, 9am GVA time. There are some issues installing the software at CNAF (not really understood) and IN2P3 (rpm db corruption). Both problems are being addressed. There are also problems at RAL (being addressed) and FNAL (CMS local contact informed), permissions problems at a Russian Tier2, and corrupted files at Caltech. Finally, there are some transfer issues between Tier2s and Tier1s.

  • ALICE (Patricia)- not in full production today, no issue to report. Registration of VO boxes for the proxy renewal service should be addressed to px.support@cernNOSPAMPLEASE.ch.

  • LHCb reports (Roberto)- About 10k jobs for simulation and user analysis. No major issue at CERN. dCache is being upgraded at PIC. There is a problem with the publication of CEs at IN2P3 and KIT; it does not fully explain the problems seen last weekend, so LHCb is working on a smarter ranking algorithm. 60 files are in an inconsistent state at CNAF, so they can be neither accessed nor deleted.

Sites / Services round table:

  • Fabio (IN2P3): The problem of incorrect publication of the CEs was related to the resources on SL5. Currently testing the dCache golden release. If everything is OK, the plan is to put it in production on 9th November.
  • Gonzalo (PIC): Nothing to report
  • Ronald (NLT1): Oracle 3D migrated to new hardware. Still a few problems in accessing some channels: work in progress. Sara migration to SL5: work in progress.
  • Michael (BNL): Nothing to report. Just a comment: the GGUS ticket mentioned by Maria yesterday was not for BNL but for Boston University (problem being addressed).
  • Angela (KIT): not convinced that the incorrect publication in BDII was the cause of the overload during last weekend. Working on a fix to correctly publish information even in case of high load.
  • Gareth (RAL): still investigating both the hardware issue with the disk servers and the DB recovery problems. Catalin is working on getting the list of lost CASTOR files from the LFC.
  • Jeremy (GridPP): Nothing to report
  • Gang (ASGC): introduced a new person (Wei) who will participate in the ATLAS and WLCG daily meetings representing ASGC.

  • Jan: SRM upgrade at CERN on 29th October.
  • Sophie: LFCs (except for LHCb), VOMS and FTS services will be down tomorrow morning for 90 minutes for a DB upgrade.
  • Julia: the Dashboard migration that took place today went well.
  • MariaG: successful migration of the CMS cluster today. The WLCG cluster will be done tomorrow. Resync of ATLAS, LHCb and LFC DBs after the intervention at Sara. DB still not accessible at ASGC (probably a configuration problem).

AOB:

Wednesday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

Release report: deployment status wiki page

AOB:

  • Conference call problems: Apologies to those who had problems connecting. The "usual trick" of resetting the conference details seems to have worked.

  • (MariaDZ) News from Arvind Gopu (OSG) on Monday's issue:
We have assigned the appropriate GOC ticket (7654) to the GGUS ticket
52283. We apologize for the late response but the GGUS ticket appears to
have triggered a GOC ticket creation but somehow the GOC ticket never
finished. I am unable to find logs dating back to Oct 12th at this time,
so I can't debug any further.
Hopefully this winter, with a newer ticket exchange mechanism in place,
we will have fewer problems like this.

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

  • Most European countries return to Winter time on Sunday 25th October: http://www.timeanddate.com/time/dst2009.html.
    • The US and Canada "fall back" on Sunday 1 November.
    • Taiwan: "no DST in 2009".
  • These meetings continue at 15:00 Geneva time!
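
Because the call stays at 15:00 Geneva time while the clock changes fall on different dates, the local start time for remote participants shifts during the weeks around the change. The following is a minimal sketch, not part of the original minutes, that computes the local start time for a few zones. It assumes Python 3.9 or later with the standard `zoneinfo` module and system time zone data; the zone selection and the function name `local_call_times` are hypothetical and chosen only for illustration.

```python
from datetime import date, datetime, time
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Geneva follows the Europe/Zurich time zone rules.
GENEVA = ZoneInfo("Europe/Zurich")

# Hypothetical selection of remote zones, chosen for illustration only.
REMOTE_ZONES = {
    "UTC": ZoneInfo("UTC"),
    "US Eastern": ZoneInfo("America/New_York"),
    "Taipei": ZoneInfo("Asia/Taipei"),
}


def local_call_times(day):
    """Return {label: 'HH:MM'} for the 15:00 Geneva call on the given date."""
    start = datetime.combine(day, time(15, 0), tzinfo=GENEVA)
    return {
        label: start.astimezone(zone).strftime("%H:%M")
        for label, zone in REMOTE_ZONES.items()
    }


# Before the European change, between the two changes, and after the US change.
for day in (date(2009, 10, 23), date(2009, 10, 26), date(2009, 11, 2)):
    print(day.isoformat(), local_call_times(day))
```

Under these assumptions, the start time in the US Eastern zone moves one hour later only for the week between the two changes, while the Taipei time moves one hour later from 26 October onwards.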

-- JamieShiers - 14-Oct-2009

Topic revision: r8 - 2009-10-21 - JamieShiers
 