Week of 080908

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Andrea, Daniele, Roberto, Jean-Philippe, Julia, Jan, Miguel, Maria);remote(Derek, Jeremy, Michael).

elog review: #460

Experiments round table:

  • CMS (Daniele): cosmics still rolling from Cessy to Tier-0 and Tier-1 sites; data corruption in some disk servers, temporarily bypassed by writing on the local disks of the machines running the Storage Manager; effort to strengthen the processes and the communications involving online and offline (see #460).
  • LHCb (Roberto): both the past and the current week are generally quiet, but there are a few problems:
    • a bug was found in the LHCb SAM "test" which installs LHCb software on sites, causing corruption of the software area; the problem will be fixed soon, but expect some critical test failures leading to low availability at affected sites;
    • MC production: stripping jobs are failing due to crashes of the DaVinci software, which are being investigated;
    • "CCRC-like" production: test jobs were sent to Tier-1 sites, succeeding everywhere except at NIKHEF. Several production jobs are hanging at CERN, which Miguel explains is due to their trying to open too many connections.

Sites round table: no report

Core services (CERN) report:

  • Miguel reports a new bug in CASTOR, affecting all instances, which might cause data corruption in a daemon (not yet observed). The fix has already been deployed in the public instance. For the experiment instances, it has been agreed to deploy the fix tomorrow morning (it should be totally transparent) and to send out a description of the bug and the intervention today.

DB services (CERN) report:

  • Maria reports a rolling intervention (kernel upgrade) today on the PDBR (also affecting ALICE) and the CMSR (the CMS offline database). Tomorrow the upgrade will be performed on the WLCG database server and the ATLDSC and LHCBDSC downstream databases.

Monitoring / dashboard report: no report

Release update: no report

AOB: none

Tuesday:

Attendance: local(Daniele, Simone, Roberto, Patricia, Jean Philippe, Miguel Dos Santos, Maria, Julia, Gavin);remote(Michael, Derek).

Experiments round table:

  • CMS (Daniele): Transparent intervention at CASTOR@CERN. Yesterday's problems:
    1. The T0 monitoring system caused a problem at a VOBOX.
    2. From some time in the afternoon there was a problem in the SLS monitoring for the LSF queues. A ticket has been submitted and the problem has been solved, but CMS would like clarifications.
  • LHCb (Roberto): running stripping activities. Yesterday there were many problems with jobs failing (the cause was the Dirac configuration services being down). CCRC08-like exercise: sending test jobs to check CondDB access and direct POSIX access to storage (CNAF and PIC are OK, CERN is in pending status, all other sites fail). Dummy Monte Carlo production: a problem in the Dirac "optimizer" is preventing jobs from being submitted to T2 and T3 sites only.
  • ALICE (Patricia): a new Alien version is being deployed at every site. Sites have been informed; they can carry out the upgrade themselves and should do so. If needed, ALICE central operations can help.
  • ATLAS (Simone): some problems exporting data from CERN to T1s, due to missing files in CASTOR. See Miguel's report below.

Sites round table:

  • BNL (Michael):
    1. Licence problem with the load balancer in front of the Panda server.
    2. The network provider would like to run an intrusive test and bring down the primary CERN-BNL link. BNL feels very uneasy about this, especially since the timescale has been neither announced nor discussed with BNL.

Core services (CERN) report:

  • Miguel: today's transparent upgrade to CASTOR went ahead with no problems observed. A network problem yesterday affected the CASTOR instance, with degradation between 7:00 and 10:30.

DB services (CERN) report:

  • Completed the Update7 RH4 upgrade. A problem at SARA is degrading streams for ATLAS. It might also affect LHCb, since this is a single instance of the DB.
  • Daniele: yesterday the shifter observed some impact of the upgrade on the online as well as the offline. Daniele will send details to Maria.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

LHC First Beam day! - http://lhc-first-beam.web.cern.ch/lhc-first-beam/Welcome.html

Attendance: local(Julia, Steve, Harry, Oliver, Maria, Michel, Gavin, Roberto);remote(Derek, Michael, Simone, Kors and many from ATLAS distributed computing).

elog review:

Experiments round table:

ATLAS (SC): First beam-gas data are being taken, but in one long run, since it takes an hour to start a new run. The plan is to close the run tonight, after which distribution and processing can start. It will be a large but not enormous data set. All Tier 1 sites will receive a complete copy. Simone reported back on a question raised yesterday by M. Ernst after some 20 files in CASTOR were found to be corrupted (zero length) before migration to tape. ATLAS have a handshake procedure in which files are not deleted from the SFO until they have been checked as migrated to tape, and if needed they can be resent from the SFO. This mechanism worked yesterday and the correct raw data files were resent.
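As an illustration only, the handshake described above could be sketched as follows. This is not the ATLAS implementation: the function name `handshake_pass`, the dictionaries standing in for the SFO buffer and the CASTOR disk pool, and the byte-for-byte comparison used as the integrity check are all assumptions made for the example.

```python
def handshake_pass(sfo_files, castor_files, on_tape):
    """Run one pass of a hypothetical SFO/CASTOR handshake.

    sfo_files    -- dict name -> bytes, files still held on the SFO (assumed model)
    castor_files -- dict name -> bytes, copies on the CASTOR disk pool (assumed model)
    on_tape      -- set of names confirmed as migrated to tape

    Returns (deleted, resent): names removed from the SFO and names resent.
    """
    deleted, resent = [], []
    for name, data in list(sfo_files.items()):
        if name in on_tape:
            # Only a confirmed tape copy allows deletion from the SFO.
            del sfo_files[name]
            deleted.append(name)
        elif castor_files.get(name) != data:
            # Missing or corrupted copy (e.g. zero length): resend from the SFO.
            # A real check would more likely compare sizes or checksums.
            castor_files[name] = data
            resent.append(name)
        # Otherwise the copy looks good but is not yet on tape: keep waiting.
    return deleted, resent
```

The point of the design is that the SFO copy remains the source of truth until the tape copy is confirmed, so a corrupted intermediate copy can always be replaced.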

LHCb: running stripping production from old DC06 (prod. 3030) and testing stripping for the CCRC workflow with and without access to CondDB. Also commissioned a fake MC simulation production to keep its T2/T3 centres warm and running. The following breakdown of T0 and T1 activities is from elog http://lblogbook.cern.ch/Operations/541

CERN : a lot of jobs are still staging, but should be OK; yesterday more than 200 jobs completed successfully while fewer than 30 failed (DV seg fault).

CNAF : OK, 1% of the stripping jobs (3030) failed. Passed reconstruction tests (with and without CondDB, prod 3008/3009)

GRIDKA : 2/3 of the stripping jobs completed successfully; passed reconstruction test without CondDB, and 1 (of 5) jobs passed it with CondDB (elog 526)

NIKHEF : Problem in the reconstruction job tests (elog 524-532-540). Still a gsidcap problem. xroots does not help much. A new dCache version would fix all instabilities

IN2P3 : OK, 50 jobs (13%) failed in the stripping prod (DV seg fault, elog 535, 537); passed rec test without CondDB (3009), but failed with CondDB (3008)

PIC : OK, 2 stripping jobs completed successfully (more than 400 jobs submitted, elog 539); passed rec test with and without CondDB

RAL : passed the rec test only without CondDB

Sites round table:

CERN: First complete turn clockwise at 10:26 and anti-clockwise at 15:02. Many congratulations to the LHC teams.

Core services (CERN) report: There was a 10-minute interruption of the VOMRS service this morning.

DB services (CERN) report: The current ATLAS prestaging/reconstruction stress test using FDR and cosmics data caused an overload of the ATLAS-COOL database at NIKHEF yesterday, with ATLAS jobs being diverted to FZK. Since resources are shared with LHCb, their service was also impacted. The situation is stable now, but this level of load will reflect future needs.

Monitoring / dashboard report:

Release update: No releases today.

AOB: No meeting tomorrow due to CERN holiday.

Thursday

"Jeune Genevois" - CERN closed

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Topic revision: r6 - 2008-09-10 - HarryRenshall
 