Week of 080908

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Andrea, Daniele, Roberto, Jean-Philippe, Julia, Jan, Miguel, Maria); remote(Derek, Jeremy, Michael).

elog review: #460

Experiments round table:

  • CMS (Daniele): cosmics are still rolling from Cessy to the Tier-0 and Tier-1 sites; data corruption in some disk servers was temporarily bypassed by writing to the local disks of the machines running the Storage Manager; an effort is under way to strengthen the processes and communications between online and offline (see #460).
  • LHCb (Roberto): both past and current week are generally quiet, but there are a few problems:
    • found a bug in the LHCb SAM "test" that installs LHCb software on sites, which causes corruption of the software area; the problem will be fixed soon, but expect some critical test failures leading to low availabilities at affected sites;
    • MC production: stripping jobs are failing due to crashes of the DaVinci software, which are being investigated;
    • "CCRC-like" production: test jobs were sent to Tier-1 sites, succeeding everywhere but at NIKHEF. Several production jobs hanging at CERN, which Miguel explains being due to them trying to open too many connections.

Sites round table: no report

Core services (CERN) report:

  • Miguel reports a new bug in CASTOR, affecting all instances, which might cause data corruption in a daemon (not yet observed). The fix has already been deployed on the public instance. For the experiment instances, it has been agreed to deploy the fix tomorrow morning (it should be completely transparent) and to send out a description of the bug and of the intervention today.

DB services (CERN) report:

  • Maria reports a rolling intervention (kernel upgrade) today on the PDBR (which also affects ALICE) and on the CMSR (the CMS offline database). Tomorrow the upgrade will be performed on the WLCG database server and on the ATLDSC and LHCBDSC downstream databases.

Monitoring / dashboard report: no report

Release update: no report

AOB: none

Tuesday:

Attendance: local(Daniele, Simone, Roberto, Patricia, Jean-Philippe, Miguel Dos Santos, Maria, Julia, Gavin); remote(Michael, Derek).

Experiments round table:

  • CMS (Daniele): transparent intervention at CASTOR@CERN. Yesterday's problems:
    1. A problem with the T0 monitoring system at a VOBOX.
    2. From some time in the afternoon, a problem with the SLS monitoring of the LSF queues. A ticket was submitted and the problem has been solved, but CMS would like some clarifications.
  • LHCb (Roberto): running stripping activities. Yesterday many jobs failed (the cause was the Dirac configuration service being down). CCRC08-like exercise: sending test jobs to check CondDB access and direct POSIX access to storage (CNAF and PIC are OK, CERN is pending, all other sites fail). Dummy Monte Carlo production: a problem with the Dirac "optimizer" prevented submitting jobs to T2 and T3 sites.
  • ALICE (Patricia): a new AliEn version is being deployed at every site. Sites have been informed and should perform the upgrade themselves; if needed, ALICE central operations can help.
  • ATLAS (Simone): some problems exporting data from CERN to T1s, due to missing files in CASTOR. See Miguel's report below.

Sites round table:

  • BNL (Michael):
    1. A licence problem with the load balancer in front of the Panda server.
    2. The network provider would like to run an intrusive test and bring down the primary CERN-BNL link. BNL is very uneasy about this, especially since the timescale has been neither announced nor discussed with BNL.

Core services (CERN) report:

  • Miguel: today's transparent upgrade of CASTOR went through with no problems observed. A network problem yesterday affected a CASTOR instance, with degradation between 7:00 and 10:30.

DB services (CERN) report:

  • Completed the Update 7 upgrade on RH4. A problem at SARA is causing degradation of the Streams replication for ATLAS; it might also affect LHCb since this is a single database instance.
  • Daniele: the shifter yesterday observed some impact of the upgrade not only on the offline but also on the online database. Daniele will send details to Maria.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

LHC First Beam day! - http://lhc-first-beam.web.cern.ch/lhc-first-beam/Welcome.html

Attendance: local(Julia, Steve, Harry, Oliver, Maria, Michel, Gavin, Roberto); remote(Derek, Michael, Simone, Kors and many from ATLAS distributed computing).

elog review:

Experiments round table:

ATLAS (SC): First beam-gas data are being taken, but in one long run since it takes an hour to start a new run. The plan is to close the run tonight, after which distribution and processing can start. It will be a large but not enormous data set; all Tier-1s will receive a complete copy. Simone reported back on a question raised yesterday by M.Ernst after some 20 files in CASTOR were found to be corrupted (zero length) before migration to tape. ATLAS has a handshake procedure whereby files are not deleted from the SFO until they are confirmed to have been migrated to tape, so that if needed they can be resent from the SFO (a minimal sketch of this check-before-delete logic is given below). This mechanism worked yesterday and the correct raw data files were resent.
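
As an illustration only (this is not ATLAS's actual SFO code), the handshake amounts to a "verify before delete" check: a raw data file stays on the SFO until the Tier-0 copy is confirmed to be on tape with the expected, non-zero size, and is resent otherwise. The helper callables is_on_tape, tape_size and resend in the sketch below are hypothetical placeholders for whatever migration-status query and resend mechanism the real system uses.

    import os

    def handshake_and_clean(sfo_path, castor_path, is_on_tape, tape_size, resend):
        """Delete the SFO copy only once the Tier-0 copy is confirmed good.

        is_on_tape(castor_path) -> bool, tape_size(castor_path) -> int and
        resend(sfo_path, castor_path) are hypothetical callables standing in
        for the real migration-status query and the SFO resend mechanism.
        """
        local_size = os.path.getsize(sfo_path)
        if not is_on_tape(castor_path):
            return False                     # not on tape yet: keep the SFO copy
        remote_size = tape_size(castor_path)
        if remote_size == 0 or remote_size != local_size:
            resend(sfo_path, castor_path)    # zero-length or truncated copy: resend it
            return False
        os.remove(sfo_path)                  # migration verified: free the SFO buffer
        return True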

LHCb: running the stripping production from old DC06 data (prod. 3030) and testing stripping for the CCRC workflow with and without access to the CondDB. Also commissioned a fake MC simulation production to keep its T2/T3 centres warm and running. A breakdown of the T0 and T1 activities from the elog (http://lblogbook.cern.ch/Operations/541) follows:

CERN: a lot of jobs are still staging but should be OK; yesterday more than 200 jobs completed successfully while fewer than 30 failed (DaVinci seg fault).

CNAF: OK; 1% of the stripping jobs (3030) failed. Passed the reconstruction tests both with and without the condDB (prods 3008/3009).

GRIDKA: 2/3 of the stripping jobs completed successfully; passed the reconstruction test without the condDB, and 1 (of 5) jobs passed it with the condDB (elog 526).

NIKHEF: problems in the reconstruction job tests (elog 524, 532, 540). Still a gsidcap problem; xrootd does not help much. A new dCache version should fix all the instabilities.

IN2P3: OK; 50 jobs (13%) failed in the stripping production (DaVinci seg fault, elog 535, 537). Passed the reconstruction test without the condDB (3009) but failed with the condDB (3008).

PIC: OK; 2 stripping jobs completed successfully (more than 400 jobs submitted, elog 539). Passed the reconstruction test both with and without the condDB.

RAL: passed the reconstruction test only without the condDB.

Sites round table:

CERN: First complete turn clockwise at 10:26 and anti-clockwise at 15:02. Many congratulations to the LHC teams.

Core services (CERN) report: There was a 10 minute interruption of the VOMRS service this morning.

DB services (CERN) report: The current ATLAS prestaging/reconstruction stress test using FDR and cosmics data caused an overload of the ATLAS-COOL database at NIKHEF yesterday with ATLAS jobs being diverted to FZK. Since resources are shared with LHCb their service was also impacted. The situation is stable now but the level of load will reflect future needs.

Monitoring / dashboard report:

Release update: No releases today.

AOB: No meeting tomorrow due to CERN holiday.

Thursday

"Jeune Genevois" - CERN closed smile

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local(Simone, Roberto, Maria, Jan, Harry, Olof, Daniele); remote(Derek, Michael).

elog review:

Experiments round table:

CMS (DB): There have been recovery problems with the cmsmon machine following a security issue yesterday; it has been reconfigured for secure http. First beam data have been sent to the CERN MSS, the CAF, IN2P3 and FNAL, while cosmics are going to the CERN MSS and the CAF; however, this distribution is very dynamic. There are some Tier-1 issues at ASGC and PIC (now fixed), and a high load on the DPM server at the Bristol Tier 2 is under investigation.

ATLAS (SC): Data from the first beams day have been shipped, exceptionally, to all Tier-1s and the CAF. Three issues: 1) In the current ATLAS model a data-set is only created, and hence its files can only be exported, at the end of a run. Since stopping and starting a run takes an hour, the first beam day was run as a single run, so no export could start until the evening; a better solution is needed. 2) After export started there were problems exporting a few particular files and a GGUS alarm ticket was submitted. The handling of the ticket did not work as it should have, so there is some follow-up to be done; see https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20080910 . The problem was in fact that users were also accessing these data from the default CASTOR ATLAS pool, which is the first one checked for export; all its access slots were taken, so the export request waited in a queue. 3) The migration of user data from the atldata pool to the CAF using FTS overloaded the servers and datasets started piling up; some tuning of the CERN-CERN FTS parameters is needed.

LHCb (RS): A few hundred jobs failed at PIC with a staging timeout after 24 hours. Testing of the CCRC-like activity continues across all Tier-1s with and without conditions DB access. All tests without such access are running well, but tests with access are failing at IN2P3 and RAL. Today they will start some dummy MC simulation to keep the small Tier-2 and Tier-3 sites active.

Sites round table:

BNL (ME): The primary OPN link failed last night around 22.00 EDT when a fibre bundle was cut on Long Island. A manual failover to the secondary link was performed successfully; it worked perfectly until around 05.00 EDT, then started giving transfer timeouts on data connections to most Tier-1 sites (but not to CERN). The primary circuit has now been restored and there will be a conference call with CERN to discuss how to automate such failovers.

RAL (DR): Had LSF overload issues that were causing problems to ATLAS. They scaled down the load to restore stability.

Core services (CERN) report: Following the handling of the ATLAS GGUS operator alarm ticket on Wednesday we are following up on the workflow. There was, for example, a wrong Simba permission which stopped an SMS being sent.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: LHCb reported that the CASTOR information provider at RAL gives different information than the providers at other sites. Later follow-up (with F.Donno) was that up to now sites have been independent but a common provider is being prepared.
