Week of 080922

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Daniele, Simone, Harry, Ewan, Sophie);remote(Gareth).

elog review:

Experiments round table:

ATLAS (SC): The ATLAS computing board has decided to no longer distribute cosmics data automatically to the Tier 1 sites, but only on specific request. There is currently a problem reading and writing data at SARA, where the SRM is not responding.

CMS (DB): T0 workflows: HLT key transmission from the StorageManager now seems to work fine. There were long-running repack jobs (more than 11 hours) on Saturday and long-running prompt reco jobs (more than 16 hours, rate ~8 kHz). The queues were used more than before (22 repack, 1 repack-merge and 206 prompt reco jobs altogether). The Lemon monitoring system showed 0% availability for some time, from 11:30 to 12:45 and from 13:10 to 14:10 ( http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000552091&email=mafd@mail.cern.ch ).

T1 workflows: T0-T1 transfers of cosmics RAW are 100% complete (1447 files, 3.86 TB) to CERN_MSS + CAF + T1_FNAL/FZK/IN2P3.

T2 workflows: there is a time-correlated pattern of failures of the BDII-visibility test at PIC and seven Tier 2 sites, as seen from the SiteStatusBoard ( http://lxarda16.cern.ch/dashboard/request.py/siteviewhistory?columnid=1 ). There is also a job proxy expiration issue, which caused JR jobs to fail at most T2 sites with errors such as: "Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)" and "Job got an error while in the CondorG queue. Job proxy is expired." This was already observed on Friday.

LHCb: Dummy MC production at the Tier 2s is going smoothly, apart from a problem at Dortmund where the workspace is cleaned up too soon. Stripping jobs at CERN are suffering, with 5000 jobs waiting and only 24 running; a ticket will be raised. PIC also shows a problem where it does not pick up pilot jobs even though it reports low queues. Sophie proposed to LHCb that they start migrating their LFC srm1 entries to srm2 entries in groups of 1000 (there are about 1.5 million entries). This was agreed (she will now prepare scripts). She also asked whether CERN should keep open the 5 srmv1 and 5 srmv2 endpoints, and there was a discussion on whether LFC activity should be reduced during the migration to avoid overloads. One idea was to redefine the srmv1 endpoints to point, in fact, to srmv2 services; another was to have more read-only replicas at the Tier 1s. Discussions will continue.
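As a rough illustration of the agreed batching approach (the actual migration scripts are still to be prepared, so the catalogue client and all names below are hypothetical), a minimal sketch could look like this:

    # Minimal sketch of a batched SRMv1 -> SRMv2 SURL migration for the LFC,
    # assuming a hypothetical catalogue client; the real scripts may differ.

    BATCH_SIZE = 1000  # entries migrated per batch, as agreed with LHCb

    def srm1_to_srm2(surl):
        # Illustrative rewrite rule only; the real endpoint mapping is site-specific.
        return surl.replace("/srm/managerv1", "/srm/managerv2", 1)

    def migrate(catalogue, pause):
        # Walk the ~1.5 million replica entries and update SURLs in batches,
        # pausing between batches to limit the load on the LFC.
        batch = []
        for surl in catalogue.list_replica_surls():
            if "/srm/managerv1" in surl:
                batch.append((surl, srm1_to_srm2(surl)))
            if len(batch) >= BATCH_SIZE:
                catalogue.update_replica_surls(batch)  # one bulk update per batch
                batch = []
                pause()                                # throttle between batches
        if batch:
            catalogue.update_replica_surls(batch)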

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Simone,Harry,Miguel Anjo);remote(Gareth, Gonzalo).

elog review:

Experiments round table:

ATLAS (SC): There is a failure writing to the CASTOR tape service class at CNAF; Simone will check for a GGUS ticket. ATLAS still see inefficiencies in FTS transfers from CERN to FZK; this is being debugged at both ends.

Sites round table:

PIC: suffered a failure of both the primary and backup network links from 23:30 last night due to a data centre incident in Madrid. Backup connectivity was restored at about 11:30 today. A post-mortem report will be made.

Core services (CERN) report:

DB services (CERN) report: ATLAS have asked the Tier 1 DBAs to urgently change a parameter in their Streams-replicated databases so that longer queries do not break. The CERN group prefer to wait until the right value to use is better understood and to schedule the changes. SC said ATLAS would be discussing their database access problems tomorrow and would probably make an official request on Thursday, so CERN-Data Management should wait for that.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday:

Attendance: local(Daniele,Simone,Jean-Philippe,Harry,Miguel A,Sophie);remote(Gareth,Michel).

elog review:

Experiments round table: All experiments will be rethinking their plans following the recently announced accelerator delay, with no beam until Spring 2009.

LHCb (RS): A large number of files have been lost at RAL on the CASTOR system. These files were in the lhcbDst (d1t0) class and were lost due to another Oracle bug, which caused a synchronisation problem. Further investigation by the CERN and RAL CASTOR teams is ongoing, and LHCb will be given a list of the lost files as soon as it is available. Gareth (RAL) confirmed that investigations are ongoing.

CMS (DB): This is a CMS week and they are stopping shifts for a couple of days to think about how to proceed before restarting them. They would like to stay in operations mode and maintain some computing load. No particular problems at the moment.

ATLAS: They will have a meeting tomorrow to decide on scaling down computing operations shifts, probably going from two shifts per day, seven days a week, to one per day on a five working-day week. They will continue functional tests and try to keep the infrastructure running as it is now. They expect to complete the automation of functional tests of the production system next week and are preparing such tests for analysis work. They will keep taking cosmics until at least the end of October, since all detector elements are in place, and may then open the detector if this becomes scheduled.

Sites round table:

CERN: a press release was sent out last evening announcing that there would be no further LHC beam until Spring 2009.

Core services (CERN) report:

DB services (CERN) report: BNL had to restart their ATLAS conditions database at midday (CEST) today. CNAF also plan to restart their ATLAS cluster at 15.00 CEST today while ASGC are restarting at 01.00 CEST to correct a character set mis-installation.

Monitoring / dashboard report:

Release update:

AOB:

Thursday:

Attendance: local(Harry, Ewan, Jean-Philippe, Gavin, Daniele, Simone, Miguel A);remote(Gareth, Gonzalo).

elog review:

Experiments round table:

CMS (DB): They will restart the operational environment as before, with global cosmics runs on Wednesday and Thursday. There will be runs with the magnetic field on. Shifts will be redesigned to match.

ATLAS (SC): Also going back to the previous mode, with detector tuning and 10% functional tests during the week and cosmics at the weekends. Thinking about performing a reprocessing exercise to test prestaging and database access at the Tier 0 and the Tier 1s.
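For context, a prestaging test asks the SRM to recall files from tape to disk ahead of the jobs that will read them. A minimal sketch of such a test, using a hypothetical SRM client wrapper rather than the actual ATLAS machinery, could be:

    # Sketch of a tape prestaging test against a hypothetical srm_client wrapper.
    import time

    def prestage(srm_client, surls, poll_interval=60):
        # Ask the SRM to bring the files online (cf. srmBringOnline in SRMv2),
        # then poll until every file is staged on disk.
        request = srm_client.bring_online(surls)
        pending = set(surls)
        while pending:
            time.sleep(poll_interval)
            for surl in list(pending):
                if srm_client.is_online(request, surl):
                    pending.discard(surl)
        return request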

Sites round table:

PIC (GM): Yesterday's primary network link failure was due to a major incident in a Madrid data centre. A post-mortem analysis will be prepared.

Core services (CERN) report (GM): SLC4 FTS services have been installed at CERN and final tests are being made before making them available to experiments.

DB services (CERN) report (MA): Recent Oracle patches have been applied on the validation databases and will now be scheduled for the production databases over the next two weeks. The Streams restarts at BNL and CNAF worked correctly. BNL also upgraded to Oracle 10.2.0.4, while CNAF stayed at 10.2.0.3 and are reporting some problems. We believe the LHCb data loss at RAL reported yesterday is due to an Oracle bug and we have advised RAL of two patches they could deploy.

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local(Simone, Miguel Anjo, Jean Philippe, Patricia, Ewan Roche);remote(Luca-CNAF, Gonzalo-PIC, Smith-RAL).

elog review:

Experiments round table:

LHCb: Report by email from Roberto:

  • CNAF CASTOR (after the downtime) still looks to be in bad shape. The unscheduled downtime has now been extended.

  • All interactions with GRIDKA SRM are completely unsuccessful. Transfers with FTS are failing, removal of files is failing and obtaining metadata is timing out.

  • All transfers from disk at RAL are failing with errors: SOURCE error during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 300 seconds

  • PIC recovered the faulty tape files that were preventing a stripping production from being finalized.

ALICE: in the last two days there has been a problem with the ALICE central services. Production stopped and is just ramping up again now. There have been authorization issues using the WMS at RAL.

ATLAS: getting the reprocessing working is the main focus now. Prestaging tests were started this morning at CERN; RAL will follow in the afternoon, and CNAF once it has recovered. The Conditions Database task force will meet today and present a work plan next week.

Sites round table:

CNAF: access via RFIO works OK, but transfers via SRM fail in the PutDone step. This is true for both SRM1 and SRM2 endpoints. The current suspicion is that the problem lies in an SQL procedure.
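For context, PutDone is the final call in an SRMv2 write sequence, issued after the data has been transferred; the sketch below (with a hypothetical client wrapper, not the actual FTS or CASTOR code) shows where it fits.

    # Where PutDone fits in an SRM write, using a hypothetical srm_client wrapper.
    def srm_write(srm_client, local_file, target_surl):
        # 1. Ask the SRM to prepare the target and return a transfer URL (TURL).
        request = srm_client.prepare_to_put([target_surl])
        turl = srm_client.wait_for_turl(request, target_surl)

        # 2. Transfer the data to the TURL (e.g. over GridFTP).
        srm_client.copy(local_file, turl)

        # 3. PutDone tells the SRM the write is finished so it can finalise the
        #    file in the name space; this is the step reported to fail at CNAF,
        #    apparently inside an SQL procedure on the backend database.
        srm_client.put_done(request, [target_surl])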

RAL: there is a synchronization problem between the name space and the contents of the pool. Waiting for a callback from Oracle.

Core services (CERN) report:

DB services (CERN) report: The CMS online database was not reachable from last night until lunchtime today due to a network problem, now solved.

Monitoring / dashboard report:

Release update:

AOB:
