Week of 080818

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Simone, Jean-Philippe, Harry, Luca, Ricardo);remote(Gonzalo, Derek, Daniele, Michael, Jeff).

elog review:

Experiments round table:

ATLAS (SC): Had 3 rather critical problems over the weekend. 1) Many CASTOR errors when trying to export reprocessed data because the data were on tape, not disk. These data were garbage collected before older data, whereas we thought GC was a FIFO operation; we will follow up with CASTOR operations. Despite this, all cosmics were exported (except to RAL), though at low efficiency, and DAQ ran into CASTOR at 750 MB/s in 12-hour slots. 2) Affecting all Tier 1 sites: during the weekend the dataset type name embedded at the beginning of each file changed from data08_cos to data08_cosmag without the advance warning that would have allowed Tier 1 sites to switch to a new storage directory mapping. We will follow up this poor communication with the ATLAS management. 3) RAL had several problems over the weekend. On Saturday their SRM Oracle database was giving errors, then on Sunday they were not accepting FTS data from CERN although they were still exporting. We found the CERN channel set to 0% and it would have been good to be told that this had been done. Today RAL is in an unscheduled downtime. D.Ross explained that their CASTOR was down from 07.00 to 09.00 on Saturday; then on Sunday they had first an LSF disk full (which stops staging), then a database disk full, which led to the unscheduled downtime. The expert called out set the CERN-ATLAS FTS channel share to zero as it was the only channel active. H.Renshall said we should follow up on how sites could indicate such a configuration change (e.g. via a site status Twiki).
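For illustration only, a minimal sketch of the first-in, first-out garbage-collection behaviour ATLAS expected from the pool; the names and sizes below are made up and this is not the actual CASTOR implementation:

```python
# Minimal sketch of a FIFO (oldest-first) garbage-collection policy,
# i.e. the behaviour ATLAS expected; illustrative only, not CASTOR code.
from collections import namedtuple

PoolFile = namedtuple("PoolFile", ["name", "created", "size"])  # hypothetical record

def fifo_gc_candidates(files, bytes_needed):
    """Pick deletion candidates strictly oldest-first until enough space is freed."""
    freed, candidates = 0, []
    for f in sorted(files, key=lambda f: f.created):  # oldest creation time first
        if freed >= bytes_needed:
            break
        candidates.append(f)
        freed += f.size
    return candidates

# Under a strict FIFO policy, recently written reprocessed data should never
# be selected for deletion before older cosmics files.
files = [
    PoolFile("older_cosmics.raw", created=1, size=100),
    PoolFile("recent_reprocessed.AOD", created=9, size=100),
]
print([f.name for f in fifo_gc_candidates(files, bytes_needed=100)])  # ['older_cosmics.raw']
```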

CMS (DB): CMS have started the CRUZET-4 run (cosmics but with the magnet on) with only a couple of subdetectors included so far. They have a dataops shift in place concentrating on Tier 0 workflows. They have daily meetings at 16.00 and will use the CCRC08 elog for general (unstructured) observations and their CRUZET-3 elog for more detailed reports. H.Renshall invited them to report relevant observations to these minutes.

Sites round table: Jeff (NL-T1) reported they are at risk today and tomorrow while they change network routers. The change should be transparent to applications unless a cable switch-over exceeds the TCP timeout.
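As an aside, a minimal sketch of the client-side view of "exceeds the TCP timeout": a long-lived connection survives a brief interruption unless the socket timeout or keepalive probes give up first. The host, port and values below are assumptions for illustration, not the NL-T1 or experiment configuration:

```python
# Illustrative only: enable TCP keepalive on a long-lived transfer connection
# so a dead peer is detected predictably after the configured probe interval.
# Host, port and timeout values are assumptions.
import socket

sock = socket.create_connection(("storage.example.org", 2811), timeout=30)

sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # turn keepalive on
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle time before first probe (s)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # interval between probes (s)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before declaring failure
```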

Core services (CERN) report:

DB services (CERN) report: - The apply process at ATLAS OFFLINE was aborted on Friday afternoon when trying to replicate the statements to drop the tables from one schema. The problem is a known bug, reproduced on ATLAS after setting up a new parallel Streams configuration between the ONLINE and OFFLINE databases to replicate the PVSS schemas. The bug is assigned to Oracle development but progress is very slow. The workaround is to set up schema rules at the apply side; this change will be implemented this week.
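For orientation, a hedged sketch of adding schema-level rules on the apply side of an Oracle Streams setup, the workaround mentioned above. DBMS_STREAMS_ADM.ADD_SCHEMA_RULES is the standard Oracle API for this; the connection details, schema, stream and queue names are assumptions, not the actual CERN configuration:

```python
# Hedged sketch (not the actual CERN procedure): add schema rules on the
# APPLY side of a Streams setup. All names below are illustrative assumptions.
import cx_Oracle

conn = cx_Oracle.connect("strmadmin", "password", "atlas-offline/ATLR")
cur = conn.cursor()

cur.execute("""
    BEGIN
      DBMS_STREAMS_ADM.ADD_SCHEMA_RULES(
        schema_name     => 'PVSS',                  -- assumed schema
        streams_type    => 'apply',                 -- rules on the apply side
        streams_name    => 'APPLY_PVSS',            -- assumed apply process name
        queue_name      => 'STRMADMIN.APPLY_QUEUE', -- assumed queue
        include_dml     => TRUE,
        include_ddl     => TRUE,
        source_database => 'ATONR');                -- assumed source (online) DB
    END;
""")
conn.close()
```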

- The corruption found on the ATLAS online server is not affecting services, but an intervention is scheduled on Wednesday from 14:00 till 14:45 to fix the issue via a switch to a new database using Oracle Data Guard/standby technology.
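For context, a hedged sketch of what such a Data Guard switchover involves at the SQL level; the DSNs, credentials and exact sequence below are assumptions for illustration, not the actual intervention plan:

```python
# Illustrative Data Guard switchover sequence; connection details and the
# precise procedure are assumptions. SYSDBA privileges are required.
import cx_Oracle

def run_as_sysdba(dsn, statements):
    conn = cx_Oracle.connect("sys", "password", dsn, mode=cx_Oracle.SYSDBA)
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)
    conn.close()

# 1. Demote the current (corrupted) primary to a physical standby.
run_as_sysdba("atonr-primary/ATONR",
    ["ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN"])

# 2. Promote the clean standby to primary and open it for the experiment.
run_as_sysdba("atonr-standby/ATONR",
    ["ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY",
     "ALTER DATABASE OPEN"])
```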

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Andrea, Simone, Roberto, Patricia, Ricardo, Gavin, Luca, Harry, Julia, Nick, Jean-Philippe);remote(Jeremy, Michel, Ricardo).

elog review: Related to the CMS report

Experiments round table:

  • CMS: 1) CRUZET-4 started; according to the CMS shift summary, one run with no events was found. This issue, already observed in the past, is still to be understood. 2) Transfers T0-T1 via PhEDEx are working smoothly with a transfer rate over 600 MB/s. 3) Regarding the CERN-FNAL transfers, an attempt will be made to solve the observed problems by using the CERN FTS server instead of the server at the Tier 1. 4) The list of T2 sites which will enter the next production round is still under discussion.
  • ATLAS: Follow-up of the report presented yesterday. The problem observed with CASTOR@T0 during the weekend seems to be a combination of several issues, including a bug in the algorithm responsible for the file systems. The problem has been solved and a new patch will probably be applied tomorrow (transparently for the experiment). In addition, 100 TB will be requested for the ATLAS pool as soon as Kors comes back next week. Finally, the garbage collection setup of the default pool will be simplified and changed to a basically first-in, first-out policy. A few problems were also observed today, all of them notified to the corresponding responsibles: CASTOR@CNAF is showing 100% failures, the site in Michigan is unreachable, and Oracle problems observed at RAL are being checked by Jeremy.
  • LHCB: Issues (most probably caused by local network interventions) observed at SARA have been reported via a GGUS ticket. Although the reason has not been fully clarified, the ticket was closed as soon as the problem disappeared. Several problems were observed at PIC, GridKa and also a few Italian sites while installing the new software in the corresponding software area. The experiment recommends the use of static accounts instead of pool accounts, but in any case this is still an open (and quite old and well-known) issue. Finally, the upgrade of Dirac2 to include the SRMv2 clients has been announced.
  • ALICE: The experiment continues with the MC production. An issue has been observed with the CREAM CE deployed at GridKa for ALICE testing: basically the std.err and std.out files currently cannot be retrieved to the VOBOX. The issue will be followed up with the developers and the site managers, together with the experiment software experts.

Sites round table: Oracle issue at RAL being followed by Jeremy. Michel asked ATLAS about the elog entry reported during the weekend.

Core services (CERN) report: Nothing to report

DB services (CERN) report: Streams replication for LHCb has finished.

Monitoring / dashboard report: The development performed for CMS to calculate site availability is now also in place for all VOs. The calculation itself is not yet included, but it is already possible to visualize some results. VOs are asked to check the links and provide feedback.

Release update: 1) The gLite upgrade is already available, although not yet included in the production repository. Sites which have already migrated do not require any further action; sites which have not yet migrated should wait for the corresponding announcement. The reason is that the Oracle testing has not yet finished. 2) Regarding the FTS service for SLC4, feedback from the experiments and sites is required. The issue will be discussed at the next weekly operations meeting.

AOB: Nothing reported

Wednesday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Thursday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
