Week of 080818

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15:00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Simone, Jean-Philippe, Harry, Luca, Ricardo);remote(Gonzalo, Derek, Daniele, Michael, Jeff).

elog review:

Experiments round table:

ATLAS (SC): Three rather critical problems occurred over the weekend. 1) Many CASTOR errors when trying to export reprocessed data because the data were on tape, not on disk: these data had been garbage collected before older data, whereas we had assumed GC was a FIFO operation. We will follow up with CASTOR operations. Despite this, all cosmics were exported (except to RAL), though at low efficiency, and the DAQ wrote into CASTOR at 750 MB/s in 12-hour slots. 2) Affecting all of the Tier 1 sites: during the weekend the dataset type name embedded at the beginning of each file changed from data08_cos to data08_cosmag, without the advance warning that would have allowed Tier 1 sites to switch to a new storage directory mapping. We will follow up this poor communication with ATLAS management. 3) RAL had several problems over the weekend. On Saturday their SRM Oracle database was giving errors; on Sunday they were not accepting FTS data from CERN although they were still exporting. We found the CERN channel share set to 0%, and it would have been good to be told that this had been done. Today RAL is in an unscheduled downtime. D. Ross explained that their CASTOR was down from 07:00 to 09:00 on Saturday; on Sunday they first had an LSF disk full (which stops staging) and then a database disk full, which led to the unscheduled downtime. The expert called out set the CERN-ATLAS FTS channel share to zero as it was the only one active. H. Renshall said we should follow up on how sites could indicate such a configuration change (e.g. via a site status Twiki).
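As an aside on the garbage-collection point above, the difference between the assumed first-in-first-out behaviour and an access-based policy can be shown with a short sketch. This is purely illustrative Python, not CASTOR's actual garbage-collection code; the file records, timestamps and names are invented for the example.

```python
# Illustrative only: contrasts a FIFO garbage-collection order with an
# access-based (LRU-style) order on a disk pool. File records are invented.
from collections import namedtuple

PoolFile = namedtuple("PoolFile", "name created last_access")

files = [
    PoolFile("data08_cos.0001.RAW",    created=100, last_access=500),
    PoolFile("data08_cos.0002.RAW",    created=200, last_access=150),  # written later, read long ago
    PoolFile("data08_cosmag.0001.RAW", created=300, last_access=400),
]

def fifo_order(pool):
    """Assumed policy: the oldest written file is the first GC candidate."""
    return sorted(pool, key=lambda f: f.created)

def lru_order(pool):
    """Access-weighted policy: the least recently read file goes first, so a
    more recently written but unread file can be collected before older ones."""
    return sorted(pool, key=lambda f: f.last_access)

print([f.name for f in fifo_order(files)])  # 0001, 0002, cosmag (by creation time)
print([f.name for f in lru_order(files)])   # 0002 first, despite being newer
```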

CMS (DB): CMS have started the CRUZET-4 run (cosmics, but with the magnet on) with a couple of subdetectors included so far. They have a dataops shift in place concentrating on Tier 0 workflows. They hold daily meetings at 16:00 and will use the CCRC08 elog for general (unstructured) observations and their CRUZET-3 elog for more detailed reports. H. Renshall invited them to add relevant observations to these minutes.

Sites round table: Jeff (NL-T1) reported they are at risk today and tomorrow while they change network routers. The change should be transparent to applications unless a cable switch-over exceeds the TCP timeout.
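As an illustration of the TCP-timeout caveat above, a client can bound how long a silent path outage goes unnoticed by enabling TCP keepalive on its sockets. This is a generic Linux/Python sketch, not part of any WLCG service; the timing values are invented for the example.

```python
# Generic sketch (Linux): enable TCP keepalive so an idle connection notices a
# dead network path within roughly idle + interval * count seconds.
# The values below are invented for illustration, not a recommended setting.
import socket

def keepalive_socket(idle=60, interval=10, count=3):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because they are not available everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before probing
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # failed probes before giving up
    return s

sock = keepalive_socket()
# A router switch-over shorter than idle + interval * count (~90 s here)
# passes unnoticed; a longer one drops the connection.
```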

Core services (CERN) report:

DB services (CERN) report: - The apply process at ATLAS OFFLINE aborted on Friday afternoon when trying to replicate the statements that drop the tables of one schema. The problem is a known bug, reproduced on ATLAS after setting up a new parallel Streams configuration between the ONLINE and OFFLINE databases to replicate the PVSS schemas. The bug is assigned to Oracle development but progress is very slow. The workaround is to set up schema rules at the apply side; this change will be implemented this week.

- The corruption found on the ATLAS online server is not affecting services, but an intervention is scheduled on Wednesday from 14:00 till 14:45 to fix the issue via a switch to a new database using Oracle Data Guard/standby technology.

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Andrea, Simone, Roberto, Patricia, Ricardo, Gavin, Luca, Harry, Julia, Nick, Jean-Philippe);remote(Jeremy, Michel, Ricardo).

elog review: Related to the CMS report

Experiments round table:

  • CMS: 1) CRUZET-4 has started; the CMS shift summary found one run with no events. This issue, already observed in the past, is still to be understood. 2) T0-T1 transfers via PhEDEx are working smoothly, with a transfer rate of over 600 MB/s. 3) Regarding the CERN-FNAL transfers, an attempt will be made to solve the observed problems by using the CERN FTS server instead of the server at the T1. 4) The list of T2 sites which will enter the next production is still under discussion.
  • ATLAS: Follow-up of yesterday's report. The problem observed with CASTOR at the T0 during the weekend seems to be a combination of several issues, among them a bug in the algorithm responsible for the file systems. The problem has been understood and a new patch will probably be applied tomorrow (transparently to the experiment). In addition, 100 TB will be requested for the ATLAS pool as soon as Kors comes back next week. Finally, the garbage-collection setup of the default pool will be simplified and changed to basically follow a first-in, first-out policy. A few problems were also observed today, all of them reported to the responsible parties: CASTOR@CNAF is showing 100% failures, a site in Michigan is unreachable, and Oracle problems observed at RAL are being checked by Jeremy.
  • LHCb: Issues (most probably coming from local network interventions) observed at SARA were reported via a GGUS ticket. Although the cause was never fully clarified, the ticket was closed as soon as the problem disappeared. Several problems were observed at PIC, GridKa and also a few Italian sites while installing new software in the corresponding software areas. The experiment recommends the use of static accounts instead of pool accounts, but in any case this remains an open (and quite old and well known) issue. Finally, the upgrade of DIRAC2 to include the SRMv2 clients has been announced.
  • ALICE: The experiment continues the MC production. There is an issue with the CREAM CE deployed at GridKa for ALICE testing: the stderr and stdout files currently cannot be retrieved into the VOBOX. The issue will be followed up with the developers and the site managers, together with the experiment software experts.

Sites round table: The Oracle issue at RAL is being followed by Jeremy. In addition, Jeremy reported a DPM problem at some of the UK sites when they changed the size of their space tokens; the bug number is 40273. Michel asked ATLAS about the elog entry reported during the weekend.

Core services (CERN) report: Nothing to report

DB services (CERN) report: Streams replication for LHCb has finished.

Monitoring / dashboard report: The development performed for CMS to calculate site availability is now also in place for all VOs. The calculation itself is not yet included, but it is already possible to visualize some results. VOs are asked to check the links and provide feedback.
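As a rough illustration of the kind of calculation mentioned above (and not the dashboard's actual algorithm), site availability can be taken as the fraction of time bins in which all critical tests passed. The test names and data below are invented for the example.

```python
# Illustrative only: a toy availability calculation, not the dashboard algorithm.
# A site counts as available in a time bin if every critical test passed.
from collections import defaultdict

# Invented sample rows: (site, time_bin, test_name, passed)
results = [
    ("T1_EXAMPLE", 0, "CE-sft-job", True),
    ("T1_EXAMPLE", 0, "SRM-put",    True),
    ("T1_EXAMPLE", 1, "CE-sft-job", True),
    ("T1_EXAMPLE", 1, "SRM-put",    False),  # one failed critical test -> bin unavailable
]

CRITICAL = {"CE-sft-job", "SRM-put"}

def availability(rows, site):
    bins = defaultdict(dict)
    for s, t, test, ok in rows:
        if s == site and test in CRITICAL:
            bins[t][test] = ok
    good = sum(1 for tests in bins.values()
               if CRITICAL <= tests.keys() and all(tests.values()))
    return good / len(bins) if bins else 0.0

print(availability(results, "T1_EXAMPLE"))  # 0.5: one of two bins fully passed
```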

Release update: 1) A gLite upgrade is already available, although not yet included in the production repository. Sites which have already migrated do not require any further action; sites which have not yet migrated should wait for the corresponding announcement. The reason is that the Oracle testing has not yet finished. 2) Regarding the FTS service for SLC4, feedback from the experiments and sites is required. The issue will be discussed at the next weekly operations meeting.

AOB: Nothing reported

Wednesday

Attendance: local(DanieleB, Harry, Miguel Anjo, Roberto, Patricia, Sophie, Olof, Jean-Philippe, Julia, Simone);remote(Derek, Gonzalo, Jeremy, MichaelE, Jeff).

elog review: CMS summary.

Experiments round table:

  • CMS: Tier-0 workflows: the first runs yesterday (CRUZET-4 day 2) were reported as successfully repacked and promptly reconstructed. Some later runs had missing files; the components responsible have been identified and are being fixed. Distributed data transfers: CRUZET-4 data transfers have started. T0-T1: overall quality looked fine all through the day, except for CNAF (note: scheduled downtime). T1-T1 traffic: nothing of relevance. T1 processing: FNAL was noted to be marked with a yellow warning on the SiteView (50% of analysis jobs successful); the failures come primarily from one user trying to write to existing files in an SE, not a site issue. All other T1 sites are marked green in all CMS-specific tests. Julia: do you use the site status board? Daniele: yes, extremely useful. Harry and Gonzalo: why do you re-route to FNAL-PIC? This is a PhEDEx feature that allows it to choose the most convenient path and does not depend on the quality of the transfers: PhEDEx constantly updates its view of the various site connections, rating them on the basis of their performance and speed (a toy sketch of this kind of link rating follows after this list).
  • ALICE: No large activity. Production has been ramped down in preparation for the first cycle of the next large MC production. xrootd is being deployed (over DPM) to all Tier 2 sites that will be in the production mask as soon as production restarts. The all-day testing activity is continuing.
  • ATLAS: The various problems reported last week are now fully under control. Simone reported that the current week of downtime at RAL is fairly worrying for ATLAS, and the fact that the window of this downtime keeps increasing should perhaps require a post-mortem analysis. Derek replied that this is still under investigation: they cleaned up rogue entries but still have no clue about the cause, which might reside in the Oracle backend. Jeremy asked whether the large number of aborted jobs at the UK T2s will be redistributed elsewhere while RAL is unavailable; yes, they will. There was also a technical discussion about how to re-associate a T2 with another T1. Simone also presented next week's plans: FDR-2 phase-c will most likely start next Tuesday and run for the following three days. This is basically a full test that will also involve the ATLAS T1s (the previous phase-b involved only the T0). It is not much data (no special requests for sites), but it runs with the same priority as a data-taking-like activity. Shift procedures will also be exercised. Tomorrow 4 new dataset projects will be announced.
  • LHCb: CCRC-like activities are ongoing through DIRAC3, together with normal production still through DIRAC2. Problems with gfal_ls at CNAF due to CASTOR being in downtime (GGUS 39890) and with accessing the shared area (because of a shortage of GPFS disk servers). A problem accessing the CondDB at RAL from the Brunel application (correlated with the Oracle problem there), and a problem with libdcap at PIC (GGUS 39898), fixed as soon as LHCb started using the local dcap installation; Gonzalo reported that the dCache clients have been available on the WNs for three months now. NL-T1, GridKa and IN2P3 are running smoothly. Remarkably, at Lyon (with DIRAC2) LHCb managed to run a job accessing all its input files via xrootd; there are plans to use xrootd both at IN2P3 and SARA. General remark: at several sites we experienced failures accessing the ConditionDB information through CORAL, with jobs crashing; this is under investigation to exclude a problem on the server side. Jeff commented that his understanding is that xrootd has no real authorization except for the ALICE model, so if both LHCb and ALICE want to use xrootd they will likely need to agree that it is OK to see each other's data, or alternatively ask for a dedicated xrootd setup, which probably means a dedicated dCache, which we might choose not to do (it would cost money for another set of machines, and more manpower, as dCache is not so stable). Hence it is a good idea to ask SARA what is possible. Gonzalo also commented on the need to have a sensor for the CondDB service.
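The PhEDEx routing point in the CMS report above can be illustrated with a toy sketch: each link keeps an exponentially smoothed estimate of its recent rate and success fraction, and the router prefers the highest-rated source. This is purely illustrative Python with invented link names and numbers, not the actual PhEDEx router.

```python
# Toy illustration of rating site-to-site links by recent performance and
# picking the best source for a transfer. Not the actual PhEDEx router.

class Link:
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha     # smoothing factor for recent history
        self.rate = 0.0        # smoothed transfer rate (MB/s)
        self.quality = 1.0     # smoothed success fraction

    def record(self, mb_per_s, succeeded):
        self.rate = (1 - self.alpha) * self.rate + self.alpha * mb_per_s
        self.quality = (1 - self.alpha) * self.quality + self.alpha * (1.0 if succeeded else 0.0)

    def score(self):
        # Prefer links that are both fast and reliable.
        return self.rate * self.quality

links = {name: Link(name) for name in ["CERN->PIC", "FNAL->PIC"]}
for name, rate, ok in [("CERN->PIC", 80, True), ("CERN->PIC", 20, False),
                       ("FNAL->PIC", 60, True), ("FNAL->PIC", 70, True)]:
    links[name].record(rate, ok)

best = max(links.values(), key=Link.score)
print(best.name)  # the steadier FNAL->PIC link wins in this invented sample
```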

Sites round table:

JT: it is a bad idea for the experiments to hardcode the name of the CE. NIKHEF is going to retire its historical hostname tbn20.nikhef.nl. The Computing Element service at host tbn20.nikhef.nl (grid site NIKHEF-ELPROD) will be stopped next week. On Friday August 22 at 10:00, tbn20 will be removed from the information system; from then on, new jobs can no longer be submitted to tbn20, but already submitted jobs can still complete. On Tuesday August 26 at 10:00, the CE service at tbn20 will be stopped. A replacement CE service is already operational at host gazon.nikhef.nl, and an additional CE will soon be installed on host trekker.nikhef.nl. Relying on the BDII is always the better idea; a minimal discovery sketch follows below.
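To illustrate the "use the BDII rather than hardcoded CE names" point, a client can discover the CEs a site publishes by querying the BDII over LDAP. This is a hedged sketch using the third-party ldap3 Python library and the Glue 1.x schema as commonly published at the time; the BDII host shown is an assumption and attribute names may differ per deployment.

```python
# Sketch: discover published Computing Elements for a site from a BDII via LDAP,
# instead of hardcoding hostnames such as tbn20.nikhef.nl.
# Assumptions: the 'ldap3' library is installed, the BDII host below is reachable,
# and the site publishes Glue 1.x GlueCE objects.
from ldap3 import Server, Connection, ALL

BDII_HOST = "lcg-bdii.cern.ch"    # assumed top-level BDII endpoint
BDII_PORT = 2170                  # standard BDII LDAP port

server = Server(BDII_HOST, port=BDII_PORT, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind

# Glue 1.x: CE endpoints are published as GlueCE objects under o=grid.
conn.search(search_base="o=grid",
            search_filter="(&(objectClass=GlueCE)(GlueCEInfoHostName=*nikhef.nl))",
            attributes=["GlueCEUniqueID"])

for entry in conn.entries:
    print(entry.GlueCEUniqueID)   # e.g. a CE published at gazon.nikhef.nl
conn.unbind()
```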

Core services (CERN) report:

Nothing

DB services (CERN) report:

The scheduled intervention on the ATLAS ONLINE database, which was supposed to finish before three o'clock, has not done so yet because of some small problems. The intervention consists of fixing a corruption on the storage side by moving the database to new storage.

Monitoring / dashboard report:

Nothing

Release update:

Nothing

AOB:
Nothing

Thursday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
