Week of 080519

Open Actions from last week:

Daily CCRC'08 Call details

To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN bat 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

Monday:

See the weekly joint operations meeting minutes

Additional Material:

Tuesday:

Attendance: local(Olof, Roberto, James, Jamie, Simone, Flavia, Julia, Andrea, Patricia);remote(Michael, Gonzalo, JT, Derek, Vincenzo INFN/Bari).

elog review:

Experiments round table:

  • LHCb: main issue since yesterday (since the vulnerability issue in the UK) & after the AFS upgrade: Andrew, using a UK-issued cert, could not run any transfers overnight/this morning; restarted using Robbie's credentials. Banned IN2P3: LHCb clients based on the gfal version currently in production (1.10.8) were crashing against IN2P3. We will move to the latest version (1.10.11) and check carefully whether this fixes the problem; if not, we will pass the ball to Remi & Co. Re-integrated the SARA SRM endpoint and running extremely smoothly - recovering the backlog quickly - until the credentials problem. JT - WNs there now. Not 100% yet... jobs will start to come. SRM CASTOR instance at CNAF - endpoint down, restarted, although in scheduled downtime today. Recons - PIC, RAL, CNAF, CERN smooth. Restarting at NL-T1. Lots of jobs waiting at FZK - exhausted share after many jobs over the w/e. Stripping - DaVinci crashing due to a memory leak. Ancestors info not available - still being investigated by the book-keeping experts.

  • ATLAS: CASTOR at CERN - problem appeared about 01:00. Not noticed until this morning. Ticket submitted at 09:00; problem cured before 11:00. Outage ~10 hours, but response time quick once reported. Meeting with CASTOR people this morning; discussed the two main issues observed during May. 1) Error message "exhausted # threads" - patch about to be deployed. Olof - the SRM was down overnight, not the CASTOR instance. Same problem as late last week - cgsi problem, all threads stuck in the gsoap layer, no possibility to create new threads. Still SLC3. 2) Second problem - also slowness of the stager for putDone / rm - to be patched tomorrow. SRM release today? Flavia - yes. Olof will check if this will be upgraded tomorrow (or immediately?). Simone - the throughput test starts tomorrow; 09-11 tomorrow is ok, later please postpone. Simone - this also affected some exports to BNL at the end of last week. Draining the queue of M7 transfers, will continue for a few hours, then set up the machinery for export tests. Mail to ATLAS T1s with rates etc., also the global rate (1.1-1.2 GB/s out of CERN). Follows the computing model as per usual. Sites can be over-subscribed to tape - also disk (except BNL...) to test if sites can take more. T2s will also participate based on shares given by clouds; T2s can also over-subscribe. MC production - finishing for FDR2. Tail - Taipei lost 170K files last week(!). Some critical for FDR2 - being recreated. Affects only 3 clouds - FR, US & CA.

  • ALICE: Got the SRM endpoint for the CASTOR area at RAL. All T1s ready for data transfer. AliEn finished the upgrade to 2.15 at CERN (central services); will begin installation at sites this week (not yet announced). Sites not yet fully ready: Birmingham (site manager hopes to be ok), Kosice (not expected to be ready). The remaining sites should be ok. Overlap with ATLAS? Simone - next week is kept for contingency. JT: NIKHEF failing (warning) the ALICE SAM test - VO box proxy renewal. Patricia - normally defined as critical; in some cases the test can fail but needs to be adapted to the new VO box. Removed from the critical tests for the moment pending migration; will be reinstated later. For the moment "don't worry".

Sites round table:

  • NIKHEF: (part of NL-T1) we have just increased our installed capacity by 120 TB compared to yesterday. Furthermore we now have in total 88 TB of space allocated to ATLAS. Problem with cooling last week, hence WNs off. Migration of power from backed-up to non-backed-up; WNs on "cheap power". Still not complete with this migration. Not clear yet how many will be able to be turned back on. Replacement part ("propeller-like thingy") stuck on a boat in the harbour in Marseille and delayed by a strike(!). Play it by ear. ATLAS say storage rather than WNs, please. DPM - problem with resize of a space reservation: this resets the lifetime to 1.7 days (1.6.7). Asked by ATLAS to upgrade - ongoing now.

Core services (CERN) report:

  • Some details about the upgrades tomorrow and Thursday - CMS asked for more detail on what is involved. Fixes the slow stager_rm - this hit ATLAS hard and could also hit CMS. Problem - an index was not properly handled, so a "hint" had to be added to avoid a full table scan. Simone - problems also caused by internal FTS logic (cleanup actions triggered if a file transfer fails).

DB services (CERN) report:

Monitoring / dashboard report:

  • Two instances of the collector were running in parallel by mistake - too much load on the DB. Affected the CMS dashboard. Killed and now ok. CMS running an analysis exercise; they want to follow the I/O rate between storage and WNs. Request to set up MonALISA for this. Hope to set this up this week.
  • ATLAS dashboard upgrade went smoothly yesterday as scheduled.

Release update:

AOB:

  • elog now available? Some problems due to config change. Down 14:00 - 14:30 - should be ok now.

Wednesday

Attendance: local(Nick, Jamie, Roberto, Patricia, James, Julia);remote(Michael, Derek, Simone).

elog review:

Experiments round table:

* ATLAS (Simone): throughput test started this morning - generation of data at 09:00. Problem with registration in the ATLAS catalog. First data after lunch. RAL/CNAF problems - solved in "no time". In the last hour all running smoothly; throughput as expected. SARA endpoint missing - in scheduled downtime, channel inactive. Roberto - intervention over. Simone - transfers queued in FTS - will contact the SARA people. Michael - are we currently at nominal? Only see 120 MB/s to BNL. CERN-BNL channel rather full: 4 active transfers, 26 ready. Will see from the FTS logs if the throughput per transfer is ok. FTS logs published but only from the day before. RAL - problem with a CA; they did not upgrade the CA. Graeme Stewart sent a top-priority e-mail; now fixed. CNAF - filesystem behind the StoRM front-end not mounted. Re-mounted and now works.
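Michael's question above is back-of-the-envelope arithmetic: 120 MB/s aggregate over 4 concurrent active transfers is 30 MB/s per transfer, so the question becomes whether per-file throughput or the number of concurrent slots is the limit. A minimal sketch (the 200 MB/s nominal target used here is an assumed example figure, not a number from the minutes):

```python
import math

def per_transfer_rate(aggregate_mb_s: float, active: int) -> float:
    """Average throughput each concurrent transfer is achieving."""
    return aggregate_mb_s / active

def slots_needed(nominal_mb_s: float, rate_per_transfer: float) -> int:
    """Concurrent transfers needed to reach the nominal rate, assuming the
    per-transfer rate stays constant as concurrency grows (optimistic)."""
    return math.ceil(nominal_mb_s / rate_per_transfer)

rate = per_transfer_rate(120.0, 4)   # 30.0 MB/s per transfer
print(rate)
print(slots_needed(200.0, rate))     # 7 concurrent transfers at the assumed target
```

If per-transfer rate is healthy (as here), raising the channel's active-transfer limit is the lever; if individual transfers were slow, more slots would only add contention.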

* LHCb (Roberto): observations to report - LHCb wants to confirm the same issue at RAL as ATLAS, an echo of the UK CA upgrade. Local contact confirmed the upgrade of certs on the CASTOR disk servers later today. GV plots - not transferring out of CERN. Problem with the LHCb online servers at the pit and thence out to T1s; not known when it will be fixed. Huge backlog of jobs to SARA & IN2P3. Smooth transfer rate over the last few days. IN2P3 - after 4 days still kept off the production mask due to gfal issues. Still facing issues with the gsidcap server. Filed an elog entry with a plot showing CPU efficiency over the last week for IN2P3 - essentially 0, which is not considered acceptable for a T1(!). Upgraded gfal to the recommended level but the issues are still there. Stripping - test jobs run happily with the patched version of DaVinci; hope to restart stripping this week. Recons fine at all T1s + T0 except IN2P3 and SARA. A few space-token issues at several sites - GGUS tickets opened; site admins very responsive. Policy for deploying new CAs - policy at JSWG. Should be fully transparent to the VO - clearly not the case. Keep old and new for a transition period - not the case.

* ALICE (Patricia): not making any production. Transfers: still making transfers with FTD from AliEn 2.14; the new version - 2.15 - is not yet being used. Two groups: 1) dummy files to check the channels and keep FTS active; 2) some real files to FZK and NDGF - working ok. Dummy files to CNAF - several issues, mostly due to the SE name in the FTD config (not the site). IN2P3 and SARA - the Lyon VObox upgrade was only certified today; hope transfers will start once 2.15 is in place. Likewise SARA will not make transfers until 2.15 is in place. RAL (new since Feb) - SRM endpoints provided yesterday; the elog issue can be closed. Info in the config file - testing the channel for the first time.

Sites round table:

Core services (CERN) report:

  • LCG OPN: We need to upgrade the software of the two CERN routers that connect to the Tier1s (there is a bug that affects the correct routing of the backup paths in some cases). It is necessary to reboot them, which means a downtime of 5 minutes each. We will first upgrade one, then a few days later we'll proceed with the second.

    Would it be OK if we schedule one reboot for Monday 26/5 and one for Wednesday 28/5? Or do you prefer we postpone them to later dates?

DB services (CERN) report:

Monitoring / dashboard report:

  • Moved gridmap onto new h/w - modern mid-range server. Gets about 25K hits a day. Should be faster!

  • A draft of the report to next week's MB is attached to this page. See the break-down of various service issues seen last week (and reported on this). Did the existing monitoring / logging / reporting pick up the issue? Are there specific actions?

Release update:

AOB:

Thursday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

  • RAL: Please note that RAL is closed for a public holiday on Monday 26th. This means that any problems caused by the LCG OPN intervention above will be dealt with by the on-call system and responses will not be as swift as on a normal working day.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

-- JamieShiers - 16 May 2008

Topic attachments

  • Can_FTS_transfers_to_BNL_be_looke_into__-_Primary_link_CERN_-_BNL_is_down.pdf (53.5 K, 2008-05-21, JamieShiers) - Follow-up on transfers CERN-BNL (Simone, Michael)
  • ccrc08-May27.ppt (2056.0 K, 2008-05-22, JamieShiers) - Draft of weekly report to MB of 27th May - monitoring / logging / reporting post-mortem of issues reported at this week's MB (last week's issues!)
  • ESnet-problem-05-21.pdf (21.7 K, 2008-05-22, JamieShiers)
Topic revision: r6 - 2008-05-22 - JamieShiers