Week of 110110

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  Note: the SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Cedric, Julia, Maria, Jamie, Manuel, Dirk, Peter, Ueda, Stefan, Maarten, Massimo, Ignacio, MariaDZ, David, Luca);remote(Michael, Jon, Gareth, Alessandro, Rolf, Gonzalo, Ron, Tore, Suijan Zhou, Rob, Daniele).

Experiments round table:

  • ATLAS reports -
    • CERN-PROD_LOCALGROUPDISK SRM errors GGUS:65932
    • Problem with BNL VOMS server GGUS:65944. Fixed. [ Michael - the problem was with the VOMS service, not the server. An admin mistakenly thought that a particular certificate was not used; once the original cert was restored yesterday morning, all OK. ]
    • LFC problem at GridKa : GGUS:65942

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
    • Tier1
      • No outstanding issues
    • Tier-2
      • No outstanding issues
    • AOB
      • Mail Gateway issue at CERN : affected sendmail from CMS hypernews, not user mails. One of these gateways was overloaded by an lxplus machine sending over 100,000 mails. CERN/IT is investigating who caused this volume of traffic.
      • CRC this week (starting tomorrow) : Stefano Belforte (connecting remotely...)

  • ALICE reports -
    • T0 site
      • Efforts ongoing to make AliEn v2.19 work for users this week. Production continues and CAF is available for analysis on limited, pre-staged data sets (new data sets can be requested).
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports - MC productions ongoing. Need to rerun the stripping for two streams (CHARM FULL and CHARM CONTROL). This is a very large activity over all 2010 data.
    • T0
      • NTR
    • T1 site issues:
      • IN2P3: still a problem installing software in their AFS area (vos release problem). An LHCb-IN2P3 meeting is currently being held to discuss the status of the shared area and plans for the future. [ VOS release has now been done ]

Sites / Services round table:

  • BNL - nta
  • FNAL - ntr
  • RAL - had been running with some ATLAS FTS channels turned down to 50% of nominal as one of the ATLAS areas was getting full. Changed this morning to 75% of nominal. Will restore to full asap.
  • CNAF - ntr
  • IN2P3 - nta
  • PIC - a pnfs glitch this morning due to human error. Caused dCache service to be down for a couple of hours.
  • NL-T1 - downtime next week Tuesday to attempt to switch Oracle over to new h/w (7 node RAC). Planned in December but had to be postponed due to faulty network h/w.
  • NDGF - ntr
  • ASGC - ntr
  • KIT - issue with FTS now fixed.
  • OSG - ntr

  • CERN DB - ntr
  • CERN storage - this morning did CASTOR CMS upgrade to 2.1.10. Also upgrading stager and srm DB to 10.2.0.5.
  • CERN Grid services - during last two weeks seen some degradation of batch system. Correlated to high rate of submission from grid ILC batch queues. Raised ticket to find out why. GGUS:65965

AOB:

Tuesday:

Attendance: local(Eddie, Roberto, Maarten, Jamie, Maria, Gavin, Miguel, Ueda, Stefan, Massimo, Julia, Simone, Jacek);remote(Tiju, Stefano, Jon, Ulf, Jeremy, Rolf, Suijan, Alessandro, Rob, Xavier).

Experiments round table:

  • ATLAS reports -
    • ATLAS distributed computing system downtime for database reorganization on 17-18 Jan,
      • start draining on 16 Jan, in the evening
    • ATLAS restarted a series of data transfer measurements over the full matrix (every site to every site)
      • except for the sites declared as not appropriate for the tests
      • sites may observe transfers from/to unusual sites
    • Please provide us the pointer to the CERN FTS monitor [ Gavin - it is https://fts-monitor.cern.ch/ ]

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still down [ Miguel - is there any Tier0 issue? A - no, waiting for new version of CMS SW to start reprocessing when new cosmic ray run. Not processing data, but not broken. ]
    • Tier1
      • No outstanding issues
    • Tier-2
      • No outstanding issues
    • AOB
      • CRC-on-duty : Stefano Belforte
      • Meeting to discuss Dashboard problems during Xmas break. For scheduled downtimes, someone is working on it 100% and will have a new version of the SSB collector in production in one week. Will report at next CMS Facilities meeting on Monday.

  • ALICE reports -
    • T0 site
      • Efforts ongoing to make AliEn v2.19 work for users this week. Production continues and CAF is available for analysis on limited, pre-staged data sets (new data sets can be requested).
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report
  • LHCb reports - MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • T0
      • NTR
    • T1 site issues:
      • IN2P3: After the SW was installed yesterday, MC jobs ramped up at the IN2P3-CC and IN2P3-T2 centers.
    • AOB - conditions DB SAM test was failing yesterday at 5 sites. (It is one of the tests in the critical availability set; for the time being it has been taken out of the critical test list.)

Sites / Services round table:

  • IN2P3 - ntr
  • RAL - FTS channels for ATLAS; updated now to full values
  • ASGC - ntr
  • FNAL - ntr
  • NL-T1 - downtime tomorrow for top level BDII, CREAM CE, WMS, to be migrated to more stable host
  • KIT - ntr
  • CNAF - ntr
  • GridPP - ntr
  • OSG - question: had people at FNAL who run VOMS for CDF and D0 and asked whether they should be registered in GOCDB. Who to contact? A: Tiziana Ferrari

  • NDGF - We have srm.ndgf.org pool software updates + tape system expansion tomorrow. AT_RISK has been scheduled, as some ATLAS and ALICE data might be unavailable. CSC T2 site has the ARC-CE node jade-cms.hip.fi down with hardware problems.

  • CERN VOMS - On Thursday 13th January at 10:00 CET the host certificate for lcg-voms.cern.ch hosting VOs dteam, cms, atlas, alice, lhcb and ops will be updated. A new lcg-vomscerts 6.3.0 was released before the new year.

  • CERN DB - this morning ALICE online DB down for 3h due to a power cut at the pit.

AOB:

  • Next WLCG T1SCM on Thursday 20 January - agenda to be circulated shortly.

Wednesday

Attendance: local(Jamie, MariaDZ, Ueda, Lola, Eva, Stefan, David, Gavin);remote(Jon, Francesco Norefini, Rolf, Joel, Michael, Onno, Tiju, Alessandro, Paolo, Suijan, Rob, Stefano).

Experiments round table:

  • ATLAS reports -
    • ATLAS will switch the certificate used for data transfers via DDM in the near future.
    • CNAF-BNL network problem (slow transfers) GGUS:61440 (since 2010-08-23)
      • The last update on 2010-12-17: "News from NRNs side: It seems the "Production Path" has been cleared. A commissioning phase is ongoing." [ Michael - production link has been re-engineered - a new circuit has been set up. Tests initiated by ESNET engineering look fine so far. They want to ramp up traffic so that it looks like production to see if there is packet loss. Engineering still working; ESNET, USLHCNET and GEANT involved. News in next days ]
      • ATLAS can perform some test transfers once the "commissioning" is done and the sites request it.

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still not running, waiting for next cosmic ray runs
      • CMS Dashboard issues: all big problems reported lately solved now. Still many Savannah tickets open. Will review at next Monday CMS ops. meeting.
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and started 8TeV Spring 11 samples. User analysis also keeps running in the 100K jobs/day zone. Starting to see resource saturation effects (jobs pending at some sites for several days)
    • AOB
      • CRC-on-duty : Stefano Belforte

  • ALICE reports - Problems with user job submission. Work still ongoing to fix. Big improvements but still not fully fixed. MonALISA web page not accessible since 13:30 due to a downtime of the web interface. Should have been down for 90 minutes but still not back.

  • LHCb reports - MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • T0
      • NTR
    • T1 site issues:
      • IN2P3: After the SW was installed yesterday, MC jobs ramped up at the IN2P3-CC and IN2P3-T2 centers.

Sites / Services round table:

  • NDGF: Our disk pool software upgrade will start soon. IMCSUL and IMCSUL-INF, two T2 sites, are gone from the internet; we don't know if it is a power or network error as we have not had any connection to the machines there for 3h. They have been put in downtime in GOCDB.

  • FNAL - ntr
  • CNAF - nta
  • IN2P3 - ntr [ Joel - didn't receive EGI broadcast saying intervention had started. BTW it is a downtime for LHCb, not an AT RISK. Rolf - this is business of portal people - don't know if there is a problem or not. ]
  • BNL - ntr
  • NL-T1 - announcement: next Tuesday Jan 18 scheduled downtime to move Oracle DB back to original h/w. Affects FTS and LFC ATLAS and LHCb.
  • RAL - ntr
  • ASGC - ntr
  • KIT - ntr
  • OSG - ntr. Was able to contact Tiziana about CDF and D0 VOs. In good shape there!

  • CERN DB - this morning ALICE online upgraded to Oracle 10.2.0.5. At the moment CMS ARCHIVE DB is being done. More scheduled for next week and the week after. LCG DB proposed for Tue 25, and next Monday the split of the ATLAS offline DB: DQ2, PanDA and prodsys move to a new cluster. 3 hours full downtime, and then these 3 apps will be available R/O until 18:00.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 07-Jan-2011

Topic revision: r10 - 2011-01-12 - JamieShiers
 