Week of 100816

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Miguel, Jean-Philippe, Cedric, Ricardo, Edward, Lola, Luca, Dirk, Harry, Carlos, Alessandro);remote(Michael/BNL, Gang/ASGC, Catalin/FNAL, John/RAL, Ian/CMS, Daniele, Jens/NDGF, Vladimir/LHCb, Angela/KIT, Kyle/OSG, Ron/NLT1, Riccardo/CNAF, Pepe/PIC).

Experiments round table:

  • ATLAS reports -
    • Tier-0 - Central Services:
      • T0 : A file disappeared from the CASTOR namespace (ticket 61175). Miguel asks for more information, such as the filenames.
      • The person whose certificate is used on the Panda machines was removed from the ATLAS VO. A new certificate needs to be installed.
    • Tier-1:
      • KIT : Software area issue: the vendor intervened on Friday afternoon, but the problem could not be fixed before the weekend. In agreement with KIT, it was decided to start with a clean new file system and reinstall all ATLAS releases. Some PBS problems were finally solved on Saturday, and the site was back online on Sunday.
      • KIT : Transfer problems from 1 UK T2 (UKI-NORTHGRID-SHEF-HEP) and 3 FR T2s (GRIF-LAL, GRIF-IRFU, GRIF-LPNHE) (tickets 61128 and 61134). Error: "[SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://blahblahblah/srm/managerv2]". At the time the tickets were submitted, all sites were up and running. This looks like a network issue (see also the GRIF-IRFU->SARA problems mentioned on Friday, ticket 61049). The problem seems to become more frequent now that we are doing inter-cloud transfers (T2 of cloud A -> T1 of cloud B). Who should address it? (A minimal SRM-endpoint reachability sketch follows this ATLAS report.)
      • CNAF : SRM problem (ticket 61130). Actually "we had a network problem which impacted some ATLAS services like the conditions DB, LFC and the site FTS. We are investigating to understand if it could be the cause of these job failures." No problems with LFC/FTS were noted.
      • BNL : WN blackholes (removed from production).
    • Tier-2:
      • GRIF-IRFU : SRM errors (not the same problem as mentioned above).
        Two unrelated issues: some WNs have lost the software mount (ticket 61090), and a staging problem (ticket 61083).

      • ASGC : "Some disk servers have insufficent bandwidth for data transfers to other T1's (CNAF, FZK, LYON, SARA). We are upgrading bandwidth and some disk servers are affected by this action(c2fs081, c2fs082, c2fs083).". FZK, LYON looks OK now (no more backlog from ESD replication).
      • Failing transfers to SARA from GRIF-IRFU (ticket 61049). This is again network issues. Could it be related to the use of star channels in FTS? Will be used more and more.
      • RAL fully online in Panda now (was brokeroff till yesterday) after the fix of the lost files.
    • Tier-2:
      • Nothing to report.
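
A minimal reachability sketch for the HTTP_TIMEOUT transfer errors reported above (an illustration only, not a WLCG tool: the endpoint name is a placeholder standing in for the one elided in the ticket excerpt, and the 30 s timeout and default port 8443 are assumptions):

import socket
from urllib.parse import urlsplit

# Probe an SRM endpoint of the form httpg://host:port/srm/managerv2, as quoted
# in the FTS error above. Endpoint and timeout are placeholders/assumptions.
def probe_srm(endpoint="httpg://srm.example.org:8443/srm/managerv2", timeout=30):
    """Return True if a TCP connection to the SRM host/port can be opened."""
    parts = urlsplit(endpoint.replace("httpg://", "https://", 1))  # httpg is HTTP over TLS
    host, port = parts.hostname, parts.port or 8443
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True                # port reachable; a timeout seen later is then
    except OSError as exc:             # more likely an SRM/FTS or WAN-path problem
        print("cannot reach %s:%s -> %s" % (host, port, exc))
        return False

if __name__ == "__main__":
    print("SRM port reachable:", probe_srm())

If the port is reachable from both clouds while FTS still times out, the problem more likely lies in the SRM service itself or on the wide-area path, which supports the "network issue" reading above.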

  • CMS reports -
    • Central infrastructure
      • Nothing to report
    • Experiment activity: Luminosity over the weekend was around 250 pb-1.
    • Tier1 Issues
      • Two site issues:
        • sr #116264: CE and JobRobot error in T1_DE_KIT. Interacting with site admins. Seems to be a problem with the batch system. Impacting some CMS Pre-production activities
        • sr #116278: SAM "mc" and "lcg-cp" tests failing at T1_FR_CCIN2P3. SRM seems to have been down for IN2P3 over the weekend.
    • Tier2 Issues: no big issues.
    • MC production: new pre-production MC is ongoing.
    • AOB: more reports than normal about sites dropping out of the BDII. It is not clear whether this is a central issue or a site issue. The sites were DESY and some Polish sites. Site tickets have been submitted, no central ticket yet. Could the BDII expert look at the problem anyway?

  • ALICE reports -
    • Production:
      • T0 site
        • Nothing to report
      • T1 sites
        • Nothing to report
      • T2 sites
        • Some issues found at some T2 sites.

  • LHCb reports -
    • Experiment activities:
      • No data. Reconstruction, merging and MC.
    • Issues at the sites and services
      • T0 site issues:
        • First RAW file unavailable GGUS 61157 (solved)
        • xrootd problems GGUS 61184. Need help from LHCb to reproduce the problem.
        • Jobs failed at the same worker nodes GGUS 61176
      • T1 site issues:
        • IN2P3 LHCb can not use SRM GGUS 61156 (solved)
        • IN2P3 RAW files unavailable
        • CNAF many jobs failed at the same time
      • T2 site issues:

Sites / Services round table:

  • BNL (comment from Michael about the black holes): nothing wrong was found on the machines, but the URLs could not be accessed with wget. To be investigated.
  • ASGC: ntr
  • FNAL: ntr
  • RAL: ntr
  • NDGF: ntr
  • KIT:
    • PBS problems this weekend: they seem similar to the problems seen at the beginning of the year (high CPU load; nodes needed a reboot). Is this a consequence of the NFS problems seen by ATLAS last week? Systems are stable now.
    • There will be a restart of the Atlas LFC because of a new host certificate to be installed. Should only be a few seconds interruption.
    • SAM test failures for LHCb SW area. Want to move the SW area to another server. When? Vladimir says that it is ok today or tomorrow from LHCb point of view.
  • NLT1:
    • Oracle RAC problem: failover did not work as well as expected.
    • One pool node crashed.
  • CNAF: investigating the transfer problems between CNAF and BNL. Transfers seem to work well with 1 Gb interfaces but not with 10 Gb interfaces. However, the problem is not seen for local transfers using the same interface. It is not MTU related.
  • PIC: ntr
  • OSG: 3 GGUS tickets received and being processed. Also working on understanding why some SMS messages from GGUS were lost.

  • Ricardo/CERN: acknowledged the LHCb ticket; working on it.
  • Luca/CERN DB:
    • Streaming to CNAF stopped last week and was restarted on Saturday. The cause was a bad switch at CNAF.
    • LHCb online DB restarted this morning.
    • The ATLAS offline DB was overloaded by DQ2, possibly due to an HTTP server crash; ATLAS will investigate.

AOB:

Tuesday:

Attendance: local(Harry, Ricardo, Jean-Philippe, Elena, Miguel, Luca, Patricia, Simone, Alessandro, MariaD, Edward);remote(Michael/BNL, Ronald/NLT1, Kyle/OSG, Vladimir/LHCb, John/RAL, Gang/ASGC, Roger/NDGF, Pepe/PIC, Jeremy/GridPP, Andreas/KIT, Ian/CMS, Luca/CNAF).

Experiments round table:

  • ATLAS reports -
    • Tier-0 - Central services
      • The problem with atlas-cc reported yesterday is related to the fact that the machines were not reachable between 12:00 and ~2pm yesterday. Machines were removed from the load balancing.
    • Tier-1:
      • CNAF: For the BNL-CNAF transfer problem, there is no misconfiguration of the servers. Still investigating.
      • PIC: problem with the LFC catalogue. DDM activity was down for 4 hours. It did not affect production and analysis. GGUS:61211 is solved: the lcg-vomscerts-5.9.0-1.noarch package was missing in the LFC Oracle gLite 3.2 repository and was installed manually.
      • NDGF-T1: transfer problems have been seen since 2010-08-13 (GGUS:61125, GGUS:61205). The former has not been acknowledged for 4 days.
      • The problem with transfers from T2s in the UK and FR clouds to FZK reported yesterday: the channels were unblacklisted and transfers succeeded.
    • Tier-2:
      • INFN-NAPOLI-ATLAS: problem with T0 export to the CALIB disk (GGUS:61093). Some of the errors are due to the fact that files larger than 4 GB take more than 3600 seconds to be transferred to the site. The ATLAS muon calibration experts are asked to consider limiting the file size to 2-3 GB. There are currently 18 parallel transfers, which may be too much for a 1 Gb connection, although a few weeks ago the data rate to Napoli was much higher (a back-of-the-envelope estimate follows this report).
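
A back-of-the-envelope estimate for the Napoli calibration transfers discussed above (a sketch: the 1 Gbit/s link, the 18 parallel transfers, the >4 GB files and the 3600 s limit are taken from the report, while the 70% link-efficiency factor is an assumption):

# Rough throughput estimate for the INFN-NAPOLI-ATLAS calibration transfers.
LINK_GBIT_PER_S = 1.0        # site connectivity quoted above
PARALLEL_TRANSFERS = 18      # concurrent transfers quoted above
FILE_SIZE_GB = 4.0           # smallest of the problematic files
TIMEOUT_S = 3600.0           # per-file limit quoted above
EFFICIENCY = 0.7             # assumed usable fraction of nominal bandwidth

link_bytes_per_s = LINK_GBIT_PER_S * 1e9 / 8 * EFFICIENCY
per_transfer_bytes_per_s = link_bytes_per_s / PARALLEL_TRANSFERS
ideal_time_s = FILE_SIZE_GB * 1e9 / per_transfer_bytes_per_s
implied_rate = FILE_SIZE_GB * 1e9 / TIMEOUT_S          # rate at which 4 GB takes 3600 s

print("per-transfer share : %.1f MB/s" % (per_transfer_bytes_per_s / 1e6))
print("ideal time for 4 GB: %.0f s (limit %.0f s)" % (ideal_time_s, TIMEOUT_S))
print("rate implied by hitting the limit: %.1f MB/s" % (implied_rate / 1e6))

Even under this idealised sharing each stream only gets a few MB/s, so a >4 GB file needs well over ten minutes; the observed >3600 s corresponds to barely ~1 MB/s per file, i.e. there is additional contention or overhead beyond the simple bandwidth split.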

  • CMS reports -
    • Central infrastructure
      • Nothing to report
    • Experiment activity: changing the number of colliding bunches later in the week will result in a temporarily higher trigger rate. Not expecting problems, but we may see higher resource utilization. Miguel: will this generate more files or larger files? Mainly more files. The data rate increase could be anywhere between 25% and a factor of 2.
    • Tier1 Issues
      • Two site issues:
        • sr #116264: CE and JobRobot errors in T1_DE_KIT. Interacting with site admins. Seems to be a problem with the batch system, impacting some CMS pre-production activities. Solved; waiting for Data Ops confirmation to close.
        • sr #116278: SAM "mc" and "lcg-cp" tests failing at T1_FR_CCIN2P3. SRM seems to have been down for IN2P3 over the weekend. Closed.
    • Tier2 Issues: no big issues.
    • MC production: new pre-production MC is ongoing. First samples are out.
    • AOB: Peter Kreuzer takes over as CRC tomorrow.

  • ALICE reports -
    • T0 site
      • A decrease in production was observed this morning. The software installation service of AliEn (PackMan) was having problems on voalice13, stopping the production. This was solved at 11:00, and the number of jobs increased afterwards.
      • All CREAM-CE systems have been checked this morning
    • T1 sites
      • FZK: according to the information published by the CREAM-CE systems at the site this morning, no CREAM system was available. The issue was reported to the site responsible this morning. Andreas: the incorrect information published in the Information System was fixed a few minutes ago.
    • T2 sites
      • Italian federation: all sites were tested yesterday; several issues were found and reported to the ALICE contact person in Italy.

  • LHCb reports -
    • Experiment activities:
      • No data last day. Reprocessing (5K) and analysis (5K) jobs.
    • T0 site:
      • Miguel: 2 filesystems used by LHCb have been lost. No data loss, but need to recall data from tape, which could lead to delays in getting access to files.
      • Jean-Philippe: the LFC for LHCb at CERN is overloaded: many unfinished transactions produce timeouts and keep server threads busy unnecessarily. To be investigated (a client-side sketch of the pattern follows this LHCb report).
    • T1 sites:
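
A minimal client-side sketch of the pattern discussed above for the LFC overload (not the actual LHCb/DIRAC code; it assumes the standard LFC Python bindings, module "lfc", with lfc_startsess/lfc_endsess and lfc_statg/lfc_filestatg, and the host name is a placeholder): keep catalogue sessions short and always close them, so server threads are not left tied up by unfinished client work.

import lfc   # standard LFC Python bindings (assumed to be available on the client)

def stat_files(paths, host="lfc-lhcb.cern.ch"):       # hypothetical LFC host alias
    """Stat a batch of LFNs inside one short LFC session."""
    results = {}
    lfc.lfc_startsess(host, "batched stat")           # one session for the whole batch
    try:
        for path in paths:
            st = lfc.lfc_filestatg()
            if lfc.lfc_statg(path, "", st) == 0:
                results[path] = st.filesize
            else:
                results[path] = None                  # missing or inaccessible entry
    finally:
        lfc.lfc_endsess()                             # always release the server thread
    return results

The same idea applies to write transactions (lfc_starttrans/lfc_endtrans/lfc_aborttrans): commit or abort promptly, so that no transaction is left half-finished while holding a server thread.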

Sites / Services round table:

  • BNL: ntr
  • NLT1:
    • tape sub-system problem at SARA: impossible to read data from tape. Simone: is writing to tape affected? Will check.
    • kernel upgrade on storage nodes at SARA
  • RAL: ntr
  • ASGC: ntr
  • NDGF: works fine after dCache upgrade
  • PIC: everything working for Atlas now
  • GridPP: ntr
  • KIT: still problems with high load on some worker nodes, disturbing the batch system and possibly adding delays to qstat processing.
  • CNAF: still working on the BNL-CNAF transfer problems and working with LHCb on job debugging.
  • FNAL: testing new load balanced SRM
  • OSG: Again problems with CERN BDII not publishing OSG sites. Expert (Laurence) contacted. GGUS ticket 61206 has been updated by Ricardo. Priority of the problem has been increased but better monitoring is still to be implemented.

  • Luca/CERN DB: performance problem with DB serving CMS dashboard. Being investigated.
  • Maria: ATLAS team ticket 61203 should not have been assigned to CASTOR in PRMS, but it is not clear to whom it should go, as the problem is linked to the LST2010 tests (EOS). Will ask Dirk.

AOB:

Wednesday

Attendance: local(Miguel, Elena, Peter, Gavin, Jean-Philippe, Patricia, MariaD, Edoardo, Luca, Simone, Alessandro, Lola, Edward);remote(Angela/KIT, Michael/BNL, Pepe/PIC, Kyle/OSG, Vera/NDGF, Vladimir/LHCb, Tiju/RAL, Catalin/FNAL, Onno/NLT1).

Experiments round table:

  • ATLAS reports -
    • Tier-0 - Central services
      • Oracle RAC for ATLAS Offline DB was unavailable around midnight. Email sent to PhysDB.Support and atlas-dba. High load. Problem being investigated.
    • Tier-1:
      • SARA-MATRIX: the database behind FTS (and LFC) service at SARA is down (GGUS:61265). SARA-MATRIX is in unscheduled DT. DB being restored from backup. Vendor called to investigate why the corruption happened.
      • NDGF-T1: small transfer problems are still seen and GGUS:61125 still hasn't been acknowledged. The site is at RISK today.
      • The problem with transfers to FZK from T2s in the FR cloud (IN2P3-LPSC, GGUS:61257) and in the IT cloud (INFN-FRASCATI) is seen again. Under investigation.
    • Tier-2:
      • INFN-NAPOLI-ATLAS: problem with T0 export to the CALIB disk (GGUS:61093). Following the discussion at the daily meeting, and as agreed by management, the export of detector data from CERN to NAPOLI CALIBDISK has been stopped indefinitely. All subscriptions of data10 to NAPOLI CALIBDISK have been cancelled. Functional tests will be kept.

  • CMS reports -
    • Central infrastructure
      • nothing to report
    • Experiment activity: LHC is changing the number of bunches (48 circulating and 36 colliding bunches), maybe already this evening. As a result, CMS expects to see higher rates (400-500 Hz versus the usual 250-300 Hz); however, CMS Offline and Computing have agreed that this can be handled for the moment.
    • Tier1 Issues
      • Problem staging in and staging out files at T1_TW_ASGC (no ticket so far): the reason was identified as a wrong setup of the T1production role at the site. Experts are working on the issue.
      • sr #116264: the CE and JobRobot errors in T1_DE_KIT from yesterday were related to an unstable batch system. Resetting the problematic worker nodes resolved the trouble (GGUS ticket 61127 was solved on Aug 17, 17:00 UTC).
    • Tier2 Issues
      • T2_BR_UERJ is in scheduled downtime; however, it is not reported as being in downtime in the CMS Site Status Board (no "worker symbol" set). This is actually a general and long-standing open issue affecting CMS Computing Operations; see the comments under AOB below.
    • MC production
      • New pre-production MC is ongoing.
    • AOB
      • As the example reported in 115207 shows, downtime information from OIM is not always properly propagated to GOCDB, which the SSB uses as an information source. The solution would be to develop an additional SSB collector that queries OIM directly. The Dashboard team promised an urgent solution.
      • This is very important for CMS Computing Operations, e.g. for the Site Readiness reporting or the Computing Shift Monitoring. We fully rely on the Dashboard team for such issues, since the downtime information is fully embedded in the SSB.
      • In addition, CMS requests that unscheduled downtimes also be included in the CMS SSB reporting.
      • Kyle: a GGUS ticket is needed if OSG has to do something about this problem.
      • Pepe: actually this problem affects all sites in OIM
      • Alessandro: Atlas has implemented a way to query both GOCDB and OIM. Better to not duplicate effort.

  • ALICE reports -
    • T0 site
      • Replacement of the xrootd redirector voboxes (voalice07, 08 and 09). The hardware request was submitted today.
      • Raw reconstruction activities ongoing with no incidents to report
    • T1 sites
      • Ticket GGUS:61228 was submitted yesterday to RAL concerning access to the CASTOR system (permission denied while opening files; more details in the ticket). It is a permission problem on a directory. CASTOR support at RAL would prefer that AliEn set the permissions correctly. Being discussed with Latchezar.
    • T2 sites
      • Usual operations, no remarkable issues

  • LHCb reports -
    • Experiment activities:
    • Issues at the sites and services
      • LFC problem:
        • Ricardo Graciani: From these plots we can see how both CNAF and SARA were "released" at 11:45 and started to transfer after some time of not being able to do so. Since the CNAF LFC was not active (and we have agreed that the LFC was the cause of the jobs getting stuck), it is likely that the source of the problem was the LFC at NIKHEF. The problem suddenly cleared at 11:45 UTC (yesterday) and the stuck jobs got released. At CNAF there was a huge number of affected jobs.
        • The last plot (1-week view for SARA and CNAF) shows that the problem with the NIKHEF LFC started on the morning of the 16th and was solved one day later. Some jobs at CNAF managed to execute since they were able to use other instances.
        • The candidate LFC for having caused the problem is lfc-lhcb.grid.sara.nl.
        • Onno: this problem is not related to DB corruption as it started earlier.
        • Onno: the LFC crashed at SARA after an Oracle cluster upgrade on Monday; the problem was not noticed the same day because this particular NAGIOS plugin is not fully operational yet. Jean-Philippe: please check if there is a core dump available. Vladimir: would like to have a SIR about this incident.

Sites / Services round table:

  • KIT: ntr
  • BNL: ntr
  • PIC: ntr
  • NDGF: ticket has been updated
  • RAL: ntr
  • FNAL: ntr
  • NLT1: ntr on top of what was reported before. Alessandro: how long will it take to restore the DB? Don't know, will check. Luca/CERN can help if necessary.
  • OSG: problems with old data on BDII. Ticket 61206. Laurence and Ricardo working on it.

  • Luca/CERN DB: ATLAS DQ2 DB: one of the nodes (node 5) had a high load. The node was rebooted. SLS reported 0% availability while actually the 4 other nodes were still providing the service.

AOB:

Thursday

Attendance: local(Elena, Jean-Philippe, Harry, Laurence, Edward, Patricia, Pablo, Peter, Jacek, Alessandro, Miguel, MariaD, Ricardo);remote(Michael/BNL, Xavier/PIC, Alexander/NLT1, Kyle/OSG, Angela/KIT, Gang/ASGC, Jeremy/GridPP, Tiju/RAL, Catalin/FNAL, Roger/NDGF).

Experiments round table:

  • ATLAS reports -
    • Tier-0 - Central services
      • Problem with file registration at the T0. The service was restarted. Under investigation.
    • Tier-1:
      • SARA-MATRIX: the database behind the FTS and LFC services at SARA is corrupted (GGUS:61265). SARA-MATRIX is in unscheduled DT. We set the whole NL cloud offline in production and analysis. The DB is still being recovered.
      • NDGF-T1: the dCache fix was deployed yesterday. No major problem.
      • The problem with transfers from T2's in FR cloud (IN2P3-LPSC, GGUS:61257) and in It cloud (INFN-FRASCATI, GGUS:61270) to FZK is under investigation.
    • Tier-2:
      • No major problem with T2s

  • CMS reports -
    • Central infrastructure
      • nothing to report
    • Experiment activity
      • Still waiting for first LHC fill with 48 circulating and 36 colliding bunches.
    • Tier1 Issues
    • Tier2 Issues
      • nothing outstanding to report
    • MC production
      • New pre-production MC is ongoing.
    • AOB
      • Follow-up of the OSG --> SAM --> Dashboard downtime reporting issue with Pablo Saiz and the RSV SAM group. No further ticket was opened, since there were already 3 about this case:
        • 115207
      • BDII publication issue (cont'd)
        • As already mentioned by CMS in this meeting, we remind the concerned people at CERN that the top-level BDIIs at CERN continue to be out of date -- they are almost one day out of sync according to CERN's own monitoring (see the example in GOC ticket 9090). Do the CERN gLite WMSes have their own top-level BDIIs? Potentially this affects job ranking via the WMS and leads to stale SSB data, not to mention that lately a larger number of sites than normal has lost connection with the CERN BDII, which knocks out their SAM tests. This is therefore important and urgent for us (a minimal query sketch follows this report).
        • This worry was also brought up at the OSG council meeting, where experts discussed a potential mismatch between the OSG BDII and CERN BDII operations policies (see the OSG BDII SLA here).
        • Pablo: the OSG maintenance status is not propagated. The MyOSG web site could be used as a backup source. This will be available at the end of next week.
        • Laurence: the issue about out of sync BDIIs has been understood (exception not caught) and a fix has been implemented.
        • Ricardo: the fix has already been deployed at CERN on one server. Will be deployed on the others if ok. Working on better monitoring.
        • Michael/BNL: worried about the support for this service. OSG puts a lot of effort into providing a 24x7 service, but it seems that this is not the case elsewhere for this kind of service. SLS should be rediscussed at the MB.
        • Ricardo: the main problem was the insufficient monitoring
        • Alessandro: as there are often missing entries in BDII, the client applications should implement some kind of caching as done in FTS.
        • Ricardo: sometimes the problems are also due to the CERN firewall overload.
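
As a minimal illustration of the kind of check discussed above (a sketch, not an official monitoring tool: it assumes the python-ldap package, the conventional top-level BDII endpoint lcg-bdii.cern.ch:2170 and the usual Mds-Vo-Name=<SITE>,o=grid layout; "EXAMPLE-SITE" is a placeholder), one can query a top-level BDII directly and see which services it currently publishes for a given site:

import ldap   # python-ldap

def site_services(site, bdii="ldap://lcg-bdii.cern.ch:2170", base="o=grid"):
    """List the GlueService entries a top-level BDII publishes for one site."""
    conn = ldap.initialize(bdii)
    conn.simple_bind_s()                       # BDIIs accept anonymous binds
    try:
        # Sites are published under Mds-Vo-Name=<SITE>,o=grid; an
        # ldap.NO_SUCH_OBJECT here means the site has dropped out of this BDII.
        return conn.search_s("Mds-Vo-Name=%s,%s" % (site, base),
                             ldap.SCOPE_SUBTREE,
                             "(objectClass=GlueService)",
                             ["GlueServiceType", "GlueServiceEndpoint"])
    finally:
        conn.unbind_s()

if __name__ == "__main__":
    for dn, attrs in site_services("EXAMPLE-SITE"):
        print(dn, attrs.get("GlueServiceType"))

Following Alessandro's remark, a client built around such a query could also cache the last non-empty answer and fall back to it when a site briefly disappears from the BDII.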

  • ALICE reports -
    • T0 site
      • A new CREAM submission module is being tested at CERN. It increases the timeout limit for the glite-ce-job-status command and in addition reinforces multi-CREAM-CE submission. It will be tested simultaneously at CERN and Subatech (France) until next week.
    • T1 sites
      • T0-CNAF transfers: ALICE observed the old error again this morning: "Unable to write file; multiple files exist." It was reported to the ALICE contact person at the site, who confirmed the update to the latest version of xrootd on the server. The issue appeared when restarting the services and should be solved now.
    • T2 sites
      • No remarkable issues to report

  • LHCb reports -
    • Experiment activities:
    • Issues at the sites and services
      • T0 site issues:
      • T1 site issues:
        • LFC at pic was not allowing any queries between Wed Aug 18 2010 17:48:18 GMT+0200 and Thu Aug 19 2010 07:32:37 GMT+0200.
      • T2 site issues:

Sites / Services round table:

  • BNL: ntr
  • PIC: killer queries again on ATLAS LFC due to too many FQANs in the proxy. PIC should increase the bug priority if a fix is urgent.
  • NLT1: ntr
  • KIT: ntr
  • ASGC:
    • A bad configuration caused CMS production failures. Fixed.
    • will be in downtime on Friday for network reconfiguration.
  • GridPP: ntr
  • RAL: ntr
  • FNAL: ntr
  • NDGF: ntr
  • OSG: ntr except the BDII issue already discussed above.

AOB:

Friday

Attendance: local(Elena, Peter, Jean-Philippe, Miguel, Luca, Alessandro, Edward, Patricia, Massimo, Ricardo);remote(Michael/BNL, Daniele, Xavier/PIC, Ronald/NLT1, Catalin/FNAL, Gang/ASGC, Riccardo/CNAF, Tiju/RAL, Jeremy/GridPP, Xavier/KIT, Kyle/OSG).

Experiments round table:

  • ATLAS reports -
    • Physics Run with 48 bunches per beam
    • Tier-0 - Central services
      • No problem
    • Tier-1:
      • SARA-MATRIX: the database behind the FTS and LFC services at SARA is corrupted (GGUS:61265). SARA-MATRIX is in unscheduled DT. The NL cloud is offline in production and analysis. T0 export is redirected to other T1s. All details are in the GGUS ticket. Ronald: all attempts to restore the DB have failed so far. Trying to reload an older backup.
      • INFN-T1: LFC unavailable. Alarm GGUS:61305 was sent at 17:23 and acknowledged at 21:30 (UTC). The problem was fixed by restarting the LFC daemons at 0:30. Production and analysis in the IT cloud were off and T0 export to DATATAPE was switched off. Now everything is back to normal.
        • GGUS ALARM ticket: did you receive the SMS? Riccardo: there was a problem with the SMS alarm system as well.
      • IN2P3-CC: SRM unavailable due to dCache problems. Alarm ticket GGUS:61313 was submitted at 22:20 (the site had been in unscheduled downtime since ~22:00 but we did not notice it). T0 export to IN2P3-CC_DATATAPE was switched off. The problem was solved by 7:25 UTC. There was a memory error on the main component.
      • BNL: SRM problem. It was noticed by the BNL team and fixed in 1.5 h. Thanks! The problem was caused by a faulty DNS mapping file.
      • TW: SRM performance degraded: one SRM server died, causing some timeouts, but the T1 remained functional.

    • ATLAS had 30% fewer resources yesterday because of the above problems.

  • CMS reports -
    • Central infrastructure
      • 2 nodes of the Oracle RAC for CMS Offline services crashed this morning (9:15 and 12:30). CMS called the expert on call after the second crash (SLS monitor here). Several critical CMS central services (Frontier, DBS) were affected for some minutes, and Offline DB streaming was down for 5-10 minutes, causing some latency (30 minutes) on condition data. The situation was fully recovered at 13:20.
    • Experiment activity
      • Very long physics run with 48 bunches, which started on Aug 19 at 23:30. Input rates to CMS were below 400 Hz and everything went smoothly in terms of Offline Computing.
    • Tier1 Issues
      • T1_TW_ASGC : SRM has been unstable since Aug 19, causing transfer inefficiencies (e.g. transfers from CERN, see Savannah ticket 116325, as well as SAM errors). Site admins are still working on the issue with help from the CERN CASTOR team. Gang: the SRM service was restored 3 hours ago.
      • T1_DE_KIT : was in announced UNSCHEDULED downtime (SRM/DCache Outage) [2010-08-19, 21:13:00 [UTC] to 2010-08-20, 08:00:00 [UTC]]
        • however the CMS Site Downtime Google Calendar did not reflect the downtime, hence generating 2 false alarms... (Savannah 116319+GGUS 61312, Savannah 116324)
        • Our apologies to KIT for these false alarms !
        • The only good news of this issue is that the Savannah-GGUS bridging mechanism, including the closure of the original Savannah ticket, worked perfectly !
    • Tier2 Issues
      • The central CMS Frontier/Launchpad server glitch mentioned above caused the CMS SAM CE (frontier/squid) tests to fail at many sites. A retroactive correction will be made if needed for the Site Readiness ranking evaluation.
    • MC production
      • On-going.
    • AOB

  • ALICE reports -
    • T0 site
      • No issues have been found with the new CREAM submission module installed at this site yesterday
      • ALICE confirmed the update of CASTORALICE to the latest CASTOR version, 2.1.9-8 (a transparent update lasting about 1h30). The proposed date is Tuesday 2010-08-31, 9:00.
    • T1 sites
      • Transfers to CNAF again failing (GPFS problems?)
    • T2 sites
      • No remarkable issues to report. New CREAM module will probably be installed at Torino T2 this afternoon.

  • LHCb reports -
    • Experiment activities:
    • Issues at the sites and services
      • T0 site issues:
      • T1 site issues:
        • IN2P3 SRM out last night (solved)
        • SARA Oracle DB. NIKHEF and SARA banned.
      • T2 site issues:

Sites / Services round table:

  • BNL: the SRM log showed problems with DNS last night: the gridFTP doors have two network interfaces (internal and external), and reverse DNS entries were missing for the external interfaces. Quickly fixed (a DNS consistency-check sketch follows the site reports below).
  • PIC: ntr
  • NLT1: on Monday morning, both dCache and DPM will be at risk for maintenance.
  • FNAL: ntr; asking about the status of the new BDII software deployment. As no problem seen at CERN on the first node, the software will be deployed on the other nodes at CERN today.
  • ASGC: ntr
  • CNAF: still investigating the slow transfers between CNAF and BNL. The LAN has been checked: no problem with the LAN nor up to the GARR starpoint. iperf shows no problem to NIKHEF either, but GARR sees the same problem when they run tests from their starpoint to BNL. Michael has been contacted. Comment from Michael: this needs further investigation with the providers on the path between CNAF and BNL; ESNET is still looking; USLHCNET should be involved as well.
  • RAL: ntr
  • GridPP: ntr
  • KIT: one dCache storage door migrated to a new server. This should be transparent if the hostname is not hardcoded in some client application/configuration. The old door is still running but will be switched off next week.
  • OSG: Belle experiment now registered in GGUS. So tickets should be automatically assigned to OSG.
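
A minimal sketch of the class of check relevant to the BNL DNS issue above (an illustration only: the door host names are placeholders, and real door names and their external interfaces would come from the site configuration): verify that each gridFTP door has consistent forward and reverse DNS entries.

import socket

DOOR_HOSTS = ["door01.example.org", "door02.example.org"]   # hypothetical door names

def check_dns(host):
    """Return (ok, detail) for forward/reverse DNS consistency of one host."""
    try:
        addr = socket.gethostbyname(host)                    # forward (A) lookup
    except socket.gaierror as exc:
        return False, "forward lookup failed: %s" % exc
    try:
        reverse_name = socket.gethostbyaddr(addr)[0]         # reverse (PTR) lookup
    except socket.herror as exc:
        return False, "no reverse entry for %s: %s" % (addr, exc)
    if reverse_name.rstrip(".").lower() != host.rstrip(".").lower():
        return False, "reverse entry %s does not match %s" % (reverse_name, host)
    return True, addr

for door in DOOR_HOSTS:
    ok, detail = check_dns(door)
    print("%-25s %s %s" % (door, "OK" if ok else "FAIL", detail))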

  • Luca/CERN DB:
    • ATLAS incident:
      • Node 3 of the ATLAS database cluster crashed and restarted between 13:30 and 13:50, which affected the Panda DB services. The Panda DB services relocated to the surviving nodes after the reboot. Node 3 was back in production at 13:50. Investigation of the root cause is in progress.
      • Replication to SARA: the SARA DB is still being recovered. We have performed a split of the Streams setup so that replication to the other Tier-1s is not affected by the SARA downtime.
    • CMS incident:
      • This morning we had two periods of reduced capacity for the CMS offline DB services. Related to that, there were two issues with Streams, which for the conditions streaming led to a delay of up to 40 minutes in the online -> offline replication.
      • Investigations of the root causes are in progress. Apologies for any inconvenience this may have caused.
  • Miguel: the upgrade of CASTOR at CERN to 2.1.9-8 during the technical stop has been announced in GOCDB.

AOB:

-- JamieShiers - 06-Aug-2010
