Week of 110124

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, David, Eva, Ignacio, Jamie, Jan, Jarka, Maarten, Maria D, Yuri);remote(Gareth, Gonzalo, Jon, Michael, Onno, Paolo, Rob, Suijian, Vladimir, Xavier). Apologies: Stefano

Experiments round table:

  • ATLAS reports -
    • T0/CERN, Central services
      • SLS/Panda-monitor issue: the SLS monitoring showed the Panda-monitor status in grey from ~20:00 on Friday, while Panda-mon itself works OK. SLS experts informed. Elog 21410, 21414, 21428.
      • CERN-PROD_EOSSCRATCHDISK file transfer failures on user subscriptions. GGUS:66431 in progress: the EOSATLAS SRM (BeStMan) service had died yesterday evening. The service has been restarted and the issue will be fixed via a patch during the day.
    • T1
      • IN2P3-CC_DATADISK ~180 file transfer failures: destination errors due to HTTP connection to SRM broken at ~3:30 on Saturday. GGUS:66407.
      • PIC: missing files on the MCDISK space token caused file transfer failures on Saturday at ~10:00; under investigation, GGUS:66409 in progress. Also ~600 job failures with "Failed open file in the dCache. File not online. Staging not allowed.", due to a corrupted ZFS file system on a dCache pool at PIC (dc005).
      • SARA job failures continued (many group-production tasks are affected) due to missing input files. BUG:76267 and GGUS:66347 updated. These tasks need to be pre-assigned to NIKHEF because it holds the T1 AOD. The Panda developers have been informed and may provide a long-term fix in Panda.
      • NDGF to BNL LOCALGROUPDISK: ~200 file transfer failures with TRANSFER error FIRST_MARKER_TIMEOUT. GGUS:66416 assigned on Saturday at ~22:30.
        • Michael: NDGF-BNL transfers failed during a short time window in which limited bandwidth appears to have led to timeouts: probably a short network glitch
      • RAL: ~100 file transfer failures to UKI-LT2-RHUL on Sunday, source/destination errors: SRM_ABORTED. GGUS:66422 reopened, CASTOR experts informed.
      • FZK scheduled downtime: cloud was set offline, Savannah:118813.
        • Xavier: the second downtime was caused by 2 switches that broke down during the first downtime; they are OK now
  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • no issue
    • Tier1
      • No outstanding issues. Reprocessing of the 2010 data is going on at full steam.
    • Tier-2
      • No outstanding issues. MC production is going on at full steam, completing the Fall 10 MC and starting the 8 TeV Spring 11 samples.
    • AOB
      • Due to the CMS Workshop on Data/Storage Access, the CRC cannot attend the call. Please let me know by mail/twiki if there are questions or problems for CMS and we will follow up.

  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • KIT: problem with SE under investigation
        • Xavier: 1 file server suffered 2 kernel panics, under investigation
      • IN2P3: work in progress to synchronize read-only AFS volumes for SW area as done at CERN
    • T2 sites
      • KISTI back
      • GRIF_IRFU back, but MonALISA not working due to firewall changes
      • KFKI down because of network problem

  • LHCb reports -
    • Experiment activities:
      • Only MC running so far. Submitted the request for stripping. MC09 clean-up campaign ongoing.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • GridKA: Unscheduled Downtime after the scheduled one.
      • T2 site issues:
        • IC: jobs killed after running out of wall clock time, which is set equal to the max CPU time on their queue.
      • Gareth: very large number (~10k) of DIRAC jobs waiting, quite unusual: is it OK?
        • Vladimir: such problems are usually due to an incorrect number of waiting jobs being published in the BDII (the published values can be checked with a query like the sketch below)
        • Gareth: will check
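
For illustration of the point above, a minimal sketch of how the queue numbers published in the BDII can be inspected. This is not part of the minutes: it assumes the ldap3 Python package is installed, that lcg-bdii.cern.ch:2170 is used as the top-level BDII endpoint, and the 'lhcb' filter is only an example.

    # Query the top-level BDII for the waiting-job counts published per CE queue.
    # Assumptions: ldap3 is installed; lcg-bdii.cern.ch:2170 is a reachable BDII.
    from ldap3 import Server, Connection, ALL

    server = Server("lcg-bdii.cern.ch", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # BDII queries use an anonymous bind

    # GlueCEUniqueID identifies the CE queue; GlueCEStateWaitingJobs is the value
    # that job/pilot schedulers read back from the information system.
    conn.search(
        search_base="o=grid",
        search_filter="(&(objectClass=GlueCE)(GlueCEUniqueID=*lhcb*))",  # example filter
        attributes=["GlueCEUniqueID", "GlueCEStateWaitingJobs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateWaitingJobs)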

Sites / Services round table:

  • ASGC - ntr
  • BNL - nta
  • CNAF
    • at-risk network intervention on Thu
  • FNAL - ntr
  • KIT - nta
  • NLT1 - ntr
  • OSG
    • installed capacities were missing in the monthly T2 report, should be OK from now on
  • PIC
    • 1 disk pool for ATLAS became unavailable during the weekend due to ZFS corruption; ticket open with Sun/Oracle
      • Alessandro: which token?
      • Yuri: MCDISK
      • Alessandro: list of files?
      • Gonzalo: will do
  • RAL
    • outage for CASTOR DB upgrade next Monday
    • outage for CMS CASTOR next Mon-Tue to upgrade disk servers to SL5, allowing checksums to be enabled

  • CASTOR
    • short intervention on ATLAS head node to change LSF master went OK; the same will be done for ALICE on Tue
    • downtime for LHCb on Wed to upgrade DB to 10.2.0.5 and stager to 2.1.10
  • dashboards
    • all dashboards unavailable Tue 09:30-12:00 due to LCGR DB upgrade to 10.2.0.5
    • DDM dashboard upgrade moved to Wed 10:00
  • databases - nta
  • GGUS
    • see AOB

AOB: (MariaDZ)

  • Sent to all GGUS Support Units last week. Effective 2011/01/26: the new release of the GGUS portal will be made available on Wednesday, 26 January 2011. Therefore the GGUS system will be in a downtime at risk from 07:00 to 08:30 UTC. The new features will be announced through a broadcast right after the release. The downtime has been registered in GOCDB: https://goc.gridops.org/portal/index.php?Page_Type=View_Object&object_id=21963&grid_id=0 Please also check the ongoing work list & release notes: https://gus.fzk.de/pages/owl.php
  • A presentation on answering GGUS tickets will be given tomorrow at 3:30pm in room 513-R-068, specifically for the CERN IT Grid service managers (who currently use PRMS).

Tuesday:

Attendance: local (AndreaV, Jamie, Gavin, Jarka, David, Steve, Maarten, Eva, Nilo, Stefan, MariaDZ, Roberto, Jan, Alessandro, Massimo); remote(Jon, Andreas, Xavier, Ronald, Francesco, Suijian, Rob, Jeremy, Rolf, Gonzalo, Oliver, Joel, Vladimir).

Experiments round table:

  • ATLAS reports -
    • PIC: Storage issue GGUS:66409, update 25 Jan 10:53 - One pool of 100TB has been lost. Contacted Sun Microsystems support. The pool is unrecoverable. Preparing the list of lost files, excluding files with duplicates in other pools; expected in 24-48h given the large number of files involved (around 513,000).
    • DB maintenances:
      • Today: Morning - LCGR upgrade (affects VOMS, FTS, LFC, and dashboards at CERN). Afternoon: Conditions DB upgrade.
      • Tomorrow, 2pm-5pm: Oracle upgrade on the ADCR DB, all ADC services will be unavailable. Production and analysis jobs will start draining before 8am; DDM will stop around noon tomorrow.
      • [Jamie: should we not understand the DB problem on LCGR better before deciding what to do with the ATLAS DB? Eva: the database is unavailable because one internal Oracle table is still being processed. Oracle advises that we simply wait for this table to be processed, which should happen between 4pm and 4.30pm. Another option would be to stop the processing and recover the instance, but this may end up taking longer. The standby database is being prepared in case it is needed to switch to it. The problem may have been that the firewall caused the clusterware to hang; this is still being investigated. The plan at the moment is to go on with the ATLAS intervention anyway. Jamie: should we check if ATLAS also has such a big table? Eva: no, this is an internal Oracle table. Jamie: can we set a decision point around 4pm to decide what to do if the database is not yet ready (probably better to reopen the service today than to wait till tomorrow)? Eva: ok, a decision will be taken at 4pm.]

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Move of disk servers from the CASTOR pool T0Streamer to T0Export is in progress: Remedy:736898
      • These are needed by the end of the week for the start of the HeavyIon zero-suppression pass.
      • [Massimo: all machines are ready for the new pool. The delay is due to some problems in their reinstallation; this should be fixed soon, i.e. it should not be in the way of CMS's plan to use them by the end of the week. Will give a status update tomorrow.]
    • Tier1
      • No outstanding issues. Last MC reprocessing pass in the tails.
    • Tier-2
      • No outstanding issues. MC going on full steam, 8TeV Spring 11 samples.

  • ALICE reports -
    • T0 site
      • Nothing to report
      • [Maarten: there has been progress in moving from SLC4 to SLC5, will say more tomorrow.]
    • T1 sites
      • KIT: problems with the SE are still there. Experts are working on it
    • T2 sites
      • GRIF_IRFU: negotiating the firewall configuration

  • LHCb reports -
    • Experiment activities:
      • MC productions. No issues. We have received a mail, without any prior consultation, telling us that ALL our VOBOXes and some disk servers will be unavailable on the 3rd of February. This is absolutely impossible for us. We would like to be consulted before any such decision is taken. [Gavin: this is being followed up by vobox support.]
    • New GGUS (or RT) tickets:
      • T0: 1 (ALARM)
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Intervention on Oracle DB (scheduled) affecting many important critical services (VOMS/LFC/FTS). After the intervention VOMS could not be accessed (GGUS:66462). [Joel: initially a message had been received that the intervention had finished, and only later (after LHCb sent an alarm ticket) another message that the downtime had been extended. This is the second time that this happens.]
        • [Joel: when will VOMS be back? Eva: as discussed earlier about ATLAS, a decision will be taken at 4pm.]
      • T1 site issues: ntr
      • T2 site issues: ntr

Sites / Services round table:

  • RAL (Gareth): have been in contact with LHCb regarding the large queue of batch jobs that is waiting to run.
  • FNAL: ntr
  • NDGF: had a minor intervention on queues today, no user should have been affected
  • KIT: ntr
  • NLT1: ntr
  • CNAF: ntr
  • ASGC: ntr
  • OSG: ntr
  • GridPP: ntr
  • IN2P3: pre-announcement of a major intervention on tape storage during the whole week of February 21-25. A shorter intervention on dCache will also happen during those days, and another outage on February 8. [Joel: thanks to IN2P3, one outstanding issue is finally closed (LHCb is now able to install the software on the shared area). Another issue is pending but it is not specific to IN2P3.]
  • PIC: following up a problem with an ATLAS pool. Engineers from the vendor are still trying to recover the ZFS pool. A query to dCache to find the list of affected files is also ongoing; the result will be reported in the GGUS ticket. [Andreas: what was the problem with ZFS? Gonzalo: will prepare a report about this. Presently suspect a correlation with the intervention on UPS power last week, which had included the power-off of the disk system: on powering back on, two of the four controllers had problems and the hardware had to be replaced.]

  • vobox: changing DNS alias to a vobox, do not expect any users to be affected
  • databases: reminder, intervention on ATLAS DBs tomorrow
  • Castor: intervention for LHCb tomorrow; intervention for ATLAS today
  • Grid: move of the dteam VO from CERN to Greece will be completed by tomorrow
  • dashboard: all down due to LCGR problem, this is being followed up

AOB: (MariaDZ)

Wednesday

Attendance: local (AndreaV, Jamie, Jarka, Maarten, Massimo, Ignacio, Roberto, Jan, Edoardo, Eva, Nilo, Alessandro, Ricardo, Stefan, David); remote(Andreas, Jon, Xavier, Onno, Rolf, Tiju, Suijian, Rob, Francesco, Andreas, Oliver, Vladimir).

Experiments round table:

  • ATLAS reports -
    • Couple upgrades scheduled for today:
      • 7:00-8:30am CET: GGUS upgrade.
      • 10am-11am CET: ATLAS DDM dashboard upgrade.
      • 2pm-5pm CET: ADCR DB Oracle upgrade. No ATLAS Grid service will be running.
        • All queues set offline at 8am to drain before the downtime.
        • Panda services, DDM stopped at 1pm.
        • ATLAS is ready for the ADCR Oracle upgrade.
    • Nikhef reported an ATLAS user abusing the home directory (GGUS:66472). Submission of jobs with the production role to Nikhef was temporarily disabled; it is now enabled again for the queue 'bij'. The issue is being followed up by ATLAS.
    • T1 downtimes for today:
      • FZK-LCG2 (until 6pm UTC) Maintenance of many services, e.g. firmware upgrade on disks and controllers as well as OS updates on border routers and reconfiguration of GPFS file systems. Full site down.
      • PIC Storage issue GGUS:66409, update 26 Jan 10:50 - Storage data corruption at PIC. The affected data size has increased, as two dCache pools in the same DDN unit are now damaged. Sun Microsystems and DDN storage engineers are working on the incident. The origin of the problem was located on Friday 21 at 17:00, when one of the storage servers rebooted. The list of lost files is still being prepared. The major worry now is to understand how to avoid this type of incident in the future.
    • Yesterday's DB upgrades finished. Many thanks!
      • Yesterday's ATLR upgrade on schedule.
      • Yesterday's LCGR upgrade: can CERN IT DBAs please summarize the issue and lessons learned? Thank you. [Eva: created an LCGR post-mortem. As discussed yesterday, the unavailability was due to the slow progress in the deletion of an internal Oracle table created during the upgrade. The script completed at 4.10pm and the database was back shortly before 5pm local time. Still investigating why this table was created (this was not observed on other databases). Oliver: will make some comments during the CMS report.]

  • CMS reports -
    • CERN
      • LCGR database upgrade issues from yesterday
        • Some comments about the information flow (only mild complaints about the confusion of the time stamps, posted mainly for post-mortem purposes)
          • CMS was affected (significantly, I would say: VOMS, FTS); I didn't complain yesterday because ATLAS already made all the relevant points
            • in particular, a tutorial planned at FNAL ran into problems because proxies could not be created; it would have been fine if the upgrade had been completed as planned
          • the SSB listed that an update would be given at 4 PM (assumed CET) [Eva: yes this was CET, i.e. local time.]
          • at 5 PM CET, no update had been given yet
          • opened TEAM ticket GGUS:66470 to ask for an update, no answer yet (it was first wrongly assigned back to CMS, wonder what I did wrong)
          • the update on the SSB was listed for 16.55, although it really appeared at about 5.55 PM CET
          • the DB was back and VOMS proxy requests succeeded again
          • [Jamie: the experiments are right, we should improve the information flow to provide timely and correct information. This will be followed up with all groups and will be discussed at one of the next T1 service coordination meetings. In particular, we should keep in mind that people outside CERN are also affected.]
          • [Oliver: suggest to always specify CET if this is the time zone, as otherwise it may be confusing for remote users (who might assume UTC, for instance); see the conversion sketch after this report.]
      • Castor pool move from T0Streamer to T0Export
        • Any update? I understood yesterday that it cannot be guaranteed that the pools are moved by the end of the week (contrary to the minutes). [Jan: a few machines have been correctly installed, this gives us confidence that all should be ok by the end of the week.]
      • [David: for info, the SAM test history shows that tests were not executed at FZK since January 22, this is being followed up.]
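
To illustrate Oliver's point above about always quoting the time zone, a minimal sketch (not part of the minutes; it assumes Python 3.9+ with the zoneinfo module and uses Europe/Zurich for CET/CEST) of publishing an announcement time in both local time and UTC:

    # Express the "4 PM" update time unambiguously in local time and UTC.
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+; may need the tzdata package on some platforms

    local = datetime(2011, 1, 25, 16, 0, tzinfo=ZoneInfo("Europe/Zurich"))  # 4 PM CET
    print(local.isoformat())                            # 2011-01-25T16:00:00+01:00
    print(local.astimezone(timezone.utc).isoformat())   # 2011-01-25T15:00:00+00:00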

  • ALICE reports -
    • T0 site
      • 16 new CAF nodes have been added to the ALICE CAF and configured this morning.
      • Migration of voalicefs01-05 from SLC4 to SLC5 ongoing
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • MC productions. No issues. We have received a mail, without any prior consultation, telling us that ALL our VOBOXes and some disk servers will be unavailable on the 3rd of February. This is absolutely impossible for us. We would like to be consulted before any such decision is taken. [Roberto: this has been discussed and the intervention has been postponed.]
    • New GGUS (or RT) tickets:
      • T0: 1 (ALARM) - STILL OPEN... why?
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Intervention on Oracle DB (scheduled) affecting many important critical services (VOMS/LFC/FTS). After the intervention VOMS could not be accessed (GGUS:66462), ticket still open... [Checked after the meeting that this is now closed.]
        • intervention on castorlhcb
      • T1 site issues: ntr
        • GridKa was announced as "AT RISK", but the message said "site fully down", so it should have been an OUTAGE, no? [Andreas: this may have been a bug in GOCDB or an error by myself. Also sent a mail this morning to wlcg-operations to announce the interventions.]
        • [Roberto: the issue with pilot jobs at RAL has been understood and fixed.]
      • T2 site issues: ntr

Sites / Services round table:

  • NDGF: ntr
  • FNAL: ntr
  • PIC: update about the issue with the ZFS pools. Still ongoing: getting the list of affected files and understanding where the corruption came from. Will provide a SIR with more details once this is understood.
  • NLT1: update on GGUS:66287 submitted by LHCb. A CernVMFS update was installed on the worker nodes. Local tests show that this should improve performance significantly. Did LHCb observe any speedup? [Vladimir: will check and reply on the ticket.]
  • IN2P3: ntr
  • RAL: had to kill ALICE jobs that had been hanging for a week. [Maarten: if they were hanging for a week, they were probably not important.]
  • ASGC: ntr
  • OSG: had problems on Tue in preparing the weekly report, as this depends on a URL that changed. Where is this normally announced? [Maarten: suggest to contact Laurence/Joanna. Rob: this was done, it has been discussed with the developers. Maarten: suggest to discuss these issues at the monthly interoperability meeting organised by Anthony from FNAL.]
  • CNAF:
    • Tomorrow there will be a network intervention.
    • Today a GGUS alarm was received; this was just a test and will be closed, but did not manage to close it yet. [Maarten: may be due to GGUS upgrade, suggest to contact the GGUS team.]
  • Gridka: maintenance ongoing, expect to finish by 6pm local time.

  • Network: two interventions tomorrow
    • Link to Bologna will be upgraded from 10 to 20 Gbps. [Alessandro: as the CERN to CNAF bandwidth is doubled, is there any discussion about increasing the bandwidth from CNAF to other sites? Edoardo: not tomorrow, but yes, this is being discussed.]
    • Will reroute link to Triumf, it will be down 4pm to 6pm tomorrow.
  • Castor: upgraded stager this morning to 2.1.10 and oracle to 10.2.0.5
  • Grid:
    • dteam move has not happened this morning, should happen this afternoon
    • will check about the alarm tickets

AOB: none

Thursday

Attendance: local (AndreaV, Jamie, MariaDZ, Jarka, Gavin, Steve, Eva, Alessandro, Jan, Eddie, Mike, Maarten); remote (Jon, Rolf, Ronald, Kyle, Gareth, Andreas, Foued, Paolo, Suijian, Gonzalo, Andreas, Vladimir, AndreaS, Daniele).

Experiments round table:

  • ATLAS reports -
    • ATLAS has a new data project tag, data11_2p76TeV. Please note that this data project tag replaces data11_2p8TeV. [Alessandro: this will be useful for a better organization of the data at T1 sites.]
    • FZK-LCG2: ATLASSCRATCHDISK locality unavailable, GGUS:66614
    • RAL-LCG2 ATLASDATADISK: 1 file with checksum mismatch, GGUS:66709.
    • PIC storage status: GGUS:66409 status update at ~9am CERN time: we are more optimistic now, though the exact amount of lost files is still unknown. We finally got in touch with a high-level ZFS expert from Sun support yesterday, who worked on recovering the filesystems for several hours; by yesterday evening it seemed the filesystems could be mounted. We are now trying to save as many files as possible to new pools. As soon as we know the final list of files that have been lost, we will publish it.
    • Maintenances yesterday:
      • ADCR downtime was on schedule and successful, thank you very much!
      • Dashboard and DDM Central Catalog upgraded successfully, many thanks!

  • CMS reports -
    • Apologies, Oli is on a plane right at the time of this meeting and therefore has a hard time connecting
      • I filled the TWiki and will follow up later.
    • CERN:
      • Update on Castor pool move from T0Streamer to T0Export?
        • Yesterday Jan was positive that they solved the installation problems and that they can add the pools to T0Export by the end of the week. [Jan: the pool is now installed as of 9am this morning and the ticket has been closed.]
    • KIT T1
      • downtime from yesterday over, still seeing some services down (PhEDEx)
      • The remark yesterday about SAM CE tests not running since Jan 22 might also be related to the two downtimes on Monday and yesterday, which required draining of the queues. Still being followed up by the KIT site contacts and Andrea Sciaba.

  • ALICE reports -
    • T0 site
      • Migration of voalicefs01-05 from SLC4 to SLC5 ongoing
    • T1 sites
      • KIT: is back in production and running ALICE jobs
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • MC productions. Re-stripping launched yesterday
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1
      • T2: 0
    • Issues at the sites and services
      • T0
        • Experienced issues accessing data files on the lhcbrdst SvcClass and also resolving tURLs. One part of the problem is a wrongly formed tURL that triggers a d2d copy (bug in Gaudi), but there is something not yet fully understood in the communication between CASTOR and xroot once this copy is performed. Ponce is investigating that. [Jan: should open a GGUS ticket about the Castor issue. Vladimir: ok, will do.]
      • T1 site issues:
        • GridKa: shared area issue (not mounted after the intervention), GGUS:66700. All jobs are failing; SAM reports this as well.
        • NIKHEF: problem with CernVM-FS, with some software not properly propagated down to their cache. They need to kill the CVMFS process for the lhcb mount point on each WN and remount it. NIKHEF was banned to drain the jobs there. [Ronald: killing the CVMFS processes is now completed and LHCb can start using the site again.]
      • T2 site issues: ntr

Sites / Services round table:

  • FNAL: ntr
  • IN2P3: ntr
  • NLT1: nta
  • OSG: two issues
    • Flavia opened a ticket about the BDII. [Jon: are the missing Glue attributes mandatory now? They had been declared as optional. Maarten: please check this with Flavia. She is trying to clean up the IS so that all sites publish 'reasonable' information.]
    • Did not receive the alarm that was expected. [MariaDZ: will check this and update the minutes. Checked after the meeting: the issue is discussed in https://savannah.cern.ch/support/?118839 ]
  • RAL: reminder about interventions next Monday, the Oracle DB for Castor will be updated (queues drained on Sunday) and the CMS diskservers will be upgraded to 64bit O/S to enable checksums (will finish on Tuesday).
  • KIT: full day of intervention yesterday. This took longer than expected; the downtime was extended by 1.5 hours. Some controllers were still in bad shape after the intervention, which explains the issues seen by ATLAS, CMS and LHCb. The engineers are working on this, all should be solved in 2 hours from now.
  • CNAF: intervention ongoing, will finish this afternoon.
  • ASGC: ntr
  • PIC: update on the ATLAS pools. Good news: with the help of an engineer, were able to mount the filesystem and see the files yesterday evening. Started to move the data (200TB) before the system can be restored. Should be able to recover most of the files, but do not have yet a full list of the affected files. It will take a few days to fully recover. [Alessandro: thanks for the followup. Did not observe yet a high percentage of failures. If they do, will exclude PIC from reading and update the GGUS ticket.]
  • NDGF: ntr

  • Services: ntr

AOB: (MariaDZ)

  • Results of the GGUS-to-Italian-ticketing-system problem shown with the test ALARM to CNAF here: https://savannah.cern.ch/support/?118839
  • Requested functionality from GGUS entered operation yesterday. You may now upgrade a TEAM ticket into an ALARM one IF you are not only a TEAMer but also an authorised alarmer. Details in http://savannah.cern.ch/support/?118153. [Alessandro: should we try this out? MariaDZ: I have already tested this, please do not make another test, just start using it when you really need it.]
  • GGUS Issues
Ticket | Experiment | Creation date | Assigned to | Last update | Status | Comment
GGUS:61440 | ATLAS (CNAF-BNL transfers) | 2010/08/23 | OSG(Prod) | 2011/01/19 | In progress | No more news since the mid-Jan report at the daily meeting. If no longer an issue, please remove it from the ATLAS twiki.
GGUS:66409 | ATLAS | 2011/01/22 | NGI_Ibergrid | 2011/01/27 | In progress | dCache pool corruption with a loss of 200 TB of files. 1M files affected; checking whether the files have duplicates in other pools is turning out to be a lengthy process. (A. Pacheco, PIC)
GGUS:66389 | CMS | 2011/01/21 | NGI_DE | 2011/01/27 | In progress | CRAB analysis access problem at DESY-HH.
GGUS:66394 | CMS | 2011/01/21 | NGI_DE | 2011/01/27 | In progress | CREAM CE authentication problems. Applies also to tickets 66396, 66398, 66400, 66401, 66402 and 66403.
GGUS:66441 | CMS | 2011/01/24 | VO Support (CMS) | 2011/01/27 | Waiting for reply | Unexpected VO CMS/certificate expiration related to the BELNET signing certificate. Was in the wrong status; changed by MariaDZ.
GGUS:66521 | CMS | 2011/01/26 | VO Support (CMS) | 2011/01/26 | Assigned | CMS software installation ticket for BG05-SUGrid.
GGUS:66470 | CMS | 2011/01/25 | ROC_CERN | 2011/01/26 | In progress | TEAM ticket, to sort out communication issues in the aftermath of the LCGR database upgrade.
  • Comments on the GGUS Issues above:
    • [Alessandro about the CNAF-BNL issue: this is not yet closed, but we do not need to discuss it every Thursday.]
    • [Daniele about the CMS tickets: will remind the computing coordinators to have a look.]
    • [Alessandro about the list of issues from ATLAS and CMS: it is important to have a look at this table, but it would be useful to sort the issues by priority and filter out the less relevant ones, so that we only discuss the most important ones.]

Friday

Attendance: local (AndreaV, Jamie, Gavin, Jarka, Stefan, Steve, Ignacio, David, Mike, Maarten, Alessandro, Marcin); remote (Jon, Xavier, Xaver, Rolf, Kyle, Tiju, Jeremy, Suijian, Onno, Paolo, Christian, Vladimir, Oliver).

Experiments round table:

  • ATLAS reports -
    • PIC ATLASDATADISK - a user had no permission to write into the path /pnfs/pic.es/data/atlas/atlasdatadisk/... GGUS:66744 - fixed: looks like a dCache gridftp bug: in dCache the ATLAS user is the directory owner, while in PNFS root is the directory owner. Manually fixed by the site admin. Thank you!
    • PIC storage status: GGUS:66409 status update at ~noon CERN time: data extraction is going well, and we hope to have the list of affected files today. Taking into account the stats from the last 24 hours, the fraction of corrupted files is of the order of 0.3% (3 per mille); extrapolating, this means a total of about 2.5k files out of the one million files that were resident on the two affected pools (see the rough estimate sketch after this report). We are thinking about strategies to serve files soon, but the conservative approach means a week (move 200TB of data to a different pool, repair/change the disks and then move the data back; unfortunately we do not have 200TB of spare dCache disk at this time, which would ease the work enormously, and the data would be online now). The PIC cloud is still offline for production and analysis.
    • FZK-LCG2: disk pools almost all online, the remaining pools are promised to be back during the morning. Can FZK please update us? The site was enabled for production and analysis in the morning. [Xavier: three pools remain offline with disk I/O errors.]
    • Taiwan-LCG2: Emergency CASTOR downtime, fixed already.
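
For the PIC numbers above, a rough back-of-the-envelope check (not part of the minutes; the 0.3% figure is taken to match the "3 per mille" failure rate PIC reports in the sites round table below, and the file count is the quoted one million):

    # Estimate the number of lost files from the observed failure fraction.
    files_resident = 1_000_000   # files resident on the two affected dCache pools
    failure_fraction = 0.003     # ~3 per mille of accesses failing over the last 24h

    estimated_lost = files_resident * failure_fraction
    print(f"Estimated lost files: ~{estimated_lost:,.0f}")  # ~3,000 (the report quotes ~2.5k)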

  • CMS reports -
    • CERN
      • CERN IT-DB operator informed at 5 AM about reboot of CMSR2 DB machine
        • communication went as planned, excellent, CRC was called
        • impact on PhEDEx which recovered automatically
      • Castor: pools have been added to T0Export, many thanks.
    • T1_FR_CCIN2P3
      • SAM CE test errors GGUS:66740, admins are following up
    • GGUS ticket list from Thursday
      • most tickets have been worked on this week, only one has been updated to ask for a status update

  • ALICE reports -
    • T0 site
      • Configuration of the new CAF nodes has been finished.
    • T1 sites
      • RAL: GGUS:66711. There were globus_ftp_client errors in the log of the AliEn CE service on lcg-alice.gridpp.rl.ac.uk. Maarten discovered that there was a misconfiguration of the GLOBUS port range. The problem seems to be with one of the CEs (lcgce09). Experts are working on solving the issue.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • MC productions. Re-stripping.
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • xroot server down. [Ignacio: sorry, forgot to enable the alarm after the upgrade on Wednesday, hence this was not restarted automatically. This is in GGUS:66731.]
      • T1 site issues:
        • IN2P3: most re-stripping jobs failed. Under investigation. [David: the SAM tests for CE and SRM fail at IN2P3, probably due to this issue; is there a GGUS ticket for this? Vladimir: will open a GGUS ticket.]
        • GridKa: FTS problem.
        • It seems that, due to the re-stripping jobs, we overloaded the SRM.
      • T2 site issues: ntr

Sites / Services round table:

  • FNAL: ntr
  • KIT: intervention on CE4 this morning, as planned. There were problems in the dCache connection to tapes: jobs were jamming dCache and LHCb had problems with production jobs. This is now fixed.
  • PIC: the system has now restarted; we observe only a ~3 per mille job failure rate. As there are 1M files (on a 2PB system), this indicates that around 3k files are lost. The full list of lost files is not yet available; it should be by today or tomorrow.
  • IN2P3: about the CMS issue with SAM tests, there is a GGUS ticket assigned, based on a Savannah ticket that is already closed; is this normal? [Oliver: no, not normal, asked Peter to close the GGUS ticket.]
  • OSG: two points.
    • The alarms were resent yesterday and all went as expected.
    • Flavia consolidated all the information into a single GGUS ticket.
  • RAL: ntr
  • GridPP: ntr
  • ASGC: ntr
  • NLT1: ntr
  • CNAF: ntr
  • NDGF: warning about an SRM intervention next Tue-Wed.

  • Database services: as reported by CMS, CMSR2 restarted this morning. The problem was a large trace file that was being rotated; it was solved by removing the trace file. The same issue had been observed in November on ATLR. Now implementing a solution to avoid this sort of issue in the future.

AOB: none

JamieShiers - 19-Jan-2011
