Week of 100517

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Rod, Jean-Philippe, Steve, Carlos, Jamie, Eva, Flavia, MariaG, Ian Iven, Maarten, Ian Fisk, Harry, Alessandro, Lola, Patricia, Simone, Stephane, Roberto, Tim, Alberto, MariaD, Dirk);remote(Jon/FNAL, Gonzalo/PIC, Michael/BNL, Rolf/IN2P3, Roger/NDGF, Ron/NLT1, Rob/OSG, Brian/RAL, Gang/ASGC, Angela/KIT, Alessandro/CNAF, Massimo/CNAF).

Experiments round table:

  • ATLAS reports -
    • Reprocessing:
      • Finished and data being distributed since Sunday
      • Will add data from last week, up to 15:00 today, to the campaign
      • CERN - ASGC poor network rate solved
      • Network problem in Europe(?)
    • FTS service problems
      • Channels between SRMs with identical paths, e.g. TRIUMF-CNAF, can delete the source file
      • Good old-fashioned bug - keeps several channels offline until fixed.
      • Still occurs when checksum checking is disabled - according to the code.
      • Jobs stuck in the activated state forever
      • The only fix is to delete them manually - a log of the 'stuck' transfers is kept.
      • Checksum comparison is case-sensitive: Castor 2.1.7 produces upper-case checksums while others are lower-case (see the sketch after this report).
      • Checksum checks disabled in the UK cloud
    • PRODDISK full at some T2s (DESY-HH)
      • ProdSys-DDM integration problem when running a lot of reconstruction at T2s
      • Output is now deleted once successfully stored at the T1, but with a 1 day delay (to allow matched user jobs to succeed).
      • Need more aggressive cleanup strategy
    • SARA SCRATCHDISK full
      • user subscriptions routed via T1
      • same space used for analysis output
    • Comment from Jean-Philippe: the FTS problems are fixed in FTS 2.2.4. It will be ready tomorrow for testing at CERN and TRIUMF.
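    A minimal sketch, in Python, of the case-insensitive comparison referred to in the checksum item above (illustrative only, not the actual FTS code): Adler32 checksums are hexadecimal strings, so case carries no information and both values can simply be normalised before comparing.

        def checksums_match(src: str, dst: str) -> bool:
            """Compare checksum strings such as 'AD:89ABCDEF' vs 'ad:89abcdef'."""
            return src.strip().lower() == dst.strip().lower()

        # CASTOR 2.1.7 reports upper-case checksums, most other SRMs lower-case
        assert checksums_match("AD:89ABCDEF", "ad:89abcdef")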

  • CMS reports -
    • T0 Highlights
      • CMS hit full utilization of the Tier-0 CPU over the weekend. The first run of 2 x 2 at reasonable intensity had a trigger menu with a high rate.
      • CMS was just warned of a data loss incident in Castor due to a misconfiguration at the end of April after an upgrade. Waiting for the full list of impacted files (probably about 800 files), but expecting to be able to recover critical files from Tier-1 sites.

    • T1 Highlights
      • CMS is switching to multiple primary datasets at the end of the week. Currently there is one for Minimum Bias. This is placing a heavy load on the site with custodial responsibility for this dataset (KIT). The load is especially heavy since local stage-out to SE storage is performed with lcg-cp, which puts load on the SRM. Considering using local protocols instead (see the sketch below).
      • Ticket opened to RAL.
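    A hedged sketch of the two stage-out options mentioned above; only the basic "lcg-cp <source> <destination>" form is assumed, and the helper, paths and the choice of local copy tool are placeholders, not CMS code.

        import subprocess

        def stage_out(local_file, srm_url, local_copy_cmd=None):
            """Copy one job output file to the SE, preferring a local protocol if one is given."""
            if local_copy_cmd:
                # site-local copy tool (placeholder): avoids an SRM round trip per file
                cmd = list(local_copy_cmd) + [local_file, srm_url]
            else:
                # SRM-based copy: every call adds load on the site's SRM front end
                cmd = ["lcg-cp", "file://" + local_file, srm_url]
            subprocess.check_call(cmd)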

  • ALICE reports -
    • GENERAL INFORMATION: Pass1 and Pass2 reconstruction activities together with two analysis trains and 2 MC cycles. In general, stable behavior of the Grid resources during the last four days of holiday.
    • T0 site
      • On Saturday, the update of ce201 was announced to ALICE by the site admin (migration of the system to SL5). The system was successfully tested and put back in production
      • Ticket GGUS_58234: Some of the ALICE data written to CASTOR in the first week of May was found missing last Thursday. The lost data belong to the 900 GeV pool. The issue was observed when replication of the raw data to the T1s was starting. (The data had already been reconstructed and the ESD data are preserved; the problem affects only the raw data.) CASTOR and ALICE experts are investigating the issue.
      • On Sunday morning the ALICE home directory of voalice13 (central VOBOX) was full. This also created some problems with the delegated proxy of the AliEn user registered with this VOBOX. The issue was solved in a few hours.
      • Night of the 14th to the 15th, from 03:00 to 07:00: no registration of jobs in ML (issue not affecting the production).
    • T1 sites
      • FZK: A restart of all services at the site was needed this morning (bad behavior of the local PackMan service). Services are back in production. Agents have already been submitted to the site; however, submission to the local CREAM-CE cream-1-fzk.gridka.de is extremely slow, leaving the jobs in status REGISTERED for a very long time. GGUS ticket: 58267. Comment from Angela: it seems OK, but there are a lot of jobs. Comment from Patricia: the ticket was submitted because job submission was taking 30 seconds instead of 2 seconds; this could be due to the total number of jobs in the CREAM CE queue.
      • CCIN2P3: Submission through CREAM stopped: authentication problems with the local AliEn user (obsolete CAs? under investigation). Reported to the ALICE experts at the site.
    • T2 sites
      • Deprecation of the 2nd VOBOX in Kolkata. The site will continue production with a single VOBOX.
      • Investigation of the local VOBOX in Cape Town: the proxy renewal mechanism is not working. Following the issue directly with the site admin.

  • LHCb reports -
    • 13-14-15-16 May 2010 (long Ascension weekend)
      • Experiment activities:
      • Long weekend with many intense activities. Heavy activity from the pit, with about 3 TB (~3K files) of data taken as of Sunday morning (amount expected to increase by Monday). A huge number of MC simulation jobs and many user jobs too. Stripping-03 started while Stripping-02 on previous data runs until completion. Two plots summarize the number of jobs (dominated by MC) and the transfer of data from the pit: activity.png, fromPIT.png
GGUS (or RT) tickets:
  • T0: 0
  • T1: 2
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • Serious Incident Report for data loss in Alice raw data recording at https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsAliceRecycled14May2010. Full impact is under investigation along with checking the other Castor instances.
    • Piquet call for high load on Castor CMS on Thursday due to a test of t0streamer and t0test.
    • On Friday night the gLite WMSes (wms203 and wms216) were reporting status Running (glite-wms-job-status) for all jobs submitted against CREAM CEs, while the LB was correctly reporting these jobs as Done. This is not a show stopper but, considering the high load put on the system, it makes the pilot monitoring system's life more difficult. Offline discussions with Maarten on Saturday showed these problems to be due to known bugs for which Savannah tickets were already open, in particular bug 63109.
    • Received an lfc_noread alarm on Saturday evening at around 20:00. When the problem was checked at 20:30 there was no impact on users and the service was working fine. SLS and SAM reported some degradation of availability (and a huge number of active connections) around the time the alarm was raised.
    • On Friday evening various users' jobs were found failing because of an AFS shortage serving the application area. The jobs were just hanging and were eventually killed by the watchdog for consuming too little CPU.
  • T1 site issues:
    • SARA: SRM down (gSOAP error message), preventing all activities from going through. An ALARM ticket was submitted (daytime) on Saturday at noon. No progress (as of Sunday morning) since the acknowledgement received from Jeff as soon as it was opened (GGUS:58244).
    • RAL: all transfers started to time out, turning into a fairly significant degradation of the service. This might be related to the shortage of space on the M-DST and MC-M-DST space tokens (less than 2.5 TB free out of 69 TB allocated) (GGUS:58253). The problem was found to be due to a faulty disk server removed on Saturday, which dropped the amount of available space. This had not been announced by RAL at first; the problem was announced by RAL on Sunday. 300 files need to be recovered from the old server. RAL would like to get a better email contact for LHCb.
    • IN2P3: shared area issues. CPU capacity not published correctly (normalisation problem).
    • KIT: according to LHCb, signals were sent to the jobs by the local batch system. Angela says no but the problems could be due to NFS.
  • T2 site issues:
    • USC-LCG2 too many pilots aborting; INFN-NAPOLI shared area problems

Sites / Services round table:

  • FNAL: NTR
  • PIC: NTR
  • BNL: NTR
  • NLT1: certificate issues fixed. A disk server crashed on Saturday but seems to be OK now.
  • RAL: consistency check has been run for ATLAS. Some dark data found. ATLAS should now say what to do about them. RAL would like to increase the number of FTS streams between BNL and RAL. New value could be 20-25. Agreed by Michael.
  • CNAF: Atlas Condition DB to be migrated. Waiting for the outcome of the Atlas meeting this afternoon.
  • OSG: one of the BDIIs down last week because of DNS problem. Will be back today at 2PM.
  • ASGC: timeouts in transfers: routing issue + network issue in Frankfurt area.
  • KIT: problem with one of the file servers for Atlas.

  • CERN:
    • Resolved OPN networking issue which was affecting transfer quality from CERN to ASGC for ATLAS and CMS. GGUS:58192.
    • High load for CMS on CASTOR because of a T0 test. CMS will probably replace stager-get by rfcp
    • High load on the ATLAS DB. It will be solved when the new hardware is installed.
    • 3 SIRs must be produced:
      • GGUS problem on Wednesday (due to DNS)
      • network issue in Frankfurt area
      • data loss in CASTOR at CERN
    • Registration for the Jamboree in Amsterdam in mid-June has been opened.

AOB:

Tuesday:

Attendance: local(Rod, Jean-Philippe, Alessandro, Ian Fisk, Dirk, Gavin, Nilo, Eva, Ian Iven, Harry, MariaG, Simone, Jamie, Akshat, Patricia, Roberto, Maarten, Flavia, German, MariaD);remote(Jon/FNAL, Gonzalo/PIC, Michael/BNL, Rolf/IN2P3, Rob/OSG, Tiju/RAL, Jeremy/GridPP, Ronald/NLT1, Gang/ASGC, Xavier/KIT, Jens/NDGF, Pepe/CMS, Massimo/CNAF, Alessandro/CNAF).

Experiments round table:

  • ATLAS reports -
    • CERN Castor data loss
      • 10k files were in lost list
      • 1500 are test files. A few DAQ files - not yet known how important.
      • Distributed RAW is all recoverable from T1
      • Some RAW files are not distributed, e.g. express stream, single beam, and thus lost
    • IN2P3-CC DATADISK almost full
      • reprocessed data
      • single 20TB dataset
      • expedite cleaning previous reprocessing versions

  • CMS reports -
    • [Data Ops]
      • Tier-0: data taking.
      • Tier-1: run backfill jobs at all sites until new requests come
      • Tier-2: LHE production at T2s.
    • [Facilities Ops]
      • Improving procedures and documentation for Computing Run Coordinator. CRC reporting to daily WLCG Operations calls: instructions on "how to" compile the report are ready.
      • Following up KIT SE load issues and suggestions on how to handle the CMS SAM tests, which were timing out due to the high load.
      • How to protect CMS production namespace areas is under revision. The implementation of the proper permissions is up to the site, and probably depends on the storage element technology being used.
      • Note 1: Sites should provide SL5 UIs/VOBOXes for CMS to run PhEDEx soon. By the end of June it will be mandatory, as PhEDEx_3_4_0 will be an SL5-only release. We ask the sites to provide SL5 UIs/VOBOXes by that deadline.
      • Note 2: The CMS VO card (CIC portal) is to be upgraded soon.

    • T0 Highlights
      • Checking the CMS data loss list from Castor. CMS was lucky that the period with the bug was mostly during the LHC technical stop. The files lost were primarily binary streamer files from the detector, which are replaced by ROOT files within a few hours of data collection. These files are not recoverable because they are not transferred to Tier-1s, but the information is available in the RAW ROOT files that were not lost.
    • T1 Highlights
      • CMS is switching to multiple primary datasets during this week because the machine is continuing physics running at night. We are working through the backlog of prompt skimming at KIT. This backlog is caused by sending all the data to a single primary dataset; we expect to complete it in a day or so.
    • T2 Highlights
      • CMS will be starting some large-scale testing of pile-up simulation at the Tier-2s. We will be trying to fill some Tier-2 farms with jobs that use 3 and 10 pile-up events per crossing. This is a high I/O load.

  • ALICE reports -
    • GENERAL INFORMATION: Pass1 and Pass2 reconstruction activities together with two analysis trains and 2 MC cycles.
    • T0 site
      • Latest news on GGUS_58234 (some of the ALICE data written to CASTOR in the first week of May was found missing last Thursday): the ALICE experts have provided the CASTOR team with a list of files to be considered for recovery.
      • GGUS_58133 (bad behavior of ce201.cern.ch): Solved and verified. Ticket closed.
    • T1 sites
      • Minor manual interventions required today at RAL and CCIN2P3 (2nd VOBOX only. configuration errors in LDAP) to put the sites back in production
    • T2 sites
      • Prague and RRC-KI: Both sites have already migrated to CREAM1.6 (gLite3.2). Stress tests ongoing at both sites to determine the good behavior of the system using CERN credentials

  • LHCb reports -
    • Experiment activities: NIKHEF and RAL transfer backlog recovered in spectacular fashion.
    • Issues at the sites and services
      • T0 site issues:
        • The lfc_noread alarm received on Saturday evening has, after some digging in the logs, been found to be mainly due to an incompatibility between IPv6 support and DNS load balancing as implemented at CERN: almost always the same front-end box was picked, leaving not enough threads on lfclhcbr01 (see the sketch after this report).
      • T1 site issues:
        • IN2P3: opened a GGUS ticket because ~10% of the jobs were failing yesterday with shared area issues (GGUS:58283).
        • RAL: we need to re-stage the rest of the files that were on the old (faulty) disk server to another disk server, to allow transfers out of RAL to proceed and to allow users to access them. Despite the impressive throughput at SARA and RAL, we still see many failures out of RAL because the source files are supposed to be available on a (disabled) disk server, so CASTOR does not issue a stage from tape. Hence many transfers (all those that look for a file on the removed disk server) are failing with the error "SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] source file failed on the SRM with error [SRM_FAILURE]". We know Shaun moved all migration candidates off to different servers in the same service class and that the bad disk server has been disabled again. LHCb would like to ask for a detailed analysis report of what happened since Saturday.
      • T2 sites issues:
        • PDC shared area issue; USC-LCG2 too many pilots aborting
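    A small sketch of the effect behind the lfc_noread finding above; the alias name and port are placeholders and the exact CERN setup is not described in the report. With IPv6-aware resolvers, client-side address sorting (RFC 3484/6724) can defeat DNS round-robin, so a naive client keeps hitting the same front end.

        import socket
        from collections import Counter

        def first_choice(host, port=5010, samples=20):
            """Count which resolved address a naive client would connect to first."""
            picks = Counter()
            for _ in range(samples):
                infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
                picks[infos[0][4][0]] += 1   # many clients simply use infos[0]
            return picks

        # If the resolver sorts the returned records, the same address can come
        # first every time and the round-robin balancing intended by the DNS
        # alias is lost on the server side.
        # print(first_choice("lfc-lhcb-ro.example.org"))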

Sites / Services round table:

  • FNAL: NTR
  • PIC: NTR
  • BNL: NTR
  • IN2P3: NTR
  • RAL: NTR
  • GridPP: NTR
  • NLT1: NTR
  • ASGC: NTR
  • KIT: NTR
  • NDGF: NTR
  • CNAF: intervention agreed with Atlas for tomorrow at 08:00

  • CERN
    • German: tapes in the recycle pools for ATLAS and CMS have been identified: 8 for the former and 2 for the latter. Trying to recover the data for ALICE, but as the tapes have been partially overwritten, it will be difficult. Trying anyway.

AOB: (MariaDZ) Opened https://savannah.cern.ch/support/?114518 to record progress on the SIR requested yesterday from GGUS on the .de DNS blackout of 2010.05.12. Is it necessary to register GGUS in 2 domains as the DNS problem occurs very rarely?

Wednesday

Attendance: local(Rod, Jean-Philippe,Oliver/CMS, Harry, Simone, Alessandro, Steve, Eva, Nilo, Ian Iven, Julia, Lola, Maarten, MariaD, Patricia, Akshat);remote(Jon/FNAL, Angela/KIT, Michael/BNL, Joel/LHCb, Gonzalo/PIC, Gang/ASGC, Rolf/IN2P3, Rob/OSG, Ron/NLT1, Massimo/CNAF, Tiju/RAL, Jens/NDGF).

Experiments round table:

  • ATLAS reports
    • CERN Castor data loss
      • Tape pool large enough that no recyclable tape was overwritten
      • All data recovered
    • May Reprocessing part II (week to 17th data) - preliminary plan
      • awaiting merge ESD trf so that ESD is large enough to write to tape
      • re-run part I ESD->AOD step in order to merge ESD there too
      • big hit on DATADISK space which is tight
        • accelerating cleanup efforts - reduce #replicas, delete old version
    • Question from Michael: when is part II going to start? It could be tomorrow with the existing transformation or on Monday with the new transformation. A note will be sent to sites.

  • CMS reports -
    • T0 Highlights
      • Follow up on data loss at CERN Castor: all lost CMS files have been recovered and are accessible again
      • Plan: upgrade T0 processing system, change software version and move to 8 primary dataset trigger table before stable running on the weekend
      • Problem with zero length files in CAF. Protection put in place.
    • T1 Highlights
      • Prompt skimming backlog at KIT: expect heavy load operation to continue for several more days, important skims for physics already caught up
      • The situation will improve after deployment of the multiple primary dataset trigger table, by spreading the load over all 7 T1 sites, and also by deactivating the commissioning skims which are not needed anymore.
      • ReReconstruction operational test at FNAL at 95% and expected to be finished soon. Ran over all 2010 data collected so far. Expect to repeat that in production mode several times before ICHEP in July.
      • FTS transfer problems CERN-RAL now ok.
      • FTS transfer problems CNAF-CERN being looked at.
    • T2 Highlights
      • MC production as usual
      • Preparation for large scale pile-up simulations at T2 sites almost complete, expect jobs with higher than normal I/O load at sites soon.

  • ALICE reports -
    • GENERAL INFORMATION: Low number of jobs currently in production. The latest MC cycle is completed, therefore some unscheduled user analysis and Pass1 reconstruction jobs are the current production activities.
    • T0 site
      • No issues observed. Both submission backends currently working
    • T1 sites
      • SARA: Ticket submitted yesterday concerning the local CREAM-CE system (authorization problems at submission time) has been solved and verified this morning. The system is back in production
      • The rest of T1 sites are all in production
      • Transferring files to T1s (NDGF for example)
    • T2 sites
      • Wuhan: Wrong information provided by the local CREAM-CE, announced yesterday during the ALICE TF Meeting.
      • Instabilities expected today at ITEP while configuring the 2 available VOBOXes in failover mode.

  • LHCb reports -
    • In the last 24 hours 30K jobs ran (20K from users, the rest MC production and reconstruction/reprocessing). No major problems to report. Going to update the VO ID card: the max CPU time sites should provide should go from the current 12000 HS06 minutes to 18000 HS06 minutes, in order to properly hold reconstruction jobs with a 3 GB input file, currently throttled to 2 GB because of the CPU time limitation (a worked normalisation example follows this report). NIKHEF has already updated. Thanks.
    • Issues at the sites and services
      • T0 site issues:
        • none
      • T1 site issues:
        • IN2P3: opened a GGUS ticket to request the correct publication of a new GlueSchema variable to be used to normalize CPU time: CPUScalingReferenceSI00.
        • IN2P3: the GGUS ticket opened because ~10% of the jobs were failing with shared area issues (GGUS:58283). Both IN2P3 tickets are being worked on.
        • RAL: last night Shaun re-staged all data of the faulty disk server to another disk server to make them available to users. Storage elements at RAL have been re-enabled for users. Any news from RAL concerning the requested SIR? The SIR has been provided and is linked to the minutes.
        • PIC: 10 minutes intervention to restart SRM (new postgres DB). Transparent.
      • T2 sites issues:
        • UK sites uploading MC job output are timing out against many T1s. Firewall issue. The contact person in the UK is following this up.
        • GRIF and UFRJ-JF problems (too many jobs)
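    A worked sketch of the CPU-time normalisation behind the new limit; only the 18000 HS06-minute figure comes from the report, while the per-core rating and the 1 HS06 = 250 SI2K conversion (for sites publishing CPUScalingReferenceSI00) are quoted here as assumptions.

        REQUESTED_LIMIT_HS06_MIN = 18000.0     # new LHCb VO ID card request

        def queue_limit_minutes(hs06_per_core):
            """CPU-time limit a queue must offer, in real minutes on that hardware."""
            return REQUESTED_LIMIT_HS06_MIN / hs06_per_core

        def si2k_to_hs06(si00):
            """Convert a published CPUScalingReferenceSI00 value (SpecInt2000) to HS06."""
            return si00 / 250.0

        # e.g. a core rated at 10 HS06 must allow ~1800 CPU minutes (30 hours)
        print(queue_limit_minutes(10.0))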

Sites / Services round table:

  • FNAL: NTR
  • KIT: NTR
  • BNL: network problem yesterday (routing loop). The service was down for one hour. The problem was due to a reboot of a router which triggered a routing recalculation and the loop produced had to be broken manually.
  • PIC: NTR
  • ASGC: NTR
  • IN2P3: NTR
  • NLT1: next week there will be a migration of one set of pool nodes to another set, to replace a batch of bad disk servers. Throughput will be reduced but there will be no downtime.
  • CNAF: NTR
  • RAL: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100515_Disk_Server_Outage
  • NDGF: shutdown of FTS today because of the FTS bug (source removal). The new release has been certified but there are still packaging issues. It will be tested by Steve/CERN as soon as the release is available.
  • OSG: NTR

  • CERN
    • still trying to recover files for ALICE, but tapes have been (at least partially) overwritten so need vendor support.
    • ATLAS archive DB has been migrated to new hardware.

AOB:

Thursday

Attendance: local(Simone, Rod, Lola, Eva, Oliver, Harry, Jamie, MariaG, Maarten, Roberto, Ian Iven, Ricardo, Alessandro, Akshat, MariaD, Nilo);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Elisabeth/OSG, Massimo/CNAF, Ronald/NLT1, Jeremy/GridPP, Tiju/RAL, Angela/KIT, Rolf/IN2P3, Jens/NDGF, Gang/ASGC).

Experiments round table:

  • ATLAS reports -
    • May Reprocessing part II
      • RAW to ESD stage started 14:00
      • 30% size of May repro I
    • ESD merging step will output to T1 TAPE
    • Castor ggus ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=58368
      • They look like source errors, but CERN says they are destination errors.
      • Multiple destinations are involved, and synchronized.
      • Maybe attach an FTS log to convince us.
      • Comment from Ian: the problem is that sometimes the destination SE takes more than 3 minutes to return a TURL, by which time the CASTOR source TURL is no longer valid. The failure rate is at most 1 percent. However, the timeout in FTS for this does not seem to work, which keeps the channel busy for nothing. Hopefully fixed in FTS 2.2.4.

  • CMS reports -
    • T0 Highlights
      • The new trigger table with 8 primary datasets was put into operation; every CMS T1 site now custodially receives one primary dataset as new events are taken.
      • Plan: upgrade T0 processing system, change software version before stable running on the weekend
    • T1 Highlights
      • Prompt skimming backlog at KIT: expect heavy load operation to continue for several more days, important skims for physics already caught up
      • ReReconstruction operational test at FNAL in tails.
    • T2 Highlights
      • MC production as usual
      • Preparation for large scale pile-up simulations at T2 sites almost complete, expect jobs with higher than normal I/O load at sites soon.

  • ALICE reports -
    • GENERAL INFORMATION:
      • Production: Large number of reconstruction jobs currently running in the system together with 2 analysis trains (around 5000 concurrent jobs in the system)
      • Transfers: the current transfer balance among the different T1 sites has to be redefined by next week. In the last 24h ALICE has transferred around 7 TB of data to NDGF, RAL, CNAF and FZK with a good level of efficiency (http://dashb-alice.cern.ch/data/fts/20-May-10.html). Average transfer speed: 45 MB/s.
    • T0 site
      • Good balance between the CREAM-CE and the LCG-CE backend, over 1500 concurrent jobs
      • Due to the inefficiencies still observed in a subset of the dedicated ALICE CAF nodes, the experiment has requested the replacement of these nodes. The motivation is the importance of these nodes for the experiment's timely data analysis.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Grenoble: Site out of production. The local batch system runs into timeouts while querying the local DB. GGUS:58382.
      • INFN-Cagliari: authentication problems with the local CREAM-CE; site out of production. GGUS:58384.
      • Russian federation: SPBSU and ITEP are testing their local CREAM 1.6 services.

  • LHCb reports -
    • Experiment activities: New requirements for MaxCPUTime formulated in the VO ID card. Currently running productions are affected by a severe application problem that makes jobs crash. Urgent intervention from the core application people is needed.
    • Issues at the sites and services
      • T0 site issues:
        • LFC-RO timing out requests again (GGUS:58380): a user is still using the old Persistency-LFC interface.
      • T1 site issues:
        • IN2P3: GGUS ticket opened because ~10% of the jobs were failing yesterday with shared area issues (GGUS:58283).
        • PIC: our contact person informed us that the SRM had problems between 11:30 pm and 3:30 am (CET) due to a disk controller. The problem should have recovered automatically, but a manual intervention was actually needed. Looking into the reason.
      • T2 sites issues:
        • INFN-PADOVA: shared area problem

Sites / Services round table:

  • FNAL: NTR
  • BNL: NTR
  • PIC: NTR
  • CNAF: ?
  • NLT1: FTS full table space. Space manually increased and now set in automatic increase mode.
  • GridPP: T2s behind NAT in UK seem to encounter problems. If T2s using NAT outside of UK see also problems, Jeremy would be interested to know.
  • RAL: NTR
  • KIT: NTR
  • IN2P3: NTR
  • NDGF: NTR
  • OSG: NTR
  • ASGC: NTR

  • CERN: Too much activity on a CASTOR CMS pool last night. This is temporary because a new pool had been added and data was copied between pools.

AOB:

  • Please note that next Monday is holiday at CERN.

Friday

Attendance: local(Rod, Steve, Jean-Philippe, Harry, Eva, Oliver, MariaG, Jamie, Lola, Maarten, Roberto, Akshat, Jan Iven, Simone, Stephane, Alessandro, Flavia);remote(Jon/FNAL, Michael/BNL, Onno/NLT1, Jeremy/GridPP, Rolf/IN2P3, Massimo/CNAF, Xavier/KIT, Gonzalo/PIC, Gang/ASGC, Rob/OSG, Tiju/RAL).

Experiments round table:

  • ATLAS reports -
    • CERN AFS
      • 2000 running analysis jobs crippled the volume(s) for other uses
      • probably responsible for the DB sessions alarm
      • data subscription scripts have an AFS dependency
      • in tests, IT pointed out that the load was on the read-write volume, and not on the read-only one as it should be
      • investigation of the SW setup confirms this: the SW install in /afs/.cern.ch keeps these paths, but it should use /afs/cern.ch
      • found instructions to relocate properly and working on it (see the sketch after this report)
      • in the meantime, analysis throttled
    • May Repro II
      • PIC tasks delayed by missing DBRelease
      • T0-PIC channel very low rate, 1 MB/s per file. All transfers were to a single empty pool, due to cost calculation.
      • ASGC repro tasks assigned automatically - many fail on staging in the DBRelease to the WN
      • the same file is used by all jobs, hence it is in the HOTDISK token
      • it should be replicated to several disk servers, but was not at ASGC
      • nevertheless, the good news is that ASGC is online for reprocessing now (more disks, more replicas)
    • FZK-NDGF channel has problems
      • very difficult for an experiment to coordinate a fix to this. COULD CERN HELP PLEASE?
      • We get the attention of the 2 T1s, the T1s blame each other, the thread dries up... and repeat.
    • FTS delete source
      • TRIUMF installed patched version and re-opened affected channel(SFU). Looks good.
      • Not yet validated presumably, so will ask INFN-T1 after this (next week).
      • also NDGF with lower priority as only internal T2-T1 problem
    • INFN-T1 SRM problem overnight
      • MCDISK full
        • disabled DDM transfers, but those in the system continued
        • high failure rate on MCDISK, seems to prevent read/write on other space tokens
        • explanation last time this happened was mysql load
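    A hedged sketch of the AFS path rewrite implied in the report above; the helper and the idea of editing setup scripts in place are illustrative, and the actual ATLAS relocation instructions are not reproduced here.

        from pathlib import Path

        RW_PREFIX = "/afs/.cern.ch/"   # read-write volume path found in the setup scripts
        RO_PREFIX = "/afs/cern.ch/"    # replicated read-only path that should be used

        def relocate(setup_script):
            """Rewrite RW AFS paths to the RO replica path; return the number of replacements."""
            path = Path(setup_script)
            text = path.read_text()
            hits = text.count(RW_PREFIX)
            if hits:
                path.write_text(text.replace(RW_PREFIX, RO_PREFIX))
            return hits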

  • CMS reports -
    • T0 Highlights
      • all T1 sites approved transfer requests for the new primary datasets
      • Upgrade T0 processing system this afternoon, change software version before stable running on the weekend
      • Another piquet call about overloading LSF during the T0EXPRESS migration; experts are in the loop and continue to adapt the procedures (still due to the pool-to-pool copy).
    • T1 Highlights
      • Prompt skimming backlog at KIT: expect heavy load operation to continue for today
      • A request for 50 million MinBias events will be run at the T1 level, as the T2 level is saturated with other MC requests.
      • expect to start pre-production for another re-reconstruction pass today, complete pass will start on Monday
    • T2 Highlights
      • MC production as usual
      • Starting large scale pile-up simulations at T2 sites, expect jobs with higher than normal I/O load at sites.
      • Transfer problems PIC to T2s because of a firewall issue and also because a single pool is used.
      • Comment from Simone: there are also transfer problems from CERN to PIC (because of the dCache cost calculation, a single pool is used). This problem is being discussed with the dCache developers and could also be discussed among the T1s using dCache.

  • ALICE reports -
    • GENERAL INFORMATION: In terms of job activity, most of the tasks currently running on the system come from specific analysis tasks rather than from reconstruction jobs (very low production at this moment). A new set of MC cycles is expected during the weekend.
      • In terms of transfers, an average rate of around 44 MB/s has been maintained during the last 24h, with peaks over 70 MB/s.
    • T0 site
      • Yesterday high traffic was found on about 1/3 of the ALICE CAF nodes, affecting the outer perimeter firewall performance. The security team proposed to define special exceptions (HTAR) for the proposed port, 1094/tcp. In order to ease the firewall procedure, a new LANDB set containing the corresponding nodes was created this morning and the CDB configuration has also been modified. Thanks to PES and the Security team for their help.
    • T1 sites
      • Very low production for the moment at these sites. An increase in the number of jobs is expected for the weekend with new MC cycles.
    • T2 sites
      • Grenoble: GGUS-58382 submitted yesterday: SOLVED
      • INFN-Cagliari: GGUS-58384 submitted yesterday: SOLVED
      • Russian federation: SPBSU test results for the local CREAM 1.6 service: no issues observed. Service in production.

  • LHCb reports -
    • Experiment activities: 13 bunch crossings are expected this weekend, which means LHCb will get as much data as they already have for 2010. Problems with data processing are due to DaVinci being sensitive to new releases of SQLDDDB when it shouldn't be. This has to be fixed today. Ongoing processing will be stopped.
    • Issues at the sites and services
      • T0 site issues:
        • One of the service classes in CASTOR (serving the FAILOVER and DEBUG space tokens) was so heavily overloaded yesterday evening that it triggered concurrent worries from both Jan and Andrew. This is due to an unexpected load that LHCb started to put on this token, in turn due to too many jobs crashing because of the problem with the DaVinci application. Suspiciously, there is a concurrent problem also affecting many users' activities accessing data in other space tokens (M-DST, M-MC-DST). More information will be provided.
      • T1 site issues:
        • IN2P3: GGUS ticket opened because ~10% of the jobs were failing yesterday with shared area issues (GGUS:58283). The problem has been reproduced; the plan is to increase reliability by adding read-only volumes. Rolf thinks the issue could be due to an AFS cache management problem.
        • SARA: the dcap port crashed, causing many jobs to fail to access data (GGUS:58396); monitoring should be improved.
        • SARA: the WMS was erroneously matching queues not supporting LHCb (low-memory queues at RAL, causing many user jobs to crash) (GGUS:58399). The top BDII at SARA was not refreshing its views and was publishing these RAL queues as supporting LHCb. The problem was fixed by restarting the top BDII (see the sketch after this report).
        • RAL: request to increase the current limit of 6 parallel transfers allowed in FTS for the SARA-RAL channel, in order to clear the current backlog, which is draining too slowly.
        • CNAF: some FTS transfers seem to fail with the error "SOURCE error during TRANSFER_PREPARATION phase: [INVALID_PATH] Requested file is still in SRM_SPACE_AVAILABLE state!"
        • PIC: the SRM seemed to be down, with all attempts to contact the endpoint failing with the error: [SE][srmRm][] httpg://srmlhcb.pic.es:8443/srm/managerv2: CGSI-gSOAP running on volhcb20.cern.ch reports Error reading token data: Connection reset by peer (GGUS:58430). The issue went away by itself.
      • T2 sites issues:
        • INFN-TO and GRISU-SPACI-LECCE shared area problem
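    A sketch of the kind of query that exposes the BDII publication problem above; the host is a placeholder, and the GLUE 1.3 attribute names and BDII port 2170 are assumptions based on the information system in use at the time.

        import subprocess

        def lhcb_queues(bdii_host):
            """Return the GlueCEUniqueIDs of queues advertising support for LHCb."""
            out = subprocess.check_output([
                "ldapsearch", "-x", "-LLL",
                "-H", "ldap://%s:2170" % bdii_host,
                "-b", "o=grid",
                "(GlueCEAccessControlBaseRule=VO:lhcb)",
                "GlueCEUniqueID",
            ])
            return [line.split(b":", 1)[1].strip().decode()
                    for line in out.splitlines()
                    if line.startswith(b"GlueCEUniqueID:")]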

Sites / Services round table:

  • FNAL: NTR
  • BNL: NTR
  • NLT1: NTR but why is FTS down? /var full? GGUS ticket opened
  • GridPP: NTR
  • IN2P3: NTR
  • CNAF: NTR
  • KIT: NTR
  • PIC: NTR
  • ASGC: NTR
  • RAL: NTR
  • OSG: NTR

  • CERN: NTR

AOB:

  • Please note that next Monday is holiday at CERN.

-- JamieShiers - 13-May-2010
