(ROC CE): Two questions about availability calculation.
Could we present what fraction of unavailability periods is considered by sites as non-relevant? Site admins fill in weekly reports and record this information for each individual SAM test failure, so the data is there. In our view, this information can help identify areas for improvement in terms of availability.
Would it be possible to implement mechanisms for the automatic removal of periods during which sites failed due to monitoring-related problems like this one: SAM Result Link (a sketch of such a calculation follows this discussion)
It is hard to retrofit GridView data and correct it.
A future transport based on ActiveMQ (with buffering) will help greatly.
Osman and Marcin will contact one another to try and resolve this.
Next week Marcin will provide some clear examples of the kind of thing that needs removing. Action item added.
Sven also seconds this motion, reflecting the feeling in DECH.
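As an illustration of the calculation being requested, a minimal sketch, assuming the weekly-report data can be exported as a list of failure periods each flagged as relevant or not; the record layout and all names here are hypothetical, not GridView's actual schema:

<verbatim>
from dataclasses import dataclass

# Hypothetical record: one SAM test failure period from a site's weekly
# report, flagged non-relevant if the site attributed it to a monitoring
# problem rather than a real site failure.
@dataclass
class FailurePeriod:
    hours: float
    non_relevant: bool

def non_relevant_fraction(periods):
    """Fraction of total unavailability that sites flagged as non-relevant."""
    total = sum(p.hours for p in periods)
    flagged = sum(p.hours for p in periods if p.non_relevant)
    return flagged / total if total else 0.0

def corrected_unavailability(periods):
    """Unavailability after automatically removing non-relevant periods."""
    return sum(p.hours for p in periods if not p.non_relevant)

# Example: 10 h of failures, 4 h of which were monitoring glitches.
periods = [FailurePeriod(6.0, False), FailurePeriod(4.0, True)]
print(non_relevant_fraction(periods))     # 0.4
print(corrected_unavailability(periods))  # 6.0
</verbatim>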
gLite Release News.
gLite 3.1 Update 16 was released to production today. The update contains:
A new index on the attribute GlueServiceEndpoint, used by lcg-utils
UI: bug fixes to the JDL API (bulk submission) and GFAL clients
dCache SE: Glue 1.3 clean-ups and bug fixes
DPM SE: version 1.6.7 (32-bit and 64-bit), fixing various configuration bugs, introducing new front-ends for Xroot and HTTP/HTTPS, and upgrading gSOAP from 2.6.2 to 2.7.6b
GFAL version 1.10.8-1: creation of subdirectories with lcg-utils
(France ROC): A lesson learnt from CCRC08 is that some VOs don't check the status published by a CE queue, and so can wrongly submit to a queue with a non-Production status. At IN2P3-CC, for the purpose of a combined ATLAS-CMS test, we had set up 2 queues with status "TEST" in order to restrict access to jobs that explicitly required this status, but after a while we noticed plenty of regular ("production") jobs on those queues. Please check the queue status before submitting; it must be set to "Production" (a sketch of such a check follows this exchange).
Pierre: this was seen with both CMS and ATLAS jobs.
CMS: Can you produce a list of users?
Pierre will submit GGUS tickets to the VO when it happens again.
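As an illustration of the requested check, a minimal sketch assuming the python-ldap package and a reachable top-level BDII; the hostname and queue ID below are placeholders. Only queues publishing GlueCEStateStatus "Production" should receive regular jobs:

<verbatim>
import ldap

def queue_is_production(bdii_host, ce_unique_id):
    # Anonymous search against a top-level BDII (placeholder hostname),
    # asking for the published state of one specific CE queue.
    conn = ldap.initialize('ldap://%s:2170' % bdii_host)
    results = conn.search_s(
        'o=grid', ldap.SCOPE_SUBTREE,
        '(&(objectClass=GlueCE)(GlueCEUniqueID=%s))' % ce_unique_id,
        ['GlueCEStateStatus'])
    for _dn, attrs in results:
        status = attrs.get('GlueCEStateStatus', [b''])[0].decode()
        return status == 'Production'
    return False  # queue not found in the information system

# Example (placeholder endpoint and queue name):
# queue_is_production('lcg-bdii.cern.ch',
#                     'ce.example.org:2119/jobmanager-pbs-cms')
</verbatim>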
Upcoming WLCG Service Interventions
PIC has a long downtime next week, from Monday 17th for 4 days.
FZK is at risk on Saturday for updates to the database hosting the 3D service.
There were problems at CERN on Friday, but all were resolved promptly.
This week, T1-T1 functional tests: all T1s vs PIC and all T1s vs Lyon.
Next week, performance tests: throughput test, T0-T1-T2.
CMS Service
Data certification, T0 status and reprocessing
All activities suffered from the LSF incident (full log by CMS at CMS.FacOps-IncidentCERNLSF-Feb28Mar07, discussed with Bernd/Ulrich at the FacOps meeting - see the bottom of http://indico.cern.ch/conferenceDisplay.py?confId=30054). It was a hard week for RelVal at CERN as well (the LSF issue left CMS behind in release validations). FastSim production was proceeding fast before the problems (6k/15k proc jobs complete) and recovered soon after. --- Good progress on the StorageManager side: identified and configured the nodes to be used in the Global Run in March.
Re-processing
On CSA07 signal workflows, ~6M GEN-SIM input events have just arrived at the T1's; ~17M events were processed last week. Processing is also running at FNAL. FastSim production was finalized with CMSSW_1.6.9 (+ 2 additional tags for the config files, CMSSW_1.6.10): ~100M PDAllEvents from the 3 soups (RelVal samples). No site issues at ASGC, CNAF, FZK, PIC, RAL; at FNAL, jobs take too long due to a dCache issue, being investigated; at IN2P3, problems in the pool area left us unable to merge jobs for several days, now solved, and production is already back on schedule. --- Ran some post-CCRC reprocessing jobs with ATLAS: some lessons learned at IN2P3 and PIC (too long to report here).
MC production
~85M CSA07 Signal requested events were done and are now available for reco. 56 workflows for ~3M requested events are still to be done. Two types of problems (all CMSSW-related, so not worth mentioning here). 4 finished datasets (4M events, 1.45 TB) are subscribed but not yet transferred to any T1 MSS. --- 1 DPG workflow (2 Mevts): GEN-SIM is done; transferring. --- HLT: running (CMSSW_1_7_4, GEN-SIM-DIGI-RAW), 1 big workflow (10 Mevts) in production now, ~2 Mevts done. --- Detailed summary of current production activities at http://khomich.web.cern.ch/khomich/csa07Signal.html.
Data Transfers and Integrity, DDT-2/LT status
/Prod transfers: proceeding, 16 TB this week, no major problems. /Debug transfers: new links have been commissioned exclusively with the new DDT-2 metric since February 11th. Link exercising is proceeding, generally very successfully: 78% of the previously commissioned links had already PASSED the new metric as of March 6th. We have 285 commissioned links (as of March 6th). The breakdown (55 + 142 + 83 + 5 = 285) is: 55/56 T[01]-T1 crosslinks (only ASGC->RAL is missing); 142 T1-T2 downlinks and 83 T2-T1 uplinks, with 38 T2s having at least 1 downlink and 37 T2s having at least 1 uplink, the intersection being 35 T2s that have both; 5 T2-T2 links. The first round of testing is almost complete. Sites can take advantage of the gap before the second round to commission new links or recommission failed links. Real problems were found and fixed during exercising, and the first "success stories" in troubleshooting are being documented. --- Full details at CMSDDTLinkExercising.
SRM Version 1s Dropped from SAM
CMS also reported the problem that many SRMs had vanished from SAM testing. In brief, this was attributed to some SRM version 1 endpoints publishing:
GlueServiceType: srm
GlueServiceVersion: 1.N.0
While this is perfectly correct, the BDII2SAM script is broken and fails to recognise it.
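A minimal sketch of the kind of tolerant check the script needs, assuming the fix amounts to matching the major-version prefix rather than one exact version string; this is illustrative, not the actual BDII2SAM code:

<verbatim>
import re

# Recognise any SRM v1 endpoint by its major-version prefix, so published
# values such as "1.1.0" or other "1.N.0"-style variants all pass.
V1_PATTERN = re.compile(r'^1(\.\w+)*$')  # "1", "1.1", "1.1.0", ...

def is_srm_v1(service_type, service_version):
    # SRM v1 endpoints publish GlueServiceType "srm" (compared
    # case-insensitively here).
    return (service_type.lower() == 'srm'
            and bool(V1_PATTERN.match(service_version)))

assert is_srm_v1('srm', '1.1.0')      # accepted as SRM v1
assert not is_srm_v1('srm', '2.2')    # SRM v2 is not matched
</verbatim>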
As nobody joined the session, the OSG-GGUS issues from the Operations meeting of 2008-03-10 remain open.
Can OSG ticketing-system experts and the GGUS developers please debug this?
How does one explain that the ticket remains open despite the comments in the public diary above?
Action Items
Newly Created Action Items
| *Assigned to* | *Due date* | *Description* | *State* | *Closed* | *Notify* |
| Main.Marcin | 2008-03-19 | Marcin to produce a list of examples where a site failure is attributed to a central service failure. *Update 19th March*: Marcin supplied some examples. The problem is well understood; the solution is less obvious. John to work with the SAM & GridView team. | | | |
| | | Please look into GGUS:33850 concerning transparent downtimes affecting site availability. *Update 20/3/08*: the GridView team has fixed the bug (CVS tag gridview-synchronizer-20080318). *Update 7/4/09*: ticket and action re-opened because gstat also needs a fix. | | | |
| | | SAM team to promptly investigate the BDII2SAM script so that it recognises GlueServiceType/Version SRM/1.10 correctly. GGUS:33726, BUG:31940. *Update 13th March 2008*: !BDII2SAM script now fixed; action should be closed following the next meeting. *Update 31 March*: Script is fixed. Close. | | | |