Week of 050919

Open Actions from last week:
  • Check that the correct SEs are published in the production BDII (Patricia)
  • Need to get procedures for dCache (Maarten) DONE
  • Publish info into Wiki about info system (James/Gavin) DONE
  • QF for BDII, and a guide for a Tier-2 or experiment to set it up (Gavin)
  • Add caching to the BDII information provider for FTS if the lookups are expensive, e.g. channel lookups (Gavin) - a minimal caching sketch follows this list
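For the BDII caching item above, here is a minimal sketch of the idea in Python, assuming a hypothetical query_fts_channels() standing in for the expensive FTS channel lookup; the real information provider would substitute its own query and tune the TTL.

    # Minimal sketch of a time-based cache around an expensive lookup.
    # query_fts_channels() is a hypothetical stand-in for the costly FTS
    # channel query; the channel names below are illustrative only.
    import time

    _CACHE_TTL_SECONDS = 300          # refresh the channel list every 5 minutes
    _cache = {"stamp": 0.0, "value": None}

    def query_fts_channels():
        """Placeholder for the expensive call (e.g. a database query)."""
        return [{"channel": "CERN-RAL", "state": "Active"},
                {"channel": "CERN-IN2P3", "state": "Active"}]

    def get_channels_cached():
        """Return the channel list, re-querying only when the cache is stale."""
        now = time.time()
        if _cache["value"] is None or now - _cache["stamp"] > _CACHE_TTL_SECONDS:
            _cache["value"] = query_fts_channels()
            _cache["stamp"] = now
        return _cache["value"]

    if __name__ == "__main__":
        for ch in get_channels_cached():
            print(ch["channel"], ch["state"])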

Actions on Hold:

  • Need to restart the Castor2 DB to apply a patch - schedule for the gap between ALICE and CMS (Vlado) SCHEDULED, WAITING 4th OCT
  • Get "dev-kit" for FTS API to write LFC-FPS (Paolo, David) DONE

On Call: James + Sophie

Monday:

Log: The cmsprod pool filled up due to 25,000 tape recalls.

New Actions:

  • Test the GC triggering in the DB for the WAN pool (Olof) DONE
  • Check how the old PhEDEx pool nodes are now configured (Olof) DONE
  • Check with Nilo/Eric on the DB intervention - schedule for next week (Olof)
  • Separate out one node for SLC4 deployment (JanVE/James) DONE
  • Discuss with Lassi how much data he will stage and work out where it goes (Olof) DONE

Discussion:

  • CMS will start this week - ALICE will also do some data movement this week.

Tuesday:

Log: nothing to report

New Actions:

  • Jan/Sophie: get LFC DLI ports open. DONE

Discussion:

  • ATLAS LFC issues - we'll open the port.
  • QF for the BDII and channel state expected today/tomorrow.
  • MyProxy problems - it is not possible to have both a renewable and a retrievable proxy for the same user on a single MyProxy server.

Wednesday

Log: CMS have seen problems with castorgridsc - all transfers are hanging. Also, support mails to Shiva don't seem to be getting through.

Actions:

  • Arrange a meeting with GSSDATLAS on LFC production plans (Simone) (week of 3rd October)
  • Arrange CMS meeting (Jamie) (3.30) DONE
  • Check why mails don't get into SHIVA (James/Zdenek) DONE
  • Check with olof on Castor2 problems (James/Jan) DONE

Discussion:

  • gLite QF RPMs today
  • lxshare220d will be for SLC4

Thursday

Log: Problems with the Oracle DB last night.
  • The Castor Oracle logging node reported a file corruption at the OS level and stopped the instance.
  • At 19:00 the stager DB stopped - the load became very high and the instance halted. It could not be restarted without a hard reset, which triggered a filesystem rebuild and a long recovery time.
  • At 03:00 the same problem occurred.

Actions:

  • Castor2 upgrade next week - check with Sebastien (delayed to week of 3rd October) (Olof)
  • Operators acted incorrectly on an LCG_MON_GRIDFTP alarm (lxshare025d) - need to check the procedures (Vlado) DONE

Discussion:

  • There were three problems on Wednesday:
    • The GC trigger was blocking filesystems that had been selected as destinations for files being copied into the pool. FIXED
    • Error in the SRM - when a tape recall is involved in an SRM get, there is a time window before the system knows about the file, so the first getRequestStatus fails. CODE TO TEST
    • Problem with pool imbalances - J-D added a new policy that weights filesystem selection by the size and available space of each FS. FIXED - see the weighting sketch after this list
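On the pool-imbalance fix above, a minimal sketch of one way a size-and-free-space weighting could look, with purely hypothetical filesystem names and figures; the actual policy J-D added lives inside the CASTOR stager and may well differ in detail.

    # Hypothetical sketch: choose a destination filesystem with probability
    # proportional to a weight that favours large filesystems with plenty of
    # free space, so nearly full filesystems are strongly de-prioritised.
    import random

    # filesystem -> (total_gb, free_gb) - illustrative numbers only
    filesystems = {
        "fs01": (2000, 150),
        "fs02": (2000, 900),
        "fs03": (1000, 700),
    }

    def weight(total_gb, free_gb):
        """Capacity times the square of the free fraction."""
        free_fraction = free_gb / total_gb
        return total_gb * free_fraction ** 2

    def pick_filesystem(fss):
        names = list(fss)
        weights = [weight(*fss[name]) for name in names]
        return random.choices(names, weights=weights, k=1)[0]

    if __name__ == "__main__":
        # Over many picks, fs02 and fs03 should be chosen far more often than fs01.
        picks = [pick_filesystem(filesystems) for _ in range(10000)]
        for name in filesystems:
            print(name, picks.count(name))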

Friday

Log: The Castor DB crashed at 2pm. It came back more quickly thanks to the ext3 FS. The cause was again traced to the procedure called by the GC trigger.

Actions:

  • Check the possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado)

Discussion:

  • We should look at LCG Quattor WG components for LFC/DPM/R-GMA/lcg-gridftp-mon, etc.
  • It seems to be the procedure called by the trigger that is causing the problem. GC is now being done manually on both the WAN and internal CMSPROD pools. We will carry on like this for now.
  • Still no QF from gLite.
  • Delay the Castor2 upgrade by one week (week of 3rd October).

