CORAL LFCReplicaSvc overview
The LFCReplicaSvc is a plugin used by CORAL applications (including the COOL conditions database) to retrieve the physical database replicas corresponding to a given alias (dblookup functionality), as well as the Oracle usernames and passwords needed to connect to that physical database (authentication functionality).
Presently this tool is used only by LHCb.
It is foreseen that the LFCReplicaSvc will eventually be discontinued, as the same functionalities (dblookup and authentication on the Grid) will be provided by the new CORAL server component, currently under development.
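As an illustration of these two functionalities, here is a minimal sketch (not actual LHCb production code, and assuming the CORAL RelationalAccess interfaces of this period) of how a client application may select the LFC-based lookup and authentication services and then connect via a logical alias. The component name "CORAL/Services/LFCReplicaService" and the alias "MyCondDB" are placeholders for this example only.

  // Minimal sketch: select LFC-based lookup/authentication, then connect.
  #include "RelationalAccess/AccessMode.h"
  #include "RelationalAccess/ConnectionService.h"
  #include "RelationalAccess/IConnectionServiceConfiguration.h"
  #include "RelationalAccess/ISessionProxy.h"

  int main()
  {
    coral::ConnectionService svc;

    // Replace the default XML-based services (which read dblookup.xml and
    // authentication.xml) by the LFC-based ones; the component name below
    // is an assumed placeholder.
    svc.configuration().setLookupService( "CORAL/Services/LFCReplicaService" );
    svc.configuration().setAuthenticationService( "CORAL/Services/LFCReplicaService" );

    // Connect via a logical alias ("MyCondDB" is hypothetical): the lookup
    // service resolves the alias into physical replicas, while the
    // authentication service supplies the Oracle username and password.
    coral::ISessionProxy* session = svc.connect( "MyCondDB", coral::ReadOnly );

    // ... use the session, then delete the proxy to release the connection.
    delete session;
    return 0;
  }

The same client code works unchanged with the default XML-based services: only the two configuration calls select where the dblookup and authentication information comes from.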
CORAL LFCReplicaSvc post mortem
The original version of this post mortem is still available in PSSGroup.LFCReplicaSvcPostMortem, which is linked from LCG.WLCGServiceIncidents.
Thursday 11 June 2009 (LFC service degraded by CORAL use in LHCb)
Description of the problem
The problem was first reported in the WLCG daily meeting minutes and in the corresponding LHCb report by Roberto for Thursday 11 June.
Several jobs launched at CERN and at external Grid sites simultaneously accessed the R/O instance of LFC for LHCb at CERN to retrieve the dblookup and authentication information necessary to connect to the relevant Oracle servers containing COOL conditions data. Combined with the inefficient LFC access pattern of the CORAL LFCReplicaSvc, this degraded the LFC service for LHCb at CERN.
This is similar to the problem previously observed on April 16, but it affected the R/O instance of LFC for LHCb instead of the R/W instance (because the LHCb job configuration had been changed accordingly, as previously agreed).
Impact
The R/O instance of LFC for LHCb (1 node) was degraded. Connections were refused because the server was busy servicing over 60 connections and no more threads were available for new requests.
Only LHCb was impacted.
Actions
LHCb stopped using the CORAL LFCReplicaSvc to translate dblookup aliases and retrieve authentication credentials:
- A first workaround was introduced by LHCb on Friday 12 June: COOL conditions data were shipped as SQLite files, instead of being retrieved from the relevant Oracle servers using the dblookup and authentication information that would otherwise have come from the LFC server.
- A second workaround was introduced by LHCb on Tuesday 16 June: Grid jobs read the dblookup and authentication information from the two XML files dblookup.xml and authentication.xml (shipped together with the job configuration; sketches of both file formats are shown after this list), instead of retrieving it from the LFC server via the CORAL LFCReplicaSvc component.
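For reference, here are minimal sketches of the two file formats, assuming the standard CORAL XML lookup and authentication services; the alias "MyCondDB", the physical connection strings and the credentials are hypothetical placeholders.

  <!-- dblookup.xml: maps a logical alias to its physical replicas -->
  <servicelist>
    <logicalservice name="MyCondDB">
      <service name="oracle://server1/lhcb_conddb" accessMode="readonly" authentication="password"/>
      <service name="oracle://server2/lhcb_conddb" accessMode="readonly" authentication="password"/>
    </logicalservice>
  </servicelist>

  <!-- authentication.xml: credentials for each physical connection -->
  <connectionlist>
    <connection name="oracle://server1/lhcb_conddb">
      <parameter name="user" value="reader_account"/>
      <parameter name="password" value="xxxxxx"/>
    </connection>
  </connectionlist>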
After introducing the second workaround above, LHCb was able to successfully test access to COOL conditions data from Oracle. This is demonstrated in the figure below (courtesy of Philippe Charpentier), which shows the number of jobs executed on the Grid by LHCb during the last phase of STEP09. For comparison, the nominal number of jobs in LHCb is 1300 per day for a 50% LHC duty cycle, which is not far from what was achieved.
The discussion between the CORAL development team and LHCb has resumed, to clarify whether the CORAL changes should be given a higher priority than was agreed at the last meeting in May. In any case, work on this issue is not expected to start before July, when a former member of the CORAL team returns to CERN to start a fellowship.
Thursday 16 April 2009 (LFC service degraded by CORAL use in LHCb)
Description of the problem
The problem was first reported in the LFC support ticket opened by Roberto on Thursday 16 April 2009.
Several jobs launched at CERN and at external Grid sites simultaneously accessed the R/W instance of LFC for LHCb at CERN to retrieve the dblookup and authentication information necessary to connect to the relevant Oracle servers containing COOL conditions data. Combined with the inefficient LFC access pattern of the CORAL LFCReplicaSvc, this degraded the LFC service for LHCb at CERN.
Impact
The R/W instance of LFC for LHCb (3 nodes) was degraded. Connections were refused because the servers were busy servicing over 220 connections and no more threads were available for new requests on the three nodes.
The problem was caused by R/O connection attempts from CORAL, but it affected both R/O and R/W users in LHCb, both within and outside CORAL.
Only LHCb was impacted.
Actions
A first round of discussion took place on April 22 between the LFC and CORAL support teams (see minutes; note that a few minor corrections to these minutes are not included).
A meeting was then organised on May 7, including representatives from LHCb and from the LFC, CORAL and WLCG LHCb support teams (see minutes).
During the May 7 meeting it was agreed that the CORAL code should be patched according to the technical suggestions from the LFC team. These changes were expected to be done soon, but no firm timescale was set; it was explicitly agreed that they could not be done in time for the upcoming LCG56a release, which finally came out on May 20. A CORAL task was opened to modify the code (see task #9774), but the work has not started yet (as of June 18) due to limited manpower and other priorities.
During the May 7 meeting it was also agreed that LHCb would modify their job configuration to access the R/O instance of LFC at CERN, rather than the R/W instance. This has been done. It was agreed that accessing the R/O Streams replica of LFC at the closest T1 site would not yet be attempted, because this would require significant changes in the LHCb job configuration. Similarly, it was agreed that modifying the LHCb job submission pattern to add some throttling would not be attempted yet.
-- AndreaValassi - 18-Jun-2009