Points to remember (compiled from meetings and notes):

  • The objective of the popularity service is only to obtain a ranking of how popular a dataset is, not to identify lost/corrupted replicas or the sources of data access failures.
  • The unit of information will be the LFC directory.
  • Implementation should be inside LHCbDirac
  • The only information the job should send is the list of LFNs and their respective Storage Elements. More precisely, a new module should be added in the finalization step, which already exists for users' jobs as well. The information will be taken from:
    • The XML file (summary.xml), where the list of input files (LFNs) is reported. The "XML summary" is produced by the LHCb applications independently. It is inspected by LHCbDIRAC to know if the application reached the end of the computation (i.e. if it has processed all the input events).
    • The SE has to be obtained from DIRAC; it is not in the XML file. We could also record the site if that is easier, but formally the SE is the entity of real interest. The job knows it when resolving the input data.
  • Information about the files that were actually used by the job, not merely requested, should be passed on to the finalization step (e.g. a job may have 20 input files but read only 10 events). Does this also mean that partially used files are reported to the Popularity Service?

  • This new module should send the information to the Popularity Service, which will then populate the StorageUsageDB. Currently this database (on the volhcb12 development machine) has two tables: Popularity and DirMetadata. The Popularity table contains traces (DID, Count, SEName, InsertTime) of individual jobs. The jobs themselves only send entries of the form (SE, DirDict), where DirDict = { dir1: count1, dir2: count2, ... }. Change: instead of inserting the directory ID, which implied a query to the table of LFN directories, simply insert the LFN path; this saves a query. Then, asynchronously, when the Popularity agent creates the reports it will check whether the LFN paths stored in the Popularity table exist in the LFN directory table, and raise an error in case they are not there. DirMetadata contains bookkeeping information: configName, configVersion, Conditions, Proc.Pass, eventType, fileType, production, etc.
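The per-job payload described above can be sketched as follows. This is an illustrative aggregation only: build_usage_report is a hypothetical helper, not actual LHCbDIRAC code, and it assumes the job already knows each used LFN and the SE it was read from.

```python
import os

def build_usage_report(used_lfns_with_se):
    """Aggregate (LFN, SE) pairs into {SE: {directory: count}}.

    The unit of information is the LFN directory, so each file is
    counted against its parent directory.
    """
    report = {}
    for lfn, se in used_lfns_with_se:
        directory = os.path.dirname(lfn) + '/'
        dirDict = report.setdefault(se, {})
        dirDict[directory] = dirDict.get(directory, 0) + 1
    return report

usage = build_usage_report([
    ('/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst', 'SARA-DST'),
    ('/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000583_1.bhadron.dst', 'SARA-DST'),
])
```

The job would then send one (SE, DirDict) entry per Storage Element, and the service would record one trace per (directory, SE) pair.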

Implementation details/concerns

The existing steps and modules for jobs are available here. For LHCb these could be e.g. Gauss, Boole, Brunel, DaVinci, Bender, etc. The LHCb job workflow is defined in LHCbDIRAC/Interfaces/API/LHCbJob.py.

For example, in the method

def setApplication(...)
there is a call to
 __getGaudiApplicationStep(...)
which controls the definition for a Gaudi application step. Two modules are included by default (and several parameters): GaudiApplication.py and UserJobFinalization.py.

Another example, in the method

 setRootPythonScript(...)
there is a call to
__getRootApplicationStep(...) 
the two modules defined for a Root script step are: RootApplication.py and UserJobFinalization.py.

etc.

However, theoretically a job can have many steps, some of which can be standard Gaudi applications, but also custom (Root, Python) scripts.

Question Are we interested in collecting data access information only for user jobs running standard Gaudi applications like Brunel and DaVinci, or also for custom scripts and executables? NB: in case of multiple steps, the UserJobFinalization module will enable itself only at the end of the workflow.

Question But the "finalization step" referred to in meetings is actually a module, not a step (a DIRAC step)? It is in fact added at the end of each step of a multi-step job, but enabled only once, at the end of the job execution... so which "finalization step" are we talking about? Presumably, a separate module should be developed for collecting and sending the file usage data to the Popularity service; the UserJobFinalization module seems to be responsible only for uploading job output data. This means changing the logic of the LHCbJob class to include the new module in the workflow steps. Say we care only about standard Gaudi applications; then, in the __getGaudiApplicationStep(...) method, roughly the following additions should be made (assuming FileUsage.py is a new module in LHCbDIRAC/Workflow/Modules):

    # FileUsage.py would be a new module in LHCbDIRAC/Workflow/Modules
    moduleName = 'FileUsage'
    fileUsage = ModuleDefinition( moduleName )
    fileUsage.setDescription( 'Collects input file usage information and reports it to the Popularity service' )
    body = 'from %s.%s import %s\n' % ( self.importLocation, moduleName, moduleName )
    fileUsage.setBody( body )
...
    step.addModule( fileUsage )

Question There seems to be a Savannah task on the use of the XML summary. Is this considered finished? It is not included in the job workflow as far as I can see. I see two (identical?) modules developed:

VS

Are they any different? ALERT! The AnalyseXMLLogFile implementation parses both the application log and summary.xml, but it uses LHCbDIRAC/Core/Utilities/ProductionXMLLogAnalysis.py, which requires step_commons['listoutput'] to be defined. User jobs may lack it, so the module crashes with a KeyError :-/ Example output:

2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Initializing $Id: DiracPopularityService.txt,v 1.2 2012/01/11 13:41:50 danielar@nikhef.nl Exp $ 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile DEBUG: {'ParametricInputData': '', 'TotalSteps': '1', 'JobName': 'Name', 'Priority': '1', 'SoftwarePackages': 'DaVinci.v29r2', 'JobReport': <DIRAC.WorkloadManagementSystem.Client.JobReport.JobReport instance at 0xd11bcf8>, 'LogLevel': 'debug', 'OutputSandbox': '*.log;summary.data;summary.xml', 'JobType': 'User', 'SystemConfig': 'ANY', 'JOB_ID': '00000000', 'StdError': 'std.err', 'Request': <DIRAC.RequestManagementSystem.Client.RequestContainer.RequestContainer instance at 0xd0eec68>, 'AccountingReport': <DIRAC.AccountingSystem.Client.DataStoreClient.DataStoreClient instance at 0xd19a830>, 'ParametricInputSandbox': '', 'JobGroup': 'lhcb', 'StdOutput': 'std.out', 'Origin': 'DIRAC', 'Site': 'ANY', 'PRODUCTION_ID': '00000000', 'MaxCPUTime': '5000', 'LogFilePath': '/project/bfys/dremensk/ctmdev/LHCbDirac_v6r8p2/etc', 'InputData': ''} 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile DEBUG: {'applicationName': 'DaVinci', 'STEP_DEFINITION_NAME': 'DaVinciStep1', 'applicationVersion': 'v29r2', 'JOB_ID': '00000000', 'optionsLine': '', 'STEP_NUMBER': '1', 'StartStats': (4.2199999999999998, 0.23999999999999999, 0.0, 0.0, 8676995.6999999993), 'STEP_INSTANCE_NAME': 'RunDaVinciStep1', 'inputDataType': 'DATA', 'applicationLog': 'Step1_DaVinci_v29r2.log', 'optionsFile': '/project/bfys/dremensk/DaVinci-Default.py', 'PRODUCTION_ID': '00000000', 'STEP_ID': '00000000_00000000_1', 'StartTime': 1326251676.3210549, 'inputData': 'LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst'} 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Input data defined in workflow for this Gaudi Application step 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  VERB: Job has no input data requirement 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  VERB: Performing log file analysis for Step1_DaVinci_v29r2.log 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Resolved the step input data to be:
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Resolved the job input data to be:
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO:  
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Attempting to open log file: Step1_DaVinci_v29r2.log 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Attempting to parse xml log file: summaryDaVinci_00000000_00000000_1.xml 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Check application ended successfully e.g. searching for "Application Manager Finalized successfully" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Terminating event processing loop due to errors" meaning job would fail with "Event Loop Not Terminated" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "SysError in <TDCacheFile::ReadBuffer>: error reading from file" meaning job would fail with "DCACHE connection error" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for " glibc " meaning job would fail with "Problem with glibc" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Failed to resolve" meaning job would fail with "IODataManager error" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Writer failed" meaning job would fail with "Writer failed" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Not found DLL" meaning job would fail with "Not found DLL" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Cannot connect to database" meaning job would fail with "error database connection" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Standard std::exception is caught" meaning job would fail with "Exception caught" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Error: connectDataIO" meaning job would fail with "connectDataIO error" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Bus error" meaning job would fail with "Bus error" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "segmentation violation" meaning job would fail with "segmentation violation" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Error:connectDataIO" meaning job would fail with "connectDataIO error" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "GaussTape failed" meaning job would fail with "GaussTape failed" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "Could not connect" meaning job would fail with "CASTOR error connection" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: Checking for "User defined signal 1" meaning job would fail with "User defined signal 1" 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: XMLSummary reports success = True  
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: XMLSummary reports step finalized 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: 0 file(s) on fail status, 0 file(s) on part status, 1 file(s) on full status, 0 file(s) on other status, 0 file(s) on mult status 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile ERROR: {'Data': {}, 'Message': '0 file(s) on fail status, 0 file(s) on part status, 1 file(s) on full status, 0 file(s) on other status, 0 file(s) on mult status', 'OK': False} 
Exception while module execution
Module DaVinciStep1 AnalyseXMLLogFile
'listoutput'
== EXCEPTION ==
<type 'exceptions.KeyError'>: 'listoutput'

  File "/project/bfys/dremensk/cmtdev/LHCbDirac_v6r8p2/InstallArea/python/DIRAC/Core/Workflow/Step.py", line 272, in execute
    result = step_exec_modules[mod_inst_name].execute()

  File "/project/bfys/dremensk/cmtdev/LHCbDirac_v6r8p2/InstallArea/python/LHCbDIRAC/Workflow/Modules/AnalyseXMLLogFile.py", line 108, in execute
    self.__finalizeWithErrors( result['Message'] )

  File "/project/bfys/dremensk/cmtdev/LHCbDirac_v6r8p2/InstallArea/python/LHCbDIRAC/Workflow/Modules/AnalyseXMLLogFile.py", line 258, in __finalizeWithErrors
    self.workflow_commons['outputList'] = self.step_commons['listoutput']
===============
All the data access information is effectively there in the last printed lines and can be reused:
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: XMLSummary reports success = True  
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: XMLSummary reports step finalized 
2012-01-11 03:14:59 UTC dirac-jobexec/AnalyseXMLLogFile  INFO: 0 file(s) on fail status, 0 file(s) on part status, 1 file(s) on full status, 0 file(s) on other status, 0 file(s) on mult status 

To enable producing the summary.xml at the end of a job's execution, the following should be put in the job options file:

danielar@herault etc $ cat DaVinci-Default.py
from Configurables import DaVinci

d = DaVinci()
DaVinci().DataType = "2011"
DaVinci().EvtMax = 15000

from Configurables import LHCbApp
LHCbApp().XMLSummary='summary.xml'
from Configurables import XMLSummarySvc
xmlsum=XMLSummarySvc("CounterSummarySvc")

Question How will we force jobs to produce the summary.xml, when it is only optionally set by the users? Or should we parse the data access information directly from the job log file (e.g. Step1_DaVinci_v29r2.log)?

Four possible statuses of the files reported in summary.xml after the payload execution:

- full : the file has been fully read

- part : the file has been partially read

- mult : the file has been read multiple times

- fail : failure while reading the file

An example:

danielar@herault 2743 $ cat summary.xml 
<?xml version="1.0" encoding="UTF-8"?>
<summary version="1.0" xsi:noNamespaceSchemaLocation="$XMLSUMMARYBASEROOT/xml/XMLSummary.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <success>True</success>
   <step>finalize</step>
   <usage>
      <stat unit="KB" useOf="MemoryMaximum">747096.0</stat>
   </usage>
   <input>
      <file GUID="" name="LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst" status="full">12092</file>
      <file GUID="" name="LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000583_1.bhadron.dst" status="part">2908</file>
   </input>
   <output />
   <counters>
      <counter name="DaVinciInitAlg/Events">15000</counter>
      <counter name="CounterSummarySvc/handled">15003</counter>
   </counters>
   <lumiCounters />
</summary>
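Extracting what the Popularity report needs from such a file can be done with the Python standard library alone. A minimal sketch (parse_input_files is a hypothetical helper; the sample XML below is the example above trimmed of its schema attributes), checking that the application succeeded and reached the finalize step before reporting anything:

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the summary.xml shown above (schema attributes omitted)
SUMMARY = """<summary version="1.0">
   <success>True</success>
   <step>finalize</step>
   <input>
      <file GUID="" name="LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst" status="full">12092</file>
      <file GUID="" name="LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000583_1.bhadron.dst" status="part">2908</file>
   </input>
</summary>"""

def parse_input_files(xml_text):
    """Return (ok, {lfn: status}); ok only if the run succeeded and finalized."""
    root = ET.fromstring(xml_text)
    ok = root.findtext('success') == 'True' and root.findtext('step') == 'finalize'
    files = {}
    for f in root.findall('./input/file'):
        name = f.get('name', '')
        if name.startswith('LFN:'):   # PFN-only entries are a separate open question
            files[name[4:]] = f.get('status')
    return ok, files

ok, files = parse_input_files(SUMMARY)
```

The resulting {lfn: status} mapping is exactly what a FileUsage-style module would need before grouping by directory and attaching the SE.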

We should report only when the application execution is successful and the finalize step is reached.

ALERT! However, we can't force users to use LFNs; for instance, a job can be specified with the PFNs directly, and it will still work:

from Configurables import DaVinci
##############################################################################
d = DaVinci()
DaVinci().DataType = "2011"
DaVinci().EvtMax = 15000
from Gaudi.Configuration import *
EventSelector().Input=[ "DATAFILE='dcap://bee37.grid.sara.nl:22125/pnfs/grid.sara.nl/data/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst' TYP='POOL_ROOTTREE' OPT='READ'"];

In this case the job's inputData parameter is an empty list, and the JobWrapper does NOT go through the InputDataResolution step (no Input Data Policy is applied). The job nevertheless reads events successfully, and the summary.xml file will look like:

<?xml version="1.0" encoding="UTF-8"?>
<summary version="1.0" xsi:noNamespaceSchemaLocation="$XMLSUMMARYBASEROOT/xml/XMLSummary.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <success>True</success>
   <step>finalize</step>
   <usage>
      <stat unit="KB" useOf="MemoryMaximum">578596.0</stat>
   </usage>
   <input>
      <file GUID="4C889DC0-8929-E111-AAE4-0030487DF702" name="PFN:dcap://bee37.grid.sara.nl:22125/pnfs/grid.sara.nl/data/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst" status="full">12092</file>
   </input>
   <output />
   <counters>
      <counter name="DaVinciInitAlg/Events">12092</counter>
      <counter name="CounterSummarySvc/handled">12094</counter>
   </counters>
   <lumiCounters />
</summary>
Which means the LFNs are not reported in the summary.xml, but the PFNs are. Do we consider this use case, or just ignore it? AFAIK people often (especially for CASTOR) specify the job input with
 Input = "DATAFILE='PFN:castor:/castor/cern.ch/user/t/test/blabla.dst' TYP='POOL_ROOTTREE' OPT='READ'" 
An example of such a job is at /afs/cern.ch/user/d/dremensk/DEBUG_2867 (Setup: LHCb-Development, JobID=2867).

Question I guess in such a case we can only report the SITE and the PFN, not the SE?
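One possible answer, sketched below under the assumption that a report entry can carry either an SE or a site; make_trace and the field names are invented for illustration, not part of any existing schema. When the summary.xml entry is a PFN and no SE was resolved, fall back to reporting the site and the PFN.

```python
def make_trace(name, status, se=None, site=None):
    """Build one popularity trace from a summary.xml <file> entry.

    Falls back to (site, PFN) when the job bypassed input data
    resolution and therefore no SE is known.  All field names here
    are illustrative only.
    """
    if name.startswith('LFN:') and se:
        return {'lfn': name[4:], 'se': se, 'status': status}
    # PFN-only jobs: the SE is unknown, report the site instead
    return {'pfn': name.replace('PFN:', '', 1), 'site': site, 'status': status}

lfn_trace = make_trace(
    'LFN:/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst',
    'full', se='SARA-DST')
pfn_trace = make_trace(
    'PFN:castor:/castor/cern.ch/user/t/test/blabla.dst',
    'full', site='LCG.CERN.ch')
```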

Local testing of new modules:

In dirac.cfg, the important lines for local testing are shown below (use the LHCb-Development setup to test on volhcb12; comment out the other Configuration servers, leaving only volhcb18, since it times out with the rest; LocalArea and SharedArea must be defined in LocalSite):

DIRAC
{
  Setup = LHCb-Development
...
Configuration
  {
    Version = 2011-11-28 08:54:23.394934
    Name = LHCb-Prod
    #@@-rgracian@diracAdmin - /DC=es/DC=irisgrid/O=ecm-ub/CN=Ricardo-Graciani-Diaz
    EnableAutoMerge = yes
    Servers = dips://volhcb18.cern.ch:9135/Configuration/Server
    #Servers += dips://volhcb12.cern.ch:9135/Configuration/Server
    #Servers += dips://lhcb-kit.gridka.de:9135/Configuration/Server
    #Servers += dips://volhcb19.cern.ch:9135/Configuration/Server
    #Servers += dips://kot.nikhef.nl:9135/Configuration/Server
    #Servers += dips://vobox07.pic.es:9135/Configuration/Server
    #Servers += dips://volhcb30.cern.ch:9135/Configuration/Server
    #Servers += dips://cclcglhcb01.in2p3.fr:9135/Configuration/Server
    #Servers += dips://lcgvo-s3-03.gridpp.rl.ac.uk:9135/Configuration/Server
    #Servers += dips://lhcbprod.pic.es:9135/Configuration/Server
    #Servers += dips://ui01-lhcb.cr.cnaf.infn.it:9135/Configuration/Server
    MasterServer = dips://volhcb18.cern.ch:9135/Configuration/Server
  }
...
LocalSite
{
  #@@-rgracian@diracAdmin - 2011-06-22 17:01:24
  FileCatalog = LcgFileCatalogCombined
  LocalSE = SARA-RAW
  LocalSE += SARA-RDST
  LocalSE += SARA-ARCHIVE
  LocalSE += SARA-DST
  LocalSE += SARA_M-DST
  LocalSE += SARA-USER
  Architecture = x86_64-slc5-gcc43-opt
  Site = LCG.NIKHEF.nl
  LocalArea = .
  SharedArea = /project/bfys/lhcb/sw
}

Testing jobs (mode = 'local') without submitting them to the WMS and involving the entire DIRAC machinery (the other mode is 'agent'):

from LHCbDIRAC.Interfaces.API.DiracLHCb import DiracLHCb
from LHCbDIRAC.Interfaces.API.LHCbJob import LHCbJob
j = LHCbJob()
j.setCPUTime(5000)
j.setApplication('DaVinci','v29r2','/project/bfys/dremensk/DaVinci-Default.py', inputData=['/lhcb/LHCb/Collision11/BHADRON.DST/00012957/0000/00012957_00000753_1.bhadron.dst'])
j.setOutputSandbox(['*.log','summary.data','summary.xml'])
j.setLogLevel('debug')
j.setInputDataType('DATA')
dirac = DiracLHCb()
jobID = dirac.submit(j,mode='local')
print 'Submission Result: ',jobID
print j._toXML()
The last line will generate an XML description of the job requirements. The workflow modules, steps, parameters and input data are defined there, and can be modified to include new modules/steps. An example can be found at /afs/cern.ch/user/d/dremensk/JobInput.xml. To run it locally:

export VO_LHCB_SW_DIR=/project/bfys/lhcb/sw
SetupProject LHCbDIRAC v6r8p2
export DIRACSYSCONFIG=/project/bfys/dremensk/cmtdev/LHCbDirac_v6r8p2/etc/dirac.cfg
dirac-jobexec JobInput.xml -o LogLevel=debug

Once the module is implemented....

A simple example to insert entries in the Popularity table:
from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Core.DISET.RPCClient import RPCClient
s = RPCClient("DataManagement/DataUsage")
se = 'CERN-DST'
dirDict = {}
dirDict['/test/'] = 2
dirDict['/lhcb/certification/test/ALLSTREAMS.DST/00000002/0000/'] = 5
s.sendDataUsageReport( se, dirDict )

-- DanielaRemenska - 11-Jan-2012
