DBSQuerries

Objective

The objective of this twiki is to gather all of the DBS querries used by comp ops. So what is needed is clear for DBS and DAS developers.

Used by dbsTest

This is the list of request used by the dbsTest script dbsTest.

Use case: This script looks for a request (MonteCarlo or Data).

Lets say for MonteCarlo:

Then each request it normally has an input dataset, a run white list, a run black list, a block whitelist a block black list and an output dataset.

So what I use DBS for is to calculate the input events. Which means calculating how many events are in the input dataset, given the constrains of runs being in the run whitelist and runs not being in the blacklist and same for blocks.

Then I calculate OutputEvents, for this is really simple just calculaing the number of events in the output dataset which is is totally supported now in DAS.

Now for Data:

We do the same calculations as above but we cannot rely in number of events (because of skim requests), but we have to rely in the run, count (lumi) pairs. Basically what we do is check the run, lumi pairs in the input dataset against the output. And here again we have the constrains of run and block white and blacklists.

Frequency : 100 queries like this per hour per day

Count events in a dataset for some runs.

find dataset,sum(block.numevents) where dataset=dataset_a and (run=1 or run=2)
DAS: accepted see ticket 3972, will be supported via filesummaries DBS3 API, but need to resolve run-ranges issue, see ticket 4157. Currently this functionality can be achieved by using multiple DAS queries (starting from DAS march upgrade):
  • summary dataset=/a/b/c run=1 | grep summary.nevents
  • summary dataset=/a/b/c run=2 | grep summary.nevents

DBS3 : We will support run-ranges without wildcard support in dataset name.

Meeting on May 7th: Run ranges is not available in DBS3 yet. This query can be accomplished now by spliting it by single runs and then summing up the results.


Count events in a dataset

 find dataset,sum(block.numevents) where dataset=dataset_a 
DAS equivalent dataset=/a/b/c | grep dataset.name, dataset.nevents

DBS3: It is supported via filesummaries API.


Count (run,lumi) in a dataset (for Data requests) and sum them up

 
./dbssql --limit=10000 --input='find run, count(lumi) where dataset="+dataset+"'"+ "| \
grep '[0-9]\{1,\}' | awk '{s+=$2}END{print s}' 
DAS: you can obtain run,lumi pairs by the following query
run,lumi dataset=/a/b/c
run,lumi dataset=/a/b/c run in [1,3]

DBS3: This command (not the query) just returns the number of lumis in a dataset, that is already support by filesummaries API.

Question to ops: please clarify this use case, do you need to find number of lumis in a dataset or do you need run, nlumis pairs?

Ops: We need the run, nlumis pair, since the same lumi can be in many runs. So the important here is to know how many "unique lumis" in a dataset are? And by "unique lumis" I mean run, lumi pair. Because lumi=1 in run 1 is different from lumi=1 in run 2.

Meeting on May 7th: both cases are wanted by CompOps: 1) the run,lumi pairs, e.g. in order to find a given pair that was not processed; 2) unique number of lumis per dataset, e.g. to be able to check the output with the input dataset (Manuel said it's already available in DBS3, production). Valentin already made it available (run,lumi pairs) and CompOps should start testing it in das-test, if everything goes fine, it will be officially in production in the next cmsweb.


Count (run, lumi) in a list of blocks

./dbssql --limit=1000000 --input='find block, run, lumi where dataset="+dataset+"' | \
egrep (bloc_a|block_b) | awk '{print $2, $3}' | sort | uniq | wc -l"
Question: Do you need to count run/n_lumis within a block? Should this query be written as find block, count(run), count(lumi) where dataset=/a/b/c?

Yes we do

DAS: provides a way to find file,run,lumi triplets

file,run,lumi dataset=/a/b/c
file,run,lumi dataset=/a/b/c run in [1,3]
The look-up of block is not yet available, but I did requested from DBS3 the inclusion of block name into their filesummaries API output.

DBS3: Ops need to clarify, what they want to do here? Please, describe the clear use case. Looks like just counting lumis in a list of blocks.

Ops: We need the run, nlumis pair, since the same lumi can be in many runs. So the important here is to know how many "unique lumis" in a list of blocks ? And by "unique lumis" I mean run, lumi pair. Because lumi=1 in run 1 is different from lumi=1 in run 2.

Meeting on May 7th: I missed this part. Can someone from DAS/DBS comment on what the current status is? Is it already available in das (das-test)?

Used by CloseOutScript

This is the list of request used by the CloseOut script closeOut

Use Case:

Detect if in a MonteCarlo Dataset the same lumi is in more than one file.

Frequency

200 queries a day

Check if same lumi in a dataset is in more than one file. This is imporant to detect duplicate events in MotneCarlo datasets

./dbssql --limit=10000000 --input='find file, lumi where dataset="+dataset+"'| \
grep store| awk '{print $2}' | sort | uniq -c | awk '{print $1}' | sort | uniq | awk '{if ($1>1) print $1}'
DAS: you can look-up those pairs as following
file,lumi dataset=/a/b/c
file,lumi dataset=/a/b/c run in [1,3]

DBS3: Call listblocks for the given dataset and then loop over the blocks to get all the lumis via filelumis(block_name=....).

Comment to DBS3: I think this use case requires further clarification. If this query needs to be run very frequently I doubt that proposed DBS3 solution will scale. The dataset can produce hundreds of blocks and if comp-ops need to run this query on every MC dataset we will end-up with end-less DBS3 API calls. For example, a single dataset can produce 100 blocks. If listblocks API takes 1 second, we will need to spent 100seconds (if run sequentially) and invoke 100 API calls to DBS. Now we need to scale this with number of dataset comp-ops will require to check per day.

Example: I created a simple python code (basically that what DAS will do) and test this use case against DBS3 server. The code can be found here. I used a random dataset and got the following numbers:

dataset = /MinimumBias/Run2010B-Apr21ReReco-v1/RECO
nblocks = 44
ncalls  = 45
Elapsed time: 61.3133590221 sec
It means that DBS3 requires 1.5 sec per block to get file/lumi info, but this time is highly unstable, here is another example:
dataset = /DoubleMu/Run2011A-17Oct2011-ZMu-PromptSkim-v4/RECO
nblocks = 2
ncalls  = 3
Elapsed time: 23.0996370316 sec
This request leads to 10 sec per block to get file/lumi info. Finally, here is another example:
dataset = /SingleMu/Run2012D-ZMu-PromptSkim-v1/RAW-RECO
nblocks = 104
ncalls  = 105
Elapsed time: 270.388787985 sec
which shows the problem. To complete this single request DAS needs 4.5 minutes and I doubt that such request will be rare. Therefore I'm not sure how we can around this problem for CMS dataset with large number of blocks. DAS: this problem is solved by using concurrent requests to DBS3. Code will be available in May upgrade.

Meeting on May 7th: it's already available in das-test and CompOps needs to exercise it. Run range will also be supported (need implementation on DBS3 side). There are 3 cases that CompOps can already use here: 1) find file,run,lumi 2) find file,lumi 3) find lumi

Used by RelVals

Getting the total number of lumis in a dataset + run white list:
dbs search --query="find run,count(lumi) where dataset=/SingleMu/Run2012D-ZMu-PromptSkim-v1/RAW-RECO and (run=204577 or run=205344)" \
--noheader | awk '{ lumi += $2 } END { print lumi }'
Use Case: It's used to gather the amount of lumi sections in a given dataset + run white list. Given the total # of lumi sections and the total # of events, we have an average number of events per lumi section. According to this value we can adjust the job splitting and provide an ~ accurate ETA.

DBS3: This command (not the query) just returns the number of lumis in two runs of a dataset. This can be retrieved from DBS3 by two queries to the filesummaries API.

  • filesummaries?dataset=..&run=204577
  • filesummaries?dataset=..&run=205344
Comment: TK4136 is for supporting run range for listblock. This use case is using filesummaries.

DAS: following DBS3 suggestion you can get this by using series of DAS queries (starting DAS march upgrade):

  • summary dataset=/a/b/c run=123 | grep summary.run, summary.nlumis
  • summary dataset=/a/b/c run=234 | grep summary.run, summary.nlumis

Question to comp-ops: do you need to find number of lumis (as command does) or do you need run, nlumis pairs?

Ops: We need the run, nlumis pair, since the same lumi can be in many runs. So the important here is to know how many "unique lumis" in a dataset are? And by "unique lumis" I mean (run, lumi) pair. Because lumi=1 in run 1 is different from lumi=1 in run 2.

Meeting on May 7th: DBS3 does not support run range or run list yet. However it can be accomplished by splitting the queries per run and summing them up in the end.


Getting the amount of events in a dataset + run white list:

dbs search --query="find sum(block.numevents) where dataset=/SingleMu/Run2012D-v1/RAW and (run=204577 or run=207231)"

DBS3: Is already supported by the filesummaries API. This can be retrieved from DBS3 by two queries to the filesummaries API.

  • filesummaries?dataset=..&run=204577
  • filesummaries?dataset=..&run=207231

DAS equivalent would be (these queries will appear in March upgrade):

  • summary dataset=/SingleMu/Run2012D-v1/RAW  run=204577 | grep summary.nevents
  • summary dataset=/SingleMu/Run2012D-v1/RAW  run=207231 | grep summary.nevents

Meeting on May 7th: already available.


Getting a list of blocks in a dataset + run white list:

dbs search --query="find block where dataset=/SinglePhoton/Run2012D-v1/RAW and (run=204577 or run=207231 or run=207454 or run=208307)"
DAS: can be supported via blocks DBS3 APIs, but awaiting for support of run-ranges in this API, see DBS3 ticket 4136, DAS ticket 3974

DBS3: This case requires run list instead of a run range as requested in TK4136. Currently, single runs are supported. So you get the output by four queries. We will add both run range and run list in the future.

Meeting on May 7th: not available yet in a single query (run range and run list dependency). Ops needs to split the queries per run.

Used by MC

check how many sites have a specific GEN dataset, in order to start the GEN-SIM step when we have enough slots
find site where dataset=%s and dataset.status=*
DAS already has this query: site dataset=/a/b/c
Question: do you need to specify the status of the dataset?

Answer: if I don't specify the status in DAS, the query works only on VALID datasets. We work on all kind of datasets in principle (PRODUCTION, VALID, INVALID... ), so status=* is needed to select datasets whose status is not VALID

DAS ticket 3971, UPDATE: I removed constrain on dataset status so the query will work for any dataset regardless of the status, therefore status is not required for this query, see ticket for update and upgrade/deployment information.

Question to ops: please provide concrete example of DAS query which does not work, since I checked and DAS seems to report correctly site info for dataset with production status. So I need concrete example to look at.

Meeting on May 7th: different definitions for "site", it can be the site name, the storage element and the real site name (?). DAS will first look for in PhEDEx, if it does not get an output, then it will query DBS. Different source (phedex or dbs) of information can be used in the queries, it can be specified by "system=phedex" or "system=dbs".


check the status and the good progress of a specific dataset

find sum(block.numevents),dataset.status,dataset.createdate,max(block.moddate) where dataset=%s

DAS equivalent dataset=/a/b/c | grep dataset.nevents, dataset.status, dataset.creation_time, dataset.modification_time, dataset.nblocks, dataset.nfiles


count the number of open blocks

find count(block) where dataset=%s and block.status=1
DAS examples
  • block dataset=/a/b/c | grep block.name,block.is_open=n | count(block.name)
  • block dataset=/a/b/c | grep block.name,block.is_open=y | count(block.name)

Used by OLI

dbs search --noheader --production \
--query="find block,block.size,block.numevents,block.createdate where tier = RAW and block.createdate >= 2013-02-01 and block.createdate < 2013-03-01"
Query for all blocks of a specific data tier in a month for monthly statistics.

DAS will support this once DBS3 will provide appropriate API, see DBS3 ticket 4148, DAS ticket 3975

Meeting on May 7th: there is no date range yet in DBS3 and since it can result in very large queries, the date window size needs further discussion/definition. Tier is also not available in DBS3 yet (only era, which does not cover our needs here). Further discussion here.

Edit | Attach | Watch | Print version | History: r23 < r22 < r21 < r20 < r19 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r23 - 2013-05-08 - ValentinKuznetsov
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    CMSPublic All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback