Data Quality Anomaly Detection
For offline monitoring we agreed on the following interface for anomaly detection:

compare(base_URL, candidate_URL, histogram_list)

base_URL -- reference file URL (whatever ROOT supports -- AFS, EOS, CASTOR, ...)
candidate_URL -- candidate file URL
histogram_list -- list of paths of histograms in the specified files to compare
The method returns JSON:

{
  'error_code': error_code,
  'error_description': error_description,
  'anomaly_weights': [0, 0, 0, ..., 0]   // weights of histograms containing an anomaly
}
anomaly_weights is a list of weights from the interval [0, 1]; the closer a weight is to 1, the higher the likelihood of an anomaly in that histogram. The length of the returned anomaly_weights array equals the length of histogram_list.

If the corresponding histogram is not found in the base_URL file, its weight is -1; if it is not found in the candidate_URL file, its weight is -2.
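
A minimal service-side sketch of compare, assuming PyROOT is available; the Kolmogorov test is only a stand-in for the actual anomaly model, and the error-code values shown are illustrative assumptions:

import ROOT

def compare(base_URL, candidate_URL, histogram_list):
    # Open both files with ROOT, which handles the supported protocols
    # (AFS, EOS, CASTOR, ...).
    base_file = ROOT.TFile.Open(base_URL)
    cand_file = ROOT.TFile.Open(candidate_URL)
    if not base_file or not cand_file:
        return {'error_code': 1,           # illustrative error code
                'error_description': 'cannot open input file(s)',
                'anomaly_weights': []}
    weights = []
    for path in histogram_list:
        base_hist = base_file.Get(path)
        cand_hist = cand_file.Get(path)
        if not base_hist:
            weights.append(-1)             # histogram missing in base_URL file
        elif not cand_hist:
            weights.append(-2)             # histogram missing in candidate_URL file
        else:
            # Placeholder scoring: 1 - Kolmogorov compatibility probability,
            # so values close to 1 indicate a likely anomaly.
            weights.append(1.0 - base_hist.KolmogorovTest(cand_hist))
    base_file.Close()
    cand_file.Close()
    return {'error_code': 0,               # illustrative: 0 means success
            'error_description': '',
            'anomaly_weights': weights}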
This interface will be published as an HTTP method on the Anomaly Detection service. It should be accessible via the urllib2.urlopen method (https://docs.python.org/2/library/urllib2.html?highlight=urllib2).
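
For illustration, a client could call the method like this (Python 2, matching the urllib2 reference above); the endpoint URL, the encoding of the arguments as a JSON POST body, and the file URLs and histogram path are assumptions:

import json
import urllib2

# Hypothetical service endpoint; the real host and port are not yet defined.
SERVICE_URL = 'http://anomaly-detection.example.cern.ch/compare'

def call_compare(base_URL, candidate_URL, histogram_list):
    payload = json.dumps({'base_URL': base_URL,
                          'candidate_URL': candidate_URL,
                          'histogram_list': histogram_list})
    request = urllib2.Request(SERVICE_URL, data=payload,
                              headers={'Content-Type': 'application/json'})
    return json.loads(urllib2.urlopen(request).read())

# Example call with hypothetical file URLs and histogram path:
result = call_compare('root://eoslhcb.cern.ch//eos/reference.root',
                      'root://eoslhcb.cern.ch//eos/candidate.root',
                      ['/Track/TrackMonitor/chi2PerDoF'])
print result['anomaly_weights']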
Access to this service should be restricted by networking means, e.g. iptables (no internal authentication would be required to access it).
Internally, the Anomaly Detection & Prediction Service should be able to cache its predictions.
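
A minimal illustration of such a cache, keyed on the request parameters and wrapping the service-side compare sketch above (a real implementation might persist results instead of keeping them in memory):

_prediction_cache = {}

def cached_compare(base_URL, candidate_URL, histogram_list):
    # The cache key covers all inputs; histogram_list is converted to a tuple
    # so that it can be used as a dictionary key.
    key = (base_URL, candidate_URL, tuple(histogram_list))
    if key not in _prediction_cache:
        _prediction_cache[key] = compare(base_URL, candidate_URL, histogram_list)
    return _prediction_cache[key]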
The service should eventually be deployed near the Presenter (somewhere on the online farm).
We will ask Niko for details. The preferred deployment method is via CVMFS.
--
AndreyUstyuzhanin - 2015-02-14