Machine Learning on Big Data for CMS use-case

The WLCG monitoring infrastructure is undergoing a major extension of both volume and variety of the monitoring data. This is due to either higher LHC luminosity expected in the next runs and new resource deployment to cloud-based platforms. In this scenario traditional relational database systems used to store and serve monitoring events hit scalability limits. They don’t scale well to very large sizes and are also not optimized for unstructured data search (i.e. Google type searching), they do not handle data in unexpected formats well and it is difficult to implement certain kinds of basic queries using SQL (e.g. the shortest path between two points).

On the other hand, “NoSQL” databases explicitly sacrifice some of the traditional SQL capabilities (transactions, locking, consistency, triggers, constraints on the format, granularity and timeliness of data etc…) in order to allow effective unlimited horizontal scalability. Over the past few years, a full stack of new technologies was originated by the development of the Hadoop file system, the MapReduce programming language, and associated databases. One of the key capabilities of a Hadoop-like environment is the ability to dynamically and easily expand the number of servers being used for data storage. The cost of storing large amounts of data in a relational database gets very expensive, where cost grows geometrically with the amount of data to be stored, reaching a limit in the petabyte range. The cost of storing data in a Hadoop solution grows linearly with the volume of data and there is no ultimate limit. The trade-off for running these large-scale computations is the high latency or, in certain cases, limited data models.

These pages envision the work that needs to be performed on the research, the design and the development of the new data store and analytics platform for the evolution of the WLCG monitoring and large scale data analysis. The project is done in collaboration with the CERN IT department and the CMS collaboration at INFN Pisa. It can be decomposed in two main areas of work.

  • Analysis of the state of the art technology and of the WLCG requirements. During this phase a constantly growing dataset will be stored on a Hadoop cluster. Several computing frameworks will be tested and compared: Java Hadoop , Scala Spark and Apache Pig. A comparison of the processing times required to aggregate the same class of monitoring data on database materialized views and MapReduce-like tasks will provide the first feedback. A bottom-up approach will be followed, by means of picking a limited amount of information that we want to learn about the system (dataset popularity, network statistics, file transfer activity) and starting pilot projects to see how they can be measured.
  • Learn how to run an analytics project and improve the way computing resources are used. The scope of this task goes beyond the purpose of resource monitoring, which is normally accomplished per sub-systems and presents scarce integration, which makes it hard to find correlations between different monitoring information. It aims at answering to questions like: can we predict the popularity of a new dataset? given parameters such as dataset type, content, replica, pattern of user activity, resources used in similar DS, etc… The results of this machine learning process will provide the WLCG job schedulers and transfer services a novel form of information not available so far.

The work so far and a (tentaive) outlook

  • Data warehouses at CERN
    • More than 50% (~ 300TB) of data stored in RDBMS at CERN are time series data, for example Grid monitoring and dashboards.
  • Get to know Big Data
    • Guidance of AWG at CERN-IT and INFN/CMS-Pisa
    • 2 orthogonal aspects: Big data (handling massive data volumes) and Analytics/Machine Learning (learning insights from data)
    • EOS popularity data from the last ~3 years: ~10K jobs/day, ~10KB log each
    • >>> Reproduce Oracle queries on HDFS
  • Pig and MR (my 2 cents)
    • “If you can naturally use Pig for a problem, you should definitely use Pig”
    • “If doing some work is tricky in Pig, will it probably be slow?”
    • “If don’t find a way out with Pig, then use Java MapReduce
  • List of Materialized Views
    • MV_xrdmon_rates_x_H, MV_xrdmon_procTime_x_H, MV_XRD_stat0 (MV_XRD_stat0_pre + V_XRD_LFC_aggr1), MV_DSName (MV_XRD_DS_stat0_aggr{1..4}), MV_XRD_stat1, MV_XRD_DS_stat1_aggr1, MV_DataTier, MV_XRD_stat2... and more intermediate ones
    • Bash script that runs 1 Pig job per MV, cycling from Jan to May 2015
    • Test input data: /project/awg/eos/processed/cms/2015
    • 3 available in Hadoop: 1) MV_xrdmon_rates_x_H, 2) MV_XRD_stat0, 3) MV_xrdmon_procTime_x_H
    • Output: MV_xrdmon_procTime_x_H: [YYMMDD,lfn,rb,procTime,readRate,host,decade], MV_xrdmon_rates_x_H: [otsYYMMDDhh,ctsYYMMDDhh,COUNT(replica)], MV_XRD_stat0: [lfn,otsYYMMDD,host,td,dsname, SUM(numAccesses),SUM(procTime),SUM(Bytes]
  • Lesson so far
    • Hadoop is good for data warehousing
    • Scalable, many interfaces to the data, in use at CERN for dashboards, system log analysis and analytics
    • MV daily build requires 20-80mins
    • Pig queries require 1-9mins
    • 10x speed up
    • >>> Replace popularity MVs with MR queries
  • Mobile Front-end (Android app)
    • Access distribution by user (graphs)
    • File statistics by site (maps)
    • ...
  • The next big thing: Machine Learning (aka "how to run Analytics?")
    • Provide new types of input to Job schedulers and File transfer services?
    • Bottom-up approach: choose a small amount of information we want to learn about the system and start pilot projects to see how they can be measured
    • Can we predict the popularity of a new dataset? need more replicas? delete unpopular replicas? pre-compute the proper number of replicas?
    • Learning insights from data: study algorithms which focus on computational efficiency, extract information automatically before new data is available, predict outcomes from defined metrics (DS type, content, replica, user “social activity”, resources in similar DS)
  • How to extract Insights?
    • Approaches: Clustering, Association learning, Parameter estimation, Recommendation engines, Classification, Similarity matching, Neural networks, Bayesian networks, Genetic algorithms
    • Basically, it’s all maths, so it’s tricky to learn
  • Top 10 M.L. algorithms
    • C4.5, k-means clustering, Support vector machines, the Apriori algorithm, the EM algorithm, PageRank, AdaBoost, k-nearest neighbors class, Naïve Bayes, CART
  • Conclusions
    • Big data & machine learning: huge field, growing very fast, many algorithms and techniques, difficult to see the big picture
    • Online training:, Book: "Mining of Massive Datasets"
FirstName Marco
LastName Meoni



Edit personal data
Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r4 - 2015-06-05 - MarcoMeoni
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    Main All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback