HEP noSQL landscape

Summary

ATLAS

Needs and Usage

Logging and monitoring

  • High write rates
  • Cassandra, MongoDB, HBase

Data analytics

  • Complex computations over lots of data
  • HBase, MongoDB, Cassandra

Content and Summary retrieval

Application backend

  • Low latency, schema-less design, “native” data structures
  • Cassandra, MongoDB

Cassandra1: PANDA WMS pilot job framework

  • What: archived reference data in Panda
  • Why: largely de-normalized, does not require most of the features of RDBMS
  • classes of queries which include time and date ranges where Oracle typically does not perform too well
  • scale the increasing no of jobs
  • existing expertise, easy to conf and install
  • interface: JSON
  • scaling from 4 nodes to?

CONF:

  • 3x 1TB SSD, 48gbram,2CPU - 6cores
  • +monitor machine
  • ganglia
  • composite indexes
  • pycassa, amazon

Cassandra2: DQ2 tracing

  • grid tracer service: data access and usage
  • monitoring in real-time
  • very high write rate

Tests:

  • ~5 million traces every day
  • Write row-by-row insertion(~3KB/row): MongoDB and Cassandra better than Ora (including noSQL)
  • Read one months traces (90M rows, 34GB): Cassandra -minutes vs Ora. production hours vs. Ora test seconds

Cassandra3: TDAQ Online info service

  • persistent monitoring for lost data
  • high rate monitoring: 2500Hz
CONF:
  • 3x 4 cores/4gb ram, raid
  • 7200 obj/s => 45 batch/s => 5.6MB/s
  • datastax monitoring soft

SimpleDB: PanDA monitoring

  • cloud storage Amazon
Best solution is actually
  • Python scripts mapreducing flat CSV files on EC2
  • Fraction of a second to produce any number of summaries

MongoDB: DQ2 Accounting

  • rich data model
  • one map reduce job - locking problem!!!
  • summary better than Ora star schema
  • works weell as long as all indexis in mem: 40 mio datasets fully indexed ~ 2GB
New app: popularity
  • persistent genetic alghoritm

HBase: DQ2 Accounting

  • evolution of value in time: rolling summaries
Current state
  • DDMLAB-hosted 10 node Hadoop cluster (with no special optimisation)
  • Full migration of Oracle content (400M rows, 20GB) into HBase data model (20M rows, 4GB): 2h 40m
  • MapReduce once over data: 40 mins (random reads ~4K blocks, ~2.5MB/sec per node)
    • HDFS replication factor 4… (40*4)/60 (replication factor to a full 10, then MapReduce in 15min)
    • Compared with Oracle: 5-6 hours, if it finishes at all…
    • Can do all accounting summaries in one MapReduce run
Still ToDo
  • One-way synchronisation from Oracle to HBase - not trivial (Scibd)
  • Summary retrieval - trivial
  • Ad-hoc queries need a slightly different data model than the MapReduce one

Hadoop: DQ2 Log analysis

  • pytohon based
Map-Reducing CSV files on Hadoop HDFS
  • Dump table from Oracle (DQ2 Traces)
  • Copy log files to HDFS (DQ2 Apache Logs)
  • Use Hadoop Streaming API
  • write results directly to HBase (with Java/Jython)
Map-Reduce 75GB in 5 minutes
  • with Java libraries it should be 2.5 times faster (memory allocation savings)

CMS

MongoDB: Data Aggregation System

  • What: search and aggregate information across different data-services without knowing the specific data sources
  • Why: DAS does not require data preservation and transaction capabilities; dynamic type of stored meta-data objects (no predefined schema)

CouchDB: WMAgent

  • show/propagate/replicate information (job history, workflow reports, ...) and to provide API to build other applications
  • reduce and decouple from the core database (MySQL,ORA) the monitoring load

KyotoCabinet/SQLite: IgProf: performance tuning of CMS software

Hadoop: future use probably just HDFS

Issues

  • mongo, couch, kyoto(from CMS Offline experience with NoSQL data stores - Valentin Kuznetsov, Database Futures Workshop, CERN, June 2011)

PES Batch Monitoring

Cassandra: Apache HBase - not sure of project phase, design, implementation, testing?

GT SAM

Cassandra: hadoop - Apache Pi

References

CERN: Web:

-- FaustinRoman - 08-Nov-2011

Edit | Attach | Watch | Print version | History: r5 < r4 < r3 < r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r5 - 2011-11-08 - FaustinRoman
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    ArdaGrid All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback