Saarland University | Department of Computer Science

Big Data Analytics Group

NOSQL - Managing Data (almost) without a Database System WS 10/11


  • the current wave of NOSQL data management systems and start-ups
  • e.g. MapReduce, Hadoop, Hadoop++, HBase, BigTable, Hypertable, CouchDB, MongoDB, Cassandra, SimpleDB, PNUTS, neo4j, voldemort, ...


  • understand the motivation for not using an existing DBMS
  • understand the technology behind these systems
  • understand when to use which system
  • understand to what degree these systems are reinventing the wheel


  • sound understanding of relational DBMS
  • i.e. at least a good grade in the Informationssysteme lecture or a comparable lecture

Administrative issues:

  • Time: Thursdays, 10:15 to 12:00
  • Place: E1.3, HS III
  • Type: advanced lecture


  • Admin, Introduction, Motivation
  • MapReduce:
    • accompanying slides
    • GFS original paper:
      Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google file system. SOSP 2003:29-43
    • MapReduce original paper:
      Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004:137-150
    • Hadoop++ paper with a detailed execution plan and relational mappings in Section 2 and Appendix B.1 ff.:
      Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Joerg Schad: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB 3(1):518-529 (2010)
    • open source implementation Hadoop with HDFS and Hadoop MapReduce
    • 39% of all companies interested in NOSQL! ;-)
  • MapReduce and Hadoop (continued), Nephele/PACTs
    • outline (continued from last week: special case of a SQL join/group-by query that can, surprisingly, be translated to a single MR job using only map()/reduce() or the ten UDFs; failover strategies of MapReduce, including task failure, master failure, skipping bad records, stragglers, and backup tasks; improvements, RAFT, local checkpointing, remote checkpointing, query metadata logging, PACTs, m()-functions, single-input contracts, how to model the MapReduce programming paradigm with PACTs, partitioning units)
    • RAFT tech report
    • Nephele/PACTs paper
    • Big Data and NoSQL March to the Enterprise
  • Nephele/PACTs (continued), BigTable and HBase
  • BigTable and HBase (continued)
    • accompanying slides
    • outline (TabletServer organization, log-structured writes, LSM tree, partitioned exponential file, indexing in BigTable, Chubby, detecting failed tablet servers, how to compare two index structures, compression, bitmaps)
    • The partitioned exponential file
  • BigTable (continued) and PigLatin
  • PigLatin (continued) and HiveQL
    • outline (HBase versus BigTable; Pig Latin demo: describe, foreach generate, filter, join, illustrate, flatten, stream, UDFs, physical execution; joins: replicated, skewed, merge; what versus how; parallel; Hive and HiveQL, semijoins, differences from HBase)
    • Hive
    • HiveQL Language Manual
  • BerkeleyDB
  • MongoDB
  • Storage and OctopusDB
  • Hadoop++
  • CIDR 2011, Dataspaces
  • RDF, Martin Theobald
  • Percolator and Incremental MapReduce, Rodrigo Rodrigues
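
The second lecture's outline mentions a special case of a SQL join/group-by query that fits into a single MR job. The following is a minimal sketch of that idea, not the construction used in the lecture: it assumes the group-by key equals the join key, and it runs on a toy in-memory MapReduce simulation rather than Hadoop. The tables customers(cust_id, name) and orders(cust_id, amount), the data, and the helper run_mapreduce are all hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)          # shuffle: group values by key
    output = []
    for key in sorted(groups):                 # sorted only for stable output
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical input: every record is tagged with the table it came from.
RECORDS = [
    ("customers", (1, "Ada")),
    ("customers", (2, "Bob")),
    ("orders", (1, 10.0)),
    ("orders", (1, 5.0)),
    ("orders", (2, 7.5)),
    ("orders", (3, 99.0)),   # no matching customer: dropped by the inner join
]


def map_fn(record):
    """Emit the join key (cust_id) and keep the table tag with the payload."""
    table, row = record
    yield row[0], (table, row[1])


def reduce_fn(cust_id, values):
    """Join and aggregate in one step, since all rows of a key meet here."""
    names = [v for tag, v in values if tag == "customers"]
    amounts = [v for tag, v in values if tag == "orders"]
    if amounts:                                # inner join needs both sides
        for name in names:
            yield (cust_id, name, sum(amounts))


# Roughly: SELECT c.cust_id, c.name, SUM(o.amount)
#          FROM customers c JOIN orders o ON c.cust_id = o.cust_id
#          GROUP BY c.cust_id, c.name
RESULT = run_mapreduce(RECORDS, map_fn, reduce_fn)
print(RESULT)
```

Because the shuffle already brings all rows with the same cust_id to one reduce call, the join and the aggregation need no second job. If the group-by key differed from the join key, this sketch would no longer apply as written.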

Exercises and Groups

  1. Time/Location
    • Fridays, 10:15 and 14:15
    • MPII, room 0.23
  2. Assistants
  3. Assignments
  4. Rules
    • you need at least 50% of the assignment points to be admitted to the final exam
    • you need to pass either the final exam or the repetition exam (the better result counts)
  5. Exams
    • (06-12-10) Please register in HISPOS by January 27, 2011 at the latest.
    • Mini Midterm in December (counts for 20% of your final grade)
    • Final Exam on February 10, 2011, 10:15 to 12:00
    • Repetition Exam in March
  6. Sample Solutions
    • Solutions for exercises 1 to 6
    • Solutions for exercises 7 to 9
    • Sample solutions for the assignments will be provided as PDF two weeks before the final exam.
    • Why "sample"? Many assignments have more than one correct solution.
    • TBA