Saarland University | Department of Computer Science

Big Data Analytics Group

Information Systems LAB: Processing Large Datasets using Map/Reduce and Pig Latin SoSe 09

Information and database systems are the backbone of most modern information processing architectures and a core technology without which today's economy, as well as many other aspects of our lives, would be impossible in its present form. In this course students apply advanced data management concepts to real-world problems and solve a data-oriented task by implementing advanced data management technology. The course is project-oriented: students solve one or more tasks by developing small software projects in teams of 3-4 people. This LAB is structured as follows:
  • The initial phase of the LAB consists of lecture-style presentations. The lecturer will introduce the concepts, tools, and application domains.
  • The second phase is practical. Students will prototype small software projects that help them get accustomed to the technology.
  • The third phase consists of lecture-style talks that introduce students to possible application domains. The lecturer and/or domain experts will present the details of these domains.
  • The fourth phase consists of designing and implementing a solution for a given application domain. The focus is on system-oriented work. Students may be asked to give a short presentation illustrating their solution before they implement it.

Topics will include:

This year we will focus on managing very large data sets (gigabytes, terabytes, petabytes) using Google's map/reduce paradigm and Yahoo!'s Pig Latin. We will look in detail at Hadoop, an implementation of map/reduce, and at the Apache project Pig. We will set up a working system on a large cluster. Then we will apply simple map/reduce pipelines as well as more advanced processing pipelines to a real application domain. Two topics from the bioinformatics domain (from Prof. Lenhof's group) will be provided. Note that you do not need to know any bioinformatics to participate in this course. However, you should be willing to understand the problem domain. We may also look at additional application areas and at additional research questions, including running map/reduce on large clusters, e.g. Amazon EC2.

Administrative issues:

  • Students are expected to have successfully passed the Database System core lecture or an equivalent lecture.
  • The number of participants is limited to 15 people.
  • Time and place: Fridays from 10:15 to 12:00
  • This course counts for 6 ECTS credits.
  • Requirements for passing this course:
    • Regular attendance of classes (you may miss at most two classes)
    • Active participation in project implementation
    • Successful demonstration of programming project (teams of 3-4 students are allowed)
    • Grades are based on a final project presentation. Each team has to present their solution in front of the other participants. The presentation consists of a talk using slides as well as a software demonstration.
  • Registration is closed.

Lecture Notes