We are flooded with data, be it data on the Web (HTML pages, Twitter, Facebook, map services, ...), structured data in databases (your money in bank accounts, addresses, cell phone data, school and university grades, flight information, taxes, medical records, ...), or data in scientific applications (gene data in bioinformatics, telescope data in astronomy, collider data in physics, measurements of seismic activity in geology, ...).
The way we access, manage, and process that data has a tremendous impact on:
performance. Though we sometimes think that a performance problem is due to a particular algorithm requiring too much CPU time, it is often the data access patterns and retrieval times that slow down a program. The reason for bad performance may be that data cannot be accessed and shipped to the CPU fast enough. For instance, you may be using unsuitable access methods to retrieve a single piece of information from a large data repository (see the sketch after this list). Or you might be using an inefficient data layout that ignores the memory hierarchy and hardware capabilities of modern processors. In addition, even if the data is retrieved efficiently, performance may suffer from picking the wrong analytical algorithms or from not scaling your system correctly.
reliability. What happens if your hard disk fails or your data center is flooded with water? How do you make sure that a consistent version of your data is accessible at all times? Can you afford to lose all your data? How do you exploit multi-threading for accessing data without corrupting your data repository?
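To make the access-method point above concrete, here is a minimal, purely illustrative Python sketch (the data, keys, and sizes are made up and not part of the lecture material): retrieving a single record from a large collection by a full scan versus by a hash-based lookup structure built over the same data.

    import time

    # A made-up data repository of two million (key, payload) records.
    records = [(i, f"payload-{i}") for i in range(2_000_000)]
    index = {key: payload for key, payload in records}  # hash index over the same data

    start = time.perf_counter()
    by_scan = next(payload for key, payload in records if key == 1_999_999)  # full scan
    scan_time = time.perf_counter() - start

    start = time.perf_counter()
    by_probe = index[1_999_999]  # single hash probe
    probe_time = time.perf_counter() - start

    assert by_scan == by_probe
    print(f"full scan: {scan_time:.4f}s, hash lookup: {probe_time:.6f}s")

Both calls return the same payload, but the scan touches every record on the way while the probe touches (essentially) one; the choice of access method, not the CPU, dominates the cost.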
If you are interested in these questions, this might be the right lecture for you.
In this core lecture you will learn how to answer these questions. You will learn fundamental data management algorithms and techniques that are used to build not only database systems but also search engines (Google), file systems, data warehouses, publish/subscribe systems (like Twitter), streaming systems, map services (Google Maps), and cloud infrastructures (like Amazon EC2).
These techniques and algorithms will allow you to design, plan, and build (almost) any kind of data management system. Topics include:
data layouts (horizontal and vertical partitioning, row stores, column stores, hybrid mappings, PAX, fractal design, compression, defragmentation; illustrated by the first sketch below)
indexing (one- and multidimensional, tree-structured, hash- and partition-based, B-trees, bulk-loading and external sorting, differential indexing, read- and write-optimized indexing, main-memory indexes, covering, composite, sparse and dense, direct and indirect, clustered and unclustered, main memory versus disk- and/or flash-based, bitmaps; illustrated by the second sketch below)
query processing algorithms (join algorithms for relational and spatial data, grouping and early aggregation, co-grouping, filtering, external sorting and partitioning; illustrated by the third sketch below)
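To give a first taste of the data layout topic, here is a minimal Python sketch (table contents and schema are made up for illustration) of the row-store versus column-store layouts named above: the same three records stored once row-wise and once column-wise.

    # Row store: one tuple per record, attributes stored together.
    row_store = [
        (1, "Alice", 23),
        (2, "Bob",   31),
        (3, "Carol", 27),
    ]

    # Column store: one list per attribute, records stored "vertically".
    column_store = {
        "id":   [1, 2, 3],
        "name": ["Alice", "Bob", "Carol"],
        "age":  [23, 31, 27],
    }

    # Aggregating a single attribute: the row store must visit every full
    # tuple, while the column store reads only the "age" column.
    avg_age_rows = sum(age for _, _, age in row_store) / len(row_store)
    avg_age_cols = sum(column_store["age"]) / len(column_store["age"])
    assert avg_age_rows == avg_age_cols

The results are identical; what differs is how much data each layout forces a scan of one attribute to touch, which is exactly the trade-off the lecture examines.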
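Similarly, here is a sketch of the sparse-versus-dense distinction from the indexing topic, assuming a sorted (clustered) data file split into fixed-size pages; the page size and key values are made up. A sparse index keeps one entry per page, a dense index one entry per record.

    import bisect

    PAGE_SIZE = 4
    data_file = sorted(range(0, 100, 3))  # sorted data file with keys 0, 3, ..., 99
    pages = [data_file[i:i + PAGE_SIZE] for i in range(0, len(data_file), PAGE_SIZE)]

    sparse_index = [page[0] for page in pages]  # one entry per page: its first key
    dense_index = {key: i for i, page in enumerate(pages) for key in page}  # one per key

    def lookup_sparse(key):
        # Binary-search the index for the last page whose first key is <= key,
        # then scan only that single page.
        p = bisect.bisect_right(sparse_index, key) - 1
        return key in pages[p]

    assert lookup_sparse(42) and 42 in dense_index          # key present
    assert not lookup_sparse(43) and 43 not in dense_index  # key absent

The sparse index is much smaller but needs the file to be sorted and pays one extra page scan per lookup; the dense index answers membership directly at the cost of one entry per record.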
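Finally, a sketch of one canonical query processing algorithm from the last item: an in-memory hash join of two made-up relations on their join key, combined with grouping and early aggregation (running sums are updated during the probe phase instead of materializing the join result first).

    from collections import defaultdict

    orders = [(1, "Alice"), (2, "Bob"), (3, "Alice")]   # (order_id, customer)
    items  = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)]  # (order_id, price)

    # Build phase: hash the smaller relation on the join key.
    build = defaultdict(list)
    for order_id, customer in orders:
        build[order_id].append(customer)

    # Probe phase with early aggregation: each matching tuple is folded
    # directly into the running per-customer totals.
    totals = defaultdict(float)
    for order_id, price in items:
        for customer in build.get(order_id, []):
            totals[customer] += price

    print(dict(totals))  # {'Alice': 17.5, 'Bob': 7.5}

The lecture develops such algorithms in full, including their external-memory variants for data that does not fit in main memory.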