VLDB09 Overview, part 2
I'm just back in Paris after four awesome days in Lyon, where I attended the Very Large Data Bases conference (VLDB). Overall this was a very interesting event, with both very low-level technical talks and architectural presentations. The event was really well organized, as I partially mentioned in my overview of the first two days, but there is still room for improvement.
The average quality of the selected papers was pretty good, but I personally thought that there were still too many non-innovative papers and badly prepared speakers. Come on guys, you've got the chance to present your work to an invaluable audience at a prestigious conference, the least you can do is rehearse enough that you don't have to skip your last 10 slides... Also, another detail: how the hell can you host 700 people in France and serve lunch without dessert? I may be a bit greedy, but I wasn't the only disappointed person. That's it for the traditional French whine part. (But there was good wine!)
So the last two days were full of interesting sessions in various domains. Wednesday morning began with a keynote about how database technologies help improve the performance of game and simulation engines. For a keynote, I would have preferred a more database-focused talk, because in the end it was just about having each tick of the game act in a map/reduce fashion to compute all changes together, with some tree structures used here and there, nothing very funky (but there were some great videos of old video games to compensate...).
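To make the "tick as map/reduce" idea concrete, here is a minimal sketch in Python. It is not taken from the keynote: the entity fields (`x`, `vx`, `hp`, `attack`, `target`) and the two rules (move, damage) are hypothetical, chosen only to show every entity emitting effects from the old state and all effects being combined at once.

```python
from collections import defaultdict

def map_entity(entity):
    """Map step: read the OLD state only and emit (target_id, effect) pairs."""
    # Hypothetical rules: every entity moves by its velocity and hits its target.
    yield entity["id"], ("move", entity["vx"])
    if entity.get("target") is not None:
        yield entity["target"], ("damage", entity["attack"])

def reduce_effects(entity, effects):
    """Reduce step: fold all effects aimed at one entity into its new state."""
    new_state = dict(entity)
    for kind, value in effects:
        if kind == "move":
            new_state["x"] += value
        elif kind == "damage":
            new_state["hp"] -= value
    return new_state

def tick(world):
    """One game tick: map over all entities, group effects by target, reduce."""
    grouped = defaultdict(list)
    for entity in world.values():
        for target_id, effect in map_entity(entity):
            grouped[target_id].append(effect)
    return {eid: reduce_effects(e, grouped[eid]) for eid, e in world.items()}

world = {
    1: {"id": 1, "x": 0, "vx": 2, "hp": 10, "attack": 3, "target": 2},
    2: {"id": 2, "x": 5, "vx": -1, "hp": 10, "attack": 4, "target": 1},
}
next_world = tick(world)
```

Because no entity sees another's half-updated state, the map step could in principle run in parallel over the entities, which is presumably the appeal for simulation engines.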
The afternoon started with a great session called Map/Reduce with three presentations:
- SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions: The talk was about a commercial product, AsterDB, and its alternative approach to Hadoop / Pig. The idea is that statisticians and other people interested in processing data prefer to write SQL queries instead of coding map/reduce functions, which are usually so specialized that they are not reusable. They address this concern by providing an easily extensible SQL language, so you can get around the current limits of SQL by writing a few lines of Java, for example, with the whole thing then being parallelized.
- Building a High-Level Dataflow System on top of MapReduce: The Pig Experience: Really interesting presentation by a Yahoo! engineer, one of the creators of Pig, the famous high-level language for expressing data analysis programs on top of Hadoop. The focus was on how Pig is designed and implemented to actually transform your logical instructions into physically distributed jobs and stages. Pretty neat.
- PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce: Highly technical talk from a Google fellow describing "a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation." A great breakthrough in a domain where lots of state-of-the-art learning algorithms are designed for a single machine.
The day ended with a pleasant panel discussion about "How Best to build Web Scale Data Managers ?". Lots of relevant questions and answers that pointed to the disappointing lack of an open source parallel database. Hopefully the next decade will solve that. (edit: I might not have understood some of the subtle sarcasm of one of the speakers, so let's remove the criticism).
Thursday was even more interesting, starting with the "10-year award keynote", rewarding the most influential paper of the last decade. The award went to the MonetDB team for their paper "Database Architecture Optimized for the New Bottleneck: Memory Access", describing a novel approach to database storage (column storage, well known nowadays) and to a join operation optimized for the CPU cache, thus avoiding the overhead of memory access, and how the column-oriented engine allows for ... They also talked about VectorWise and its enhanced query computation pipeline performing batch hit operations. Well done guys, it's pretty impressive work!
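For readers who have never looked at column storage, a tiny sketch of the difference in layout may help. This is only an illustration, not MonetDB's implementation: the table and its fields (`id`, `name`, `amount`) are made up, and Python cannot show the CPU cache effects the paper is about, only why a scan over one attribute touches less data.

```python
from array import array

# Row layout: each record keeps all of its attributes together.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 45.5),
    (3, "carol", 12.25),
]

# Column layout: one contiguous sequence per attribute.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "name": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}

def total_row_store(table):
    """Summing one attribute still walks over every full record."""
    return sum(record[2] for record in table)

def total_column_store(cols):
    """Summing one attribute only reads that attribute's contiguous array."""
    return sum(cols["amount"])

row_total = total_row_store(rows)
column_total = total_column_store(columns)
```

Both functions return the same total; the point is that the column version never has to load `id` or `name` to compute it.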
After a little coffee I attended a really great talk given by a postdoc from ETH Zurich (I don't know how many of these guys came to the conference, but there were a lot!) about "Data Processing on FPGAs". I didn't know much about these reprogrammable chips before, but the guy made them look pretty cool. The presentation described how you can achieve faster, parallelized sorts and actually save money thanks to the incredibly low power consumption of these chips (8W vs 102W for a CPU). (And the guy was well prepared and gave a dynamic and fun presentation!)
Another trendy topic followed: HadoopDB (yeah, there were Hadoop and map/reduce all over the place). HadoopDB is "an architectural hybrid of MapReduce and DBMS technologies for analytical workloads", a work done by a few students from Yale University. Their approach is to reconcile the two elephants (Postgres and Hadoop both have elephants as logos), put a modified version of Hive on top of that (Hive is a SQL => map/reduce job interface) and use the processing power of the databases to actually do what they are meant to do, in a distributed fashion. It's an interesting track even if it feels a bit like a big hack on top of hacks. And erm, I can't say much about the end of the presentation, we didn't get to see the last 10 slides or so... :/
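The general idea of letting each node's database do the heavy lifting can be sketched in a few lines. This is a toy under loud assumptions, not HadoopDB's code: in-memory SQLite databases stand in for the per-node Postgres instances, a plain Python loop stands in for Hadoop, and the `visits` table and the helper names are hypothetical.

```python
import sqlite3

def make_node(partition):
    """Hypothetical worker node: a private database holding one data partition."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (country TEXT, hits INTEGER)")
    db.executemany("INSERT INTO visits VALUES (?, ?)", partition)
    return db

def local_aggregate(db):
    """'Map' side: push the GROUP BY down to the node's own SQL engine."""
    query = "SELECT country, SUM(hits) FROM visits GROUP BY country"
    return db.execute(query).fetchall()

def merge(partials):
    """'Reduce' side: combine the partial sums returned by every node."""
    totals = {}
    for partial in partials:
        for country, hits in partial:
            totals[country] = totals.get(country, 0) + hits
    return totals

nodes = [
    make_node([("fr", 3), ("us", 5), ("fr", 2)]),
    make_node([("us", 1), ("de", 4)]),
]
totals = merge(local_aggregate(db) for db in nodes)
```

Each node ships only one small row per country instead of its raw data, which is the kind of work a database is good at.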
Finally, I ended my visit at VLDB09 with two presentations by Google interns about data mining to get structured result sets out of semi-structured pages with lists and tables. The papers are really neat and worth a look:
- Harvesting Relational Tables from Lists on the Web: A method to detect fields in unstructured lists and to successfully align them into a coherent table.
- Data Integration for the Relational Web: A description of a search engine that creates clusters of related tables on the web, ranks them and is able to create a single coherent table from the similar fields, and to actually extend the table on demand with new fields that may not have been directly included in the original tables. A really interesting approach that sounds like Google Squared.
Alright, I think I'm done with this dammnnnn long post that you probably haven't read to the end. I don't care, it's also a way for me to keep track of all this interesting stuff.
VLDB guys, congratulations and see you next year in Singapore!
VLDB 2009: Part 1
I get the chance to attend a few conferences each year all around the world. These events are an invaluable opportunity to learn about the latest innovations and get insights from some of the most respected people in our field.
This week I am attending Very Large Data Bases (VLDB) 2009, a major meeting for people who deal with huge amounts of data or are concerned with database performance. This event takes place each year in a different city, and we are lucky to host the 09 edition in Lyon, gathering all these brilliant people in France. Almost 700 attendees from 44 countries came here, and so far it has been worth it.
The conference runs until Friday, so I'll only share my thoughts on the past two days. First, the event is pretty well put together. Everything from the venue and the food to the quality of the speakers and the variety of selected papers is pretty attractive. As usual, you are given a conference guide, but this one is particularly well made, with every talk's abstract, a map of the city, side activities, etc. Special cool thing: all the papers are on a USB key included in your attendee package, thus saving trees, time and money. Pretty cool.
Yesterday was a day of workshops. I attended two of them: USETIM (Using Search Engine Technologies for Information Management) and BIRTE (Enable Real-Time Business Intelligence). USETIM was chaired by Gregory GREFENSTETTE, Exalead's Chief Science Officer. We saw some very interesting presentations, including some from big players like Microsoft, with a demo of their Symphony platform, which lets you build a custom search engine (both in terms of sources and display) through a simple drag-and-drop interface. Unfortunately, the two BIRTE talks I attended were not that interesting, being mostly driven by industrial products.
Today began with a keynote by Raghu Ramakrishnan, Chief Scientist for Audience and Cloud Computing at Yahoo!, who talked about key-value stores, one of my favorite subjects. He gave a presentation that could have been the full-length version of my talk at Ignite Velocity 2009 (San Jose): why these simple stores were created, what the trade-offs of using such technologies are, the details of Yahoo!'s PNUTS implementation, and a comparison with other systems.
Other highlights of the day included a presentation from another Yahoo! fellow about "Indexing Boolean Expressions", a session on Data Visualization (largely inspired by the excellent Visual Display of Quantitative Information), and finally a paper describing a good alternative to B+Trees (a well-known data structure in indexing technologies) for Flash devices and SSDs: Lazy Adaptive Trees. Congratulations to these authors.
Besides that, I really like the city, Lyon is a beautiful place! The banks of the Rhône are lovely and full of joyful people enjoying a little glass of wine after work, very pleasant.
Hi! I'm Jérémie, a Frenchman passionate about information retrieval, natural language processing, distributed computing, innovative web interfaces, entrepreneurship and wakeboarding!