CIWS Talk: "SAMOA: A Platform for Mining Big Data Streams"

Gianmarco De Francisci Morales (Yahoo Labs Barcelona)
August 29, 2014 - 10:00
DCC Auditorium, third floor.
Centro de Investigación de la Web Semántica




Social media and user-generated content are causing an ever-growing data deluge. The rate at which we produce data is growing steadily, creating larger and larger streams of continuously evolving data. Online news, micro-blogs, and search queries are just a few examples of these continuous streams of user activity. The value of these streams lies in their freshness and their relatedness to ongoing events. However, current (de facto standard) solutions for big data analysis are not designed to deal with evolving streams.


In this talk we introduce SAMOA (Scalable Advanced Massive Online Analysis), a platform for mining big data streams. SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks, such as classification and clustering, as well as programming abstractions for developing new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines, such as Storm and S4. It is written in Java and is available under the Apache Software License version 2.0.


About the speaker:


Gianmarco De Francisci Morales is a Research Scientist at Yahoo Labs Barcelona. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on large-scale data mining and big data, with a particular emphasis on Web mining and Data Intensive Scalable Computing systems. He is an active member of the open source community of the Apache Software Foundation, working on the Hadoop ecosystem (Giraph, S4), and a committer for the Apache Pig project. He is a co-organizer of the workshop series on Social News on the Web (SNOW), co-located with the WWW conference. He is one of the lead developers of SAMOA, an open-source platform for mining big data streams.