Description
Course language: English
More and more companies are actively investing human and financial resources in data management technologies in order to make decision making more effective. This is mainly because a wide class of data analytics operations that support decision makers (e.g., statistical analysis, machine learning, ad-hoc queries) require processing massive data sets to ensure highly precise results. In addition, companies are increasingly interested in extracting, transforming, storing and analysing massive and heterogeneous data collections coming from external sources (social networks, competitors' movements, market trends, etc.), typically made available by Web applications.
The main aim of this course is to give students a deep and solid understanding of the state of the art of Big Data systems and programming paradigms, and to enable them to devise and implement efficient algorithms for analysing massive data sets. The focus will be on paradigms based on distribution and shared-nothing parallelism, which are crucial to enable the implementation of algorithms that can be run on clusters of computers, scale as the size of input data increases, and can be safely executed even in the presence of system failures.
Lectures will give particular emphasis to the MapReduce paradigm and the internals of its runtime support, Hadoop, as well as to MapReduce-based systems, with a particular focus on Spark, which provides users with powerful programming tools and efficient execution support for complex data flows. Attention will then be given to mechanisms and algorithms for both iterative and interactive data processing. Particular attention will be given to SQL-like data querying, graph analysis, and the development of machine learning algorithms.
A large part of the course consists of lab sessions where students develop parallel algorithms for data querying and analysis, including algorithms for relational database operators, matrix operations, graph analysis, and clustering. Lab sessions use both desktop computers and Hadoop clusters on Google Cloud.
Relevant degree(s)
Associated track
Grading format
Numeric out of 20
Letter/reduced grade
For students of the MScT-Data Science for Business degree
Your assessment methods:
mid-term + final exam (only personal notes are allowed as reference documents) + project
Resits are permitted (resit grade retained)
ECTS credits earned: 5 ECTS
The grade obtained counts towards your GPA.