AN INTRODUCTION TO DATA-DRIVEN MODELLING & GRAPH CLUSTERING

Dr. Dante Conti

9.3.1 Summary

Now that society is immersed in the era of Information and Communication Technologies, the abundance of massive data in different fields and real-world applications has encouraged the use of Data Mining, Machine Learning and Artificial Intelligence approaches aimed at discovering and extracting non-trivial information from databases. These approaches are the result of multidisciplinary research and advances in Applied Mathematics, Statistics, Computer Science, Engineering and Physics. Some authors speak of a new era of Data Science and Data Scientists, referring to academic and professional profiles with skills in analytics, IT and multidisciplinary thinking, who solve problems through knowledge discovery in databases.

Currently, so-called data-driven models (DDM) are becoming increasingly common. DDM is based on analysing the data about a system, in particular finding connections between the system state variables (input, internal and output variables) without explicit knowledge of the physical behaviour of the system. These methods represent a large advance over conventional empirical modelling, with applications in Finance, Marketing, Medicine, Management and Environmental Sciences, among others.

The job market is seeking experts in Analytics. The most in-demand profiles include mathematicians, statisticians and engineers. Some European and American universities already include data science and data modelling in their undergraduate and graduate curricula in Applied Mathematics, Statistics, Systems Engineering and similar disciplines.

Data-driven modelling assumes that a sufficient amount of data describing the underlying system is available. The data are mainly used for classification, pattern recognition, and associative and predictive analysis.
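
As an illustration of the classification task mentioned above, the following is a minimal sketch in R using the rpart package from the course's package list. The built-in iris data set, the 100/50 train/test split and the seed are assumptions chosen only for illustration; they are not part of the course material.

```r
# Minimal classification sketch with a decision tree (rpart).
# Assumption: the built-in iris data stands in for any labelled data set.
library(rpart)

set.seed(7)                                  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(iris), 100)         # 100 rows for training, 50 for testing
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Learn Species from the four measurements, with no botanical rules supplied
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of unseen rows and measure the share of correct predictions
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
print(accuracy)
```

The model is built only from observed input/output pairs, which is the defining feature of the data-driven approach described in this summary.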

Under these premises, the objective of this course is to introduce students to data-driven modelling. A brief overview of the concepts and methodology will be presented, and the main methods will be described with the support of specialized software (in this case R: A language and environment for statistical computing). Emphasis will be placed on classification and clustering, in order to solve two real problems where data-driven modelling has been applied with successful results: (1) detecting consumption patterns in urban water networks and (2) graph analysis in flow networks, with a case study in air transport.
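
To give a flavour of what detecting consumption patterns can look like, here is a minimal clustering sketch using base R's kmeans. The hourly profiles are synthetic and the two profile shapes (a morning peak and an evening peak) are hypothetical; the real water-demand data and the clustering method used in the course may differ.

```r
# Minimal clustering sketch: group daily consumption profiles by shape.
# Assumption: all data below are simulated, not real water-demand records.
set.seed(42)
hours <- 0:23

# Two hypothetical daily shapes: one peaking at 8h, one peaking at 20h
morning_peak <- exp(-(hours - 8)^2 / 8)
evening_peak <- exp(-(hours - 20)^2 / 8)

# 30 noisy days of each shape; each row is one day, each column one hour
profiles <- rbind(
  t(replicate(30, morning_peak + rnorm(24, sd = 0.05))),
  t(replicate(30, evening_peak + rnorm(24, sd = 0.05)))
)

# k-means with 2 clusters; nstart reruns the algorithm from several random starts
fit <- kmeans(profiles, centers = 2, nstart = 10)

# Cross-tabulate the clusters found against the shape used to simulate each day
true_shape <- rep(c("morning", "evening"), each = 30)
print(table(cluster = fit$cluster, shape = true_shape))
```

Here the number of clusters is fixed at 2 because the simulation has two shapes; with real data that number is unknown and has to be chosen, for example with a cluster-validity index.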

The course is designed to be interactive. Two lecture sessions are scheduled (about 6 hours). The rest of the time will be reserved for solving real problems based on the Hydroinformatics application and/or the flow networks (graph theory) application, with support and coaching for the participants.

9.3.2 Prerequisites:

Participants should have attended previous courses in Operations Research or Linear Programming and in Basic Statistics; some knowledge of the R software is advisable.

For those with no knowledge of R, introductory material is available at:

https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
http://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machinelearning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec02.pdf
https://cran.r-project.org/web/views/MachineLearning.html

9.3.3 Software:

R: A language and environment for statistical computing
Available at:
https://cran.r-project.org/
Main packages to be used: igraph, igraphdata, randomForest, rpart, tree, e1071, NbClust.
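
One possible way to prepare an R session before the course is sketched below. It assumes a working internet connection and a configured CRAN mirror; the variable names are illustrative.

```r
# Install any of the course packages that are not yet present, then load them all.
pkgs <- c("igraph", "igraphdata", "randomForest", "rpart", "tree", "e1071", "NbClust")

# Package names are case-sensitive (note the capital C in NbClust)
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Attach every package; an error here means one of them failed to install
invisible(lapply(pkgs, library, character.only = TRUE))
```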

9.3.4 Scheduling:

9.3.4.1 Monday, July 3rd: Lecture 1

An introduction to data-driven modelling and its main algorithms (3-4 hours). Afternoon (from 2 p.m. or 3 p.m.). Homework: some R examples and presentation of the first problem, on water consumption patterns (Milan, Italy and London, U.K.).

9.3.4.2 Tuesday, July 4th: Lecture 2

Graph Theory and Graph Clustering: emphasis on shortest path applications and max-flow min-cut (3-4 hours). Afternoon (from 2 p.m. or 3 p.m.). Presentation of the second problem, on air transport in US airports. Homework: practice with igraph:
http://kateto.net/networks-r-igraph
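
As a warm-up for the igraph practice, the sketch below computes a shortest path and a maximum flow (with its minimum cut) on a small directed network. The five nodes, the distances and the capacities are invented for illustration and do not come from the US airports case study.

```r
# Shortest path and max-flow/min-cut on a toy directed network with igraph.
# Assumption: nodes A..E and all distance/capacity values are hypothetical.
library(igraph)

routes <- data.frame(
  from     = c("A", "A", "C", "B", "C", "D"),
  to       = c("B", "C", "B", "D", "D", "E"),
  distance = c(4, 2, 1, 5, 8, 3),     # edge lengths, used for shortest paths
  capacity = c(10, 5, 4, 8, 3, 12)    # edge capacities, used for max-flow
)
g <- graph_from_data_frame(routes, directed = TRUE)

src <- V(g)["A"]
dst <- V(g)["E"]

# Shortest path from A to E, following edge directions, weighted by distance
sp <- shortest_paths(g, from = src, to = dst, mode = "out",
                     weights = E(g)$distance)
path_names <- as_ids(sp$vpath[[1]])
dist_AE <- distances(g, v = src, to = dst, mode = "out",
                     weights = E(g)$distance)

# Maximum flow from A to E; the result also contains the edges of a minimum cut
flow <- max_flow(g, source = src, target = dst, capacity = E(g)$capacity)

print(path_names)   # vertices along the shortest path
print(dist_AE)      # its total length
print(flow$value)   # maximum flow value, equal to the minimum cut capacity
print(flow$cut)     # edges forming the minimum cut
```

The edge attributes are deliberately named distance and capacity and passed explicitly, so that it is clear which quantity each algorithm uses.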

9.3.4.3 From July 5th to 7th

Coaching for participants and work on the proposed problems. Participants will be divided into groups to make the problems easier to solve. I will be available from 9 a.m. to 7 p.m.

9.3.4.4 Saturday, July 8th

Final reports and oral presentations.

9.3.5 Languages:

Presentations and coaching activities will be in Portuguese. The bibliography is entirely in English.

9.3.6 Bibliography:

It is advisable to read (or at least quickly review) the following papers, which will be used throughout the week:
1) Survey: Graph clustering by Satu Elisa Schaeffer. Available at:
http://www.leonidzhukov.net/hse/2016/networks/papers/GraphClustering_Schaeffer07.pdf
2) Data-driven modelling: some past experiences and new approaches by Dimitri P. Solomatine and Avi Ostfeld. Available at:
http://jh.iwaponline.com/content/ppiwajhydro/10/1/3.full.pdf
3) Predictive models for forecasting hourly urban water demand by Manuel Herrera et al. Available at:
https://www.researchgate.net/publication/223694461_Predictive_models_for_forecasting_hourly_urban_water_demand_J_Hydrol_3871-2141-150
For further information, do not hesitate to contact me at:
[email address hidden by spam protection] or [email address hidden by spam protection]

Last modified on Tuesday, 20 December 2016 17:37