Statistical Learning
Unsupervised Learning Module

Academic Year 2024/2025 - Lecturer: ANTONIO PUNZO

Expected Learning Outcomes

 

1. Knowledge and understanding. The first module, "Unsupervised Learning", covers the fundamentals of two of the principal methods used in unsupervised learning: principal component analysis and cluster analysis.

 

2. Applying knowledge and understanding. On completion, the student will be able to: i) implement the main methods used in unsupervised learning; ii) summarize the main features of a dataset and properly extract knowledge from data.

 

3. Making judgements. On completion, the student will be able to choose a suitable statistical model, apply it, and perform the analysis using statistical software.

 

4. Communication skills. On completion, the student will be able to present the results of a statistical analysis and explain which conclusions can be drawn from them.

 

5. Learning skills. On completion, the student will be able to understand the structure of unsupervised learning.

Course Structure

The exam aims to evaluate the achievement of the learning objectives. It consists of an oral exam with questions on the course programme, together with the discussion of a report on a real-data analysis carried out using the methodologies covered during the course and the R statistical software.

Required Prerequisites

Basic notions in statistics, linear algebra, and computing.

Attendance of Lessons

Highly recommended.

Detailed Course Content

Statistical Models for Univariate Random Variables (0.5 CFU). Discrete and continuous random variables. Basic distribution functions. Expectation and variance. Statistical models for random variables. Parametric inference: classical properties of estimators; the maximum likelihood approach and its properties. Goodness-of-fit tests. Model Selection. R functions and packages. Illustration in R. (Slides) 
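
As a purely illustrative aside (not part of the official material), the following R sketch shows maximum likelihood fitting and model selection for a univariate variable; the simulated data, the candidate distributions, and the use of MASS::fitdistr() are assumptions made only for this example.

# Hedged sketch: maximum likelihood fitting and model selection in R.
# Data, candidate distributions, and MASS::fitdistr() are illustrative choices.
library(MASS)

set.seed(123)
x <- rgamma(200, shape = 2, rate = 0.5)        # simulated positive data

fit_gamma  <- fitdistr(x, densfun = "gamma")   # ML estimates of shape and rate
fit_normal <- fitdistr(x, densfun = "normal")  # ML estimates of mean and sd

# Compare the two fitted models via information criteria
AIC(fit_gamma, fit_normal)
BIC(fit_gamma, fit_normal)

# Simple goodness-of-fit check for the gamma model (Kolmogorov-Smirnov test)
ks.test(x, "pgamma",
        shape = fit_gamma$estimate["shape"],
        rate  = fit_gamma$estimate["rate"])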

Basics of Matrices (0.10 CFU). Matrices. Special matrices. Basic matrix identities. Trace. Inverse and determinant. Eigen-decomposition. Quadratic forms and definite matrices. (Bishop 2007, Appendix C) 
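
A minimal R sketch of the matrix operations listed above, using base R only; the matrix A is an arbitrary example chosen for illustration.

# Hedged sketch: basic matrix operations in base R.
A <- matrix(c(4, 1, 1, 3), nrow = 2)   # symmetric 2 x 2 matrix

sum(diag(A))      # trace
det(A)            # determinant
solve(A)          # inverse

eig <- eigen(A)   # eigen-decomposition: A = V diag(lambda) V'
eig$values        # eigenvalues (all positive => A is positive definite)
eig$vectors       # orthonormal eigenvectors

# Reconstruct A from its eigen-decomposition (up to numerical error)
eig$vectors %*% diag(eig$values) %*% t(eig$vectors)

# Quadratic form x' A x for a generic vector x
x <- c(1, -2)
drop(t(x) %*% A %*% x)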

Basics of Multivariate Modelling (0.15 CFU). Random vectors and their distributions. Mean vector, covariance matrix, and correlation matrix. Multivariate normal distribution: properties and effect of the covariance matrix on the shape of the contours. Data Matrix, centered data matrix, and standardized data matrix. (McNeil, Frey, and Embrechts 2005, Chapter 3) 
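
A minimal R sketch of the multivariate summaries listed above; the choice of the iris data set is only a convenient assumption for illustration.

# Hedged sketch: mean vector, covariance/correlation matrices, and the
# centred and standardized data matrices.
X <- as.matrix(iris[, 1:4])

colMeans(X)   # mean vector
cov(X)        # covariance matrix
cor(X)        # correlation matrix

X_centred      <- scale(X, center = TRUE, scale = FALSE)  # centred data matrix
X_standardized <- scale(X, center = TRUE, scale = TRUE)   # standardized data matrix

# Sanity check: standardized variables have zero mean and unit standard deviation
round(colMeans(X_standardized), 10)
apply(X_standardized, 2, sd)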

Principal Component Analysis (1 CFU). The goal of Principal Component Analysis (PCA). PCA as a tool for data visualization. Definition of principal components (PCs). PCA and Eigen-decomposition. Computing PCs. PCA: Geometrical interpretation. Choosing the number of PCs. Biplot. Illustration of PCA in R. (James G., Witten D., Hastie T., Tibshirani R. 2017, Chapter 10) 
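
A minimal R sketch of PCA; prcomp() and the iris data are illustrative choices and not necessarily the functions used in the course labs.

# Hedged sketch: PCA with base R.
X <- iris[, 1:4]                                  # numeric variables only

pca <- prcomp(X, center = TRUE, scale. = TRUE)    # PCA on standardized variables

summary(pca)       # proportion of variance explained by each PC
pca$rotation       # loadings (eigenvectors of the correlation matrix)
head(pca$x)        # scores of the observations on the PCs

screeplot(pca, type = "lines")   # scree plot to help choose the number of PCs
biplot(pca)                      # joint display of scores and loadings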

Cluster Analysis (0.75 CFU). Clustering distance/dissimilarity measures. Data types in Clustering. Data standardization. Distance matrix computation. R functions and packages. (Kassambara, 2017, Chapter 3) 
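
A minimal R sketch of distance matrix computation; the USArrests data, the standardization step, and the chosen metrics are illustrative assumptions.

# Hedged sketch: distance/dissimilarity computation in base R.
X <- scale(USArrests)                         # standardize to make variables comparable

d_euclid <- dist(X, method = "euclidean")     # Euclidean distance matrix
d_manhat <- dist(X, method = "manhattan")     # Manhattan distance matrix

as.matrix(d_euclid)[1:4, 1:4]                 # inspect a corner of the distance matrix

# Correlation-based dissimilarity between observations (one common choice)
d_corr <- as.dist(1 - cor(t(X)))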

Hierarchical clustering methods (1 CFU). Peculiarities. Agglomerative hierarchical clustering. Algorithm. Dendrogram. Linkage methods. Simplified example. Agglomerative hierarchical clustering methods using the data matrix. Illustration in R. (Kassambara, 2017, Chapter 7) 
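
A minimal R sketch of agglomerative hierarchical clustering with base R; the Ward linkage and the cut into four clusters are illustrative assumptions.

# Hedged sketch: agglomerative hierarchical clustering in base R.
X <- scale(USArrests)
d <- dist(X, method = "euclidean")

hc <- hclust(d, method = "ward.D2")    # agglomerative clustering, Ward linkage

plot(hc, cex = 0.6)                    # dendrogram
rect.hclust(hc, k = 4)                 # highlight a 4-cluster partition

clusters <- cutree(hc, k = 4)          # cluster labels for each observation
table(clusters)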

Partitioning (or partitional) clustering methods (1 CFU). Peculiarities. K-means clustering. Algorithm. R functions and packages. Illustration in R. K-medoids clustering. PAM Algorithm. R functions and packages. Illustration in R. (Kassambara, 2017, Chapters 4–5) 
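
A minimal R sketch of K-means (base R) and K-medoids via PAM (cluster package); the choice of K = 3 and of the USArrests data is illustrative.

# Hedged sketch: K-means and K-medoids (PAM) clustering.
library(cluster)

X <- scale(USArrests)

set.seed(123)
km <- kmeans(X, centers = 3, nstart = 25)   # K-means with 25 random restarts
km$centers                                  # cluster centroids
table(km$cluster)

pam_fit <- pam(X, k = 3)                    # K-medoids via the PAM algorithm
pam_fit$medoids                             # observations acting as medoids
table(pam_fit$clustering)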

Cluster Validation (0.5 CFU). Overview. Assessing Clustering Tendency. R functions and packages. Illustration in R. Determining the Optimal Number of Clusters. R functions and packages. Illustration in R. Cluster Validation Statistics: Internal and external measures. R functions and packages. Illustration in R. Choosing the Best Clustering Algorithm(s). Measures for comparing clustering algorithms. Cluster stability measures. R functions and packages. Illustration in R. (Kassambara, 2017, Chapters 11–14) 
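
A minimal R sketch of internal validation with the silhouette and of clustering tendency; the cluster and factoextra packages (the latter accompanies Kassambara's material) are assumptions about the tooling, and all settings are illustrative.

# Hedged sketch: cluster validation with silhouette widths and clustering tendency.
library(cluster)
library(factoextra)

X <- scale(USArrests)

km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(X))   # silhouette width of each observation
summary(sil)                             # average silhouette width per cluster

# Average silhouette width as a function of K, to help choose the number of clusters
fviz_nbclust(X, kmeans, method = "silhouette")

# Clustering tendency via the Hopkins statistic (see Kassambara 2017 for interpretation)
get_clust_tendency(X, n = nrow(X) - 1, graph = FALSE)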

Fuzzy Clustering (0.5 CFU). Preliminaries. Fuzzy clustering. Fuzzy K-Means. Fuzzy methods: Cluster validation. Cluster validation: fuzziness measures. Cluster validation: compactness/separation measures. Gustafson-Kessel Extensions of Fuzzy K-Means. Gustafson-Kessel-Babuska Fuzzy K-Means. Entropic Fuzzy K-Means. Fuzzy K-Means with Noise cluster. Combining entropy and a noise cluster. PAM algorithm in a fuzzy view: the Fuzzy K-Medoids. Non-Euclidean Fuzzy Relational Data Clustering. R functions and packages. Illustration in R. (Giordani et al., 2020, Chapter 5).
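
A minimal R sketch of fuzzy K-means using e1071::cmeans(), one of several R implementations (the course textbook may rely on a different package); the fuzziness parameter m = 2 and K = 3 are illustrative assumptions.

# Hedged sketch: fuzzy K-means (fuzzy c-means) clustering.
library(e1071)

X <- scale(USArrests)

set.seed(123)
fkm <- cmeans(X, centers = 3, m = 2)   # fuzzy K-means with fuzziness parameter m = 2

head(fkm$membership)    # membership degrees of each observation in each cluster
fkm$cluster             # crisp assignment: cluster with the highest membership

# A simple fuzziness summary: Dunn's partition coefficient (1/K = fully fuzzy, 1 = crisp)
mean(rowSums(fkm$membership^2))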

Model-Based Clustering (0.5 CFU). Preliminaries. Mixture models. Clustering with mixture models. Maximum a posteriori probability criterion. Gaussian mixtures. Parsimonious modeling via eigendecomposition. Choosing the number of mixture components and the best parsimonious configuration: the Bayesian information criterion. R functions and packages. Illustration in R. (Kassambara, 2017, Chapter 18). 
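
A minimal R sketch of model-based clustering with Gaussian mixtures via the mclust package; the data set and the range of components considered are illustrative assumptions.

# Hedged sketch: Gaussian mixture model-based clustering with BIC-based selection.
library(mclust)

X <- scale(USArrests)

mb <- Mclust(X, G = 1:6)    # fits parsimonious Gaussian mixtures, selects by BIC

summary(mb)                 # chosen number of components and covariance model
mb$modelName                # parsimonious configuration (e.g. "VVV", "EEI", ...)
mb$G                        # selected number of mixture components
head(mb$z)                  # posterior probabilities (MAP gives the classification)
plot(mb, what = "BIC")      # BIC values across models and numbers of components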

Textbook Information

· Bishop C. M. (2007). Pattern Recognition and Machine Learning, Springer, Cambridge.

· Giordani P., Ferraro M. B. and Martella F. (2020). An Introduction to Clustering with R, Springer.

· James G., Witten D., Hastie T., Tibshirani R. (2017). An Introduction to Statistical Learning with Applications in R, Springer, New York.

· Kassambara A. (2017). Practical Guide to Cluster Analysis in R.

· McNeil A. J., Frey R., Embrechts P. (2005). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, Princeton, New Jersey.

Course Planning

Subjects and text references:

1. Syllabus: illustration and explanation. Statistical Models for Univariate Random Variables. Discrete and continuous random variables. Basic distribution functions. Expectation and variance. Statistical models for random variables. (Slides)
2. Parametric inference: classical properties of estimators; the maximum likelihood approach and its properties. Goodness-of-fit tests. R functions and packages. Illustration in R. (Slides)
3. Basics of Matrices. Matrices. Special matrices. Basic matrix identities. Trace. Inverse and determinant. Eigen-decomposition. Quadratic forms and definite matrices. (Bishop 2007)
4. Basics of Multivariate Modelling. Random vectors and their distributions. Mean vector, covariance and correlation matrices. Multivariate normal distribution: properties and effect of the covariance matrix on the shape of the contours. (McNeil, Frey, and Embrechts 2005)
5. Data matrix, centred data matrix, and standardized data matrix. (Slides)
6. Principal Component Analysis (PCA). The goal of PCA. PCA as a tool for data visualization. Definition of principal components (PCs). (James, Witten, Hastie, and Tibshirani 2017)
7. PCA and eigen-decomposition. Computing PCs. PCA: geometrical interpretation. Choosing the number of PCs. Biplot. (James, Witten, Hastie, and Tibshirani 2017)
8. Illustration of PCA in R.
9. Cluster Analysis (CA). Clustering distance/dissimilarity measures. Data types in CA. Data standardization. Distance matrix computation. R functions and packages. (Kassambara 2017)
10. Hierarchical clustering methods. Peculiarities. Agglomerative hierarchical clustering. Algorithm. Dendrogram. Linkage methods. Simplified example. (Kassambara 2017)
11. Agglomerative hierarchical clustering methods using the data matrix. Illustration in R. (Kassambara 2017)
12. Partitioning (or partitional) clustering methods. Peculiarities. K-means clustering. Algorithm. R functions and packages. Illustration in R. (Kassambara 2017)
13. K-medoids clustering. PAM algorithm. R functions and packages. Illustration in R. (Kassambara 2017)
14. Cluster Validation. Overview. Assessing clustering tendency. R functions and packages. Illustration in R. (Kassambara 2017)
15. Determining the optimal number of clusters. R functions and packages. Illustration in R. (Kassambara 2017)
16. Cluster validation statistics: internal and external measures. R functions and packages. Illustration in R. (Kassambara 2017)
17. Choosing the best clustering algorithm(s). Measures for comparing clustering algorithms. R functions and packages. Illustration in R. (Kassambara 2017)
18. Cluster stability measures. R functions and packages. Illustration in R. (Kassambara 2017)
19. Fuzzy Clustering. Model-Based Clustering: preliminaries. Mixture models. Clustering with mixture models. Maximum a posteriori probability criterion. (Giordani, Ferraro, and Martella 2020)
20. Gaussian mixtures. Parsimonious modeling via eigendecomposition. Choosing the number of mixture components and the best parsimonious configuration: the Bayesian information criterion. R functions and packages. Illustration in R. (Giordani, Ferraro, and Martella 2020)

Learning Assessment

Learning Assessment Procedures

The purpose of the exam is to assess the attainment of the learning objectives. It involves an oral assessment featuring questions regarding the course content, as well as a discussion on a report detailing a practical data analysis conducted using the methodologies covered in the class and the R statistical software.

Examples of frequently asked questions and/or exercises

· Explain the maximum likelihood paradigm.

· Illustrate the K-means algorithm.

· What is the silhouette width? Explain.

· What are the peculiarities and advantages of model-based clustering?

· Illustrate the fuzzy K-medoids algorithm.

· When can we apply the likelihood-ratio test?
