Projects

Summary

Cities are the main ground on which our society and culture develop today. A diversified population and social cohesion are crucial for sustainable urban growth. However, cities are becoming more segregated and unequal, creating substantial differences in health, education, innovation, and economic growth outcomes even within the same urban area. Despite this, our current understanding of inequality is still based on census or survey information that is updated infrequently, contains only coarse-grained information, and is scattered across different agencies and institutions. Furthermore, these traditional data sources cannot keep pace with the sudden changes our society is experiencing. In this context, Esteban Moro (data scientist at MIT Connection Science at IDSS) pointed out to us the urgent need for a “rich Census” built from heterogeneous data sources to better understand and model inequality.

On a larger spatial scale, the movement of population from one region to another can be an important mechanism of spatial economic adjustment, affecting regional demographics and growth patterns. Predicting human migration as accurately as possible is therefore important for international trade, the spread of infectious diseases, and public policy development. One downside of most existing models is that either they have a fixed form and therefore cannot capture more complicated migration dynamics, or they are based on machine learning algorithms, which makes them questionable when used for governmental decision-making. The long-term collaboration with Prof. Roberto Basile (University of L’Aquila) on spatio-temporal semiparametric econometric models lies at the core of our proposal. The aim is to provide flexible models of migration within Spain that capture the socio-economic characteristics of the points of origin and destination (as well as space-time trends), to be used as a tool to promote spatial equilibrium in labor markets within the country.

Objectives

WP1: Interpretability and fairness in predictive models

L1.1: Sparsity and dimension reduction in explainable models to improve interpretability

General objective: Development of methods to achieve sparsity in complex models via mixed model reparametrization of smooth functions, matheuristics for variable selection, quantiles, and bootstrap resampling.

L1.1.-OE1. Variable selection in Generalized Additive Models (GAMs) stated as a cardinality-constrained mathematical optimization model.
Responsible and Team: PI1, PI2, MN, PM, AC, VG
Execution Period: T1-T8
Previous Result: Carrizosa and Guerrero (2014), Currie and Durban (2002), Rodríguez-Álvarez et al. (2019), Laria et al. (2019), Palacio et al. (2021)
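
To make the idea of a cardinality constraint concrete, the following is a minimal sketch, not the project's method: it uses a plain linear model instead of a GAM, and exhaustive enumeration instead of a matheuristic. The function name `best_subset` and the synthetic data are assumptions introduced only for illustration. It solves min ||y - Xb||^2 subject to at most k nonzero coefficients.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, k):
    """Least squares restricted to the best support of size k.

    Enumerating supports of size exactly k is enough for the constraint
    "at most k nonzeros", because adding a column never increases the
    residual sum of squares.
    """
    best_support, best_rss = None, np.inf
    for support in combinations(range(X.shape[1]), k):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
        if rss < best_rss:
            best_support, best_rss = support, rss
    return best_support, best_rss


# Synthetic example (assumed data): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=100)
support, rss = best_subset(X, y, k=2)
```

The enumeration grows combinatorially with the number of candidate variables, which is one reason heuristic or optimization-based strategies are of interest for larger problems.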

L1.1.-OE2. Sparsity via quantiles as an alternative to PCA
Responsible and Team: PI2, CA, AM
Execution Period: T4-T8
Previous Result: Mendez-Civieta et al. (2022)

L1.1.-OE3. Variable selection via bootstrap resampling for support vector machines
Responsible and Team: SB, PI2
Execution Period: T2-T8
Previous Result: Benítez-Peña et al. (2019, 2020a)

L1.2: Surrogate models to enhance interpretability and improve fairness in black-box models.

General objective: To study the interpretability of neural networks through polynomials, using a Taylor expansion of the activation functions of a single-hidden-layer neural network and combinatorial properties to obtain a final polynomial that approximates the network as a whole. Also, to implement novel fair and interpretable algorithms by taking top classification methods from the literature (such as the SVM) as a basis and deriving their fair and/or interpretable versions by means of different mathematical optimization techniques.
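
The Taylor-expansion idea can be sketched on a toy case. The code below is an illustrative assumption, not the NN2Poly algorithm itself: it takes a network with one scalar input and three tanh hidden units, replaces tanh(z) by its third-order Taylor polynomial z - z^3/3 around zero, and collects the result into a single polynomial in the input. All weights and the names `network` and `surrogate_polynomial` are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical weights of a tiny network: 1 input, 3 tanh hidden units, 1 output.
w = [0.3, -0.2, 0.25]   # input-to-hidden weights
b = [0.1, -0.1, 0.05]   # hidden biases
v = [1.0, -0.5, 0.8]    # hidden-to-output weights
c = 0.2                 # output bias


def network(x):
    """Evaluate the original network at an array of scalar inputs."""
    x = np.asarray(x, dtype=float)
    hidden = np.tanh(np.outer(x, w) + np.asarray(b))
    return c + hidden @ np.asarray(v)


def surrogate_polynomial():
    """Build one polynomial in x by Taylor-expanding each hidden unit."""
    poly = Polynomial([c])
    for w_j, b_j, v_j in zip(w, b, v):
        z = Polynomial([b_j, w_j])           # pre-activation b_j + w_j * x
        poly = poly + v_j * (z - z**3 / 3)   # tanh(z) ~ z - z^3 / 3 near 0
    return poly
```

The approximation is only reliable while the pre-activations stay close to the expansion point, which is why small weights were chosen here; calling `surrogate_polynomial().coef` lists the coefficients that can then be read as an interpretable summary of the network.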

L1.2.-OE1. Surrogate models (via splines) for complex functions in Mixed Integer Non-linear Optimization
Responsible and Team: VG, PI1, CD, AD, PreDoc
Execution Period: T4 – T16
Previous Results: Navarro-García et al. (2023), D’Ambrosio et al. (2019)

L1.2.-OE2. Neural Networks Interpretability through polynomials
Responsible and Team: PI2, PI1, IU, AC, PM, SL
Execution Period: T1 – T8
Previous Result: Morala et al. (2021a, 2021b)

L1.2.-OE3. Novel fair and interpretable algorithms via mathematical optimization and/or penalizations
Responsible and Team: AM, PI2, SB, OO
Execution Period: T4-T12
Previous Result: Quijano et al. (2021), Rufino et al. (2023)

WP2: Methods for complex models in functional data

L2.1: New methods in Functional Regression

General objective: To develop new functional regression models for variable-domain functional data, and constrained models in which prior knowledge about the nature of the relationship between the covariates and the response can be included. We will build on the results of Gellar et al. (2014), but from a fully functional point of view, and on the novel definition of partial inner products between multidimensional bases in Masak et al. (2022). We will also explore the possible use of the results in Navarro-García et al. (2023) to estimate constrained functional models.

L2.1.-OE1. Functional regression models for partially observed functional data
Responsible and Team: CA, PI1, PH
Execution Period: T1-T10
Previous Result: Aguilera-Morillo et al. (2017), Durban and Aguilera-Morillo (2017), Aguilera-Morillo et al. (2013)

L2.1.-OE2. Constrained estimation in functional regression
Responsible and Team: PI1, VG, MN
Execution Period: T1 – T6
Previous Result: Navarro-García et al. (2023), Durban and Aguilera-Morillo (2017)

L2.2: New perspectives for Functional Principal Components

General objective: To introduce a novel functional PLS approach that can be seen as an extension of the FPCA developed by Lilla et al. (2016), an alternative to the penalized one-dimensional functional PLS, and a novel penalized functional PLS approach for 2D/3D domain functional data. Also, to develop a Functional Principal Component Analysis (FPCA) methodology for complex-valued functional data based on the general Karhunen-Loève representation, and a new and robust methodology inspired by FPCA that benefits from the use of quantile-based estimators.

L2.2.-OE1. PLS methodology applicable to functional data defined over complex domains
Responsible and Team: CA, PI1, PI2, HH, PH, LS, PostDoc
Execution Period: T1-T8
Previous Results: Aguilera et al. (2010), Aguilera et al. (2016)

L2.2.-OE2. Functional Principal Components Analysis (FPCA) methodology for complex-valued functional data
Responsible and Team: PI2, CA, HH, GB, PostDoc
Execution Period: T1-T6
Previous Results: Hernández-Roig et al. (2020, 2021)

L2.2.-OE3. Robust Functional Principal Component using quantile-based estimators
Responsible and Team: AM, PI2, JG
Execution Period: T1-T4
Previous Results: Mendez-Civieta et al. (2021, 2022)

L2.3: Massive functional data: challenges in classification and reliability

General objective: To develop new exploratory analyses using functional depths and custom-defined outlyingness indices. Special emphasis will be placed on the analysis of massive multivariate functional samples using the Fast Massive Unsupervised Outlier Detection (FastMUOD) indices, on functional data clustering based on the epigraph and hypograph indices, and on new approaches using FDA for real-life condition monitoring of engineering systems.

L2.3.-OE1. Clustering and Classification in Multivariate Functional Data based on indexes
Responsible and Team: PI2, OO, BP, FS
Execution Period: T1-T8
Previous Results: Pulido et al. (2023), Ojo et al. (2022)
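
As a rough illustration of the indices mentioned above, the sketch below computes epigraph and hypograph indices for curves observed on a common grid. It assumes one common definition (epigraph index: one minus the proportion of sample curves lying entirely on or above the curve; hypograph index: the proportion lying entirely on or below it); the function name and the toy data are hypothetical, and the exact variants used in the cited works may differ.

```python
import numpy as np


def epigraph_hypograph_indices(curves):
    """Return (epigraph, hypograph) indices for each row of `curves`.

    `curves` has shape (n_curves, n_points): each row is one curve
    evaluated on a grid shared by all curves.
    """
    curves = np.asarray(curves, dtype=float)
    # above[i, j] is True when curve j is >= curve i at every grid point.
    above = np.all(curves[None, :, :] >= curves[:, None, :], axis=2)
    # below[i, j] is True when curve j is <= curve i at every grid point.
    below = np.all(curves[None, :, :] <= curves[:, None, :], axis=2)
    epigraph = 1.0 - above.mean(axis=1)
    hypograph = below.mean(axis=1)
    return epigraph, hypograph


# Toy sample (assumed data): four constant curves at heights 0, 1, 2, 3.
toy_curves = np.tile(np.arange(4.0)[:, None], (1, 5))
ei, hi = epigraph_hypograph_indices(toy_curves)
```

Each curve is thus summarized by a pair of numbers, and a multivariate clustering method can then be applied to these summaries. The pairwise comparison uses memory proportional to n_curves squared times n_points, so a massive sample would need a blocked version.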

L2.3.-OE2. New approaches using FDA for real-life condition monitoring for engineering systems
Responsible and Team: PI2, CY, GB, OO
Execution Period: T1-T12
Previous Results: New topic for the team.

WP3: Models and tools for policy making

L3.1: Smooth models for demographic change

General objective: The aim is to develop models for detecting inequalities in mortality rates disaggregated by different factors, and to construct models to understand the spatial structure of migration. We will build on Camarda (2019) and use Lagrange multipliers to achieve coherent predictions, and we will use semiparametric gravity models as an initial proposal for detecting spatial patterns in internal migrations in Spain.

L3.1.-OE1. Constrained smooth methods for joint modelling of sub-populations
Responsible and Team: PI1, VG, MN, CC
Execution Period: T3-T12
Previous Results: Currie et al. (2004), Camarda (2019), Navarro-García et al. (2023)

L3.1.-OE2. Spatio-temporal gravity models
Responsible and Team: PI1, RB, MC
Execution Period: T8-T16
Previous Results: Basile et al. (2021), Lee and Durban (2011)
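
For readers unfamiliar with gravity models, the sketch below fits the classical fully parametric form, flow = G * P_origin^alpha * P_dest^beta / distance^gamma, by ordinary least squares on the log scale. This is a deliberately simplified baseline and not the semiparametric spatio-temporal model proposed here; the function name, the synthetic data and the parameter values are assumptions made for illustration.

```python
import numpy as np


def fit_log_gravity(flow, pop_origin, pop_dest, distance):
    """Fit log(flow) = log G + alpha*log(P_o) + beta*log(P_d) - gamma*log(d).

    Returns the coefficient vector [log G, alpha, beta, -gamma].
    """
    design = np.column_stack([
        np.ones_like(flow),
        np.log(pop_origin),
        np.log(pop_dest),
        np.log(distance),
    ])
    coef, *_ = np.linalg.lstsq(design, np.log(flow), rcond=None)
    return coef


# Noise-free synthetic flows (assumed data) generated from known parameters.
rng = np.random.default_rng(1)
pop_origin = rng.uniform(1e4, 1e6, size=200)
pop_dest = rng.uniform(1e4, 1e6, size=200)
distance = rng.uniform(10.0, 800.0, size=200)
flow = 0.01 * pop_origin**0.8 * pop_dest**0.6 / distance**1.5
coef = fit_log_gravity(flow, pop_origin, pop_dest, distance)
```

A semiparametric version would replace some of these fixed log-linear terms with smooth functions, for example of distance, space or time, which is where the flexibility discussed in the summary comes from.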

L3.2: Models and tools for monitoring inequalities and social behavior

General objective: To use network science modelling, data analytics, and computational techniques to study the dynamics of digital collaborative networks and how they can be used to respond to time-critical threats, including natural disasters, pandemics and other emergencies such as cyberattacks. Also, to develop more accurate representations of human behavior.

L3.2.-OE1. Models and computational tools for the analysis of the behavior of networked and open crowdsourced responses to time-critical threats
Responsible and Team: MC, PI2, PI1, IU, AC, PostDoc
Execution Period: T1-T16
Previous Results: Martin-Corral et al. (2022), Waniek et al. (2022), Cebrián et al. (2021, 2013, 2012)

L3.2.-OE2. Modelling inequality via the creation of a “Rich Census”
Responsible and Team: IU, EM, PM
Execution Period: T1-T16
Previous Results: Berke et al. (2022), Su et al. (2022), Althobaiti et al. (2021), Hunter et al. (2021)

WP4: Software development

General objective: Update and improve the software packages developed by the team in the NEREPE project and implement at least five packages during the execution of ASMOCS. The following is a brief outline of the packages that are already publicly available and those that are intended to be developed.

asgl: Python package that solves several linear and quantile regression related models for simultaneous variable selection and prediction, in low and high dimensional frameworks. This package is directly related to L1.1.
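
The quantile regression models mentioned above rest on the check (pinball) loss. The sketch below does not use the asgl API; it is a standalone illustration, with hypothetical function names, showing that minimizing the pinball loss over a constant recovers an empirical quantile of the sample.

```python
import numpy as np


def pinball_loss(residuals, tau):
    """Sum of check losses: u * (tau - 1[u < 0]) for each residual u."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(residuals * (tau - (residuals < 0))))


def empirical_quantile_by_loss(sample, tau):
    """Pick the sample value that minimizes the pinball loss at level tau."""
    sample = np.asarray(sample, dtype=float)
    losses = [pinball_loss(sample - candidate, tau) for candidate in sample]
    return float(sample[int(np.argmin(losses))])


# Toy data with one large outlier: the median is unaffected by it.
toy_sample = [1.0, 2.0, 3.0, 4.0, 100.0]
median_estimate = empirical_quantile_by_loss(toy_sample, tau=0.5)
```

In a regression setting the constant is replaced by a linear predictor and a penalty term is added to the same loss, which is what allows variable selection and quantile estimation to happen simultaneously.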

glasp: R package that implements the Group Linear Algorithm with Sparse Principal decomposition, an algorithm for supervised variable selection and clustering. The glasp method provides a unified implementation for models including, but not limited to, linear regression, logistic regression, and proportional hazards models with right-censoring. This package is directly related to L1.1.

qpca: Python package for the implementation of a robust, quantile-based alternative to traditional principal component analysis. This package is directly related to L1.1.-OE2.

metrics-computation: Python package for the computation of different evaluation metrics on the coefficients of linear regression models. This package is directly related to L1.1.

data-generation: Python package for the generation of synthetic (high dimensional) datasets used in the evaluation of model performance. This package is directly related to L1.1.

fpqr: Python package for the usage of dimension reduction techniques based on a quantile PLS. This package is directly related to L1.1.

fqpca: R package for the usage of quantile-based robust alternatives to Functional Principal Components. This package is directly related to L2.2.-OE3.

pen-fplsr: R package for the implementation of different versions of penalized functional partial least squares regression for 1D and 2D functional data. This package is directly related to L2.2.-OE1.

ehyclus: R package for the implementation of clustering techniques for functional data in one and more dimensions based on the epigraph and the hypograph indexes. To be developed. This package is directly related to L2.3-OE1.

nn2poly: R package implementing the NN2Poly method for interpretability of neural networks by means of polynomials. Future implementation in Python is also expected. This package is directly related to L1.2.

fdaoutlier: R package implementing various outlier detection techniques for functional data. The implemented methods work for univariate and multivariate functional data. Future development and updates to the package are expected. This package is directly related to L2.3.

cpsplines: Python package to perform constrained regression under shape constraints on the component functions of the dependent variable. Future development and updates to the package are expected. This package is directly related to L3.1.-OE1.

ambss: Python package implementing a matheuristic to solve the best subset problem in GAMs. To be developed. This package is directly related to L1.1-OE1.

Responsible and Team: All members of the research team and the work team, except the foreign members, are involved in this WP.
Execution Period: T1-T16
Previous Results: The packages described in this WP

WP5: Solving real problems in

General objective: As explained in the justification of the proposal, ASMOCS falls under «oriented research», which means that part of the work will be devoted to applying the methodology and software developed in the four previous WPs to real problems arising from the team’s collaborations with companies and institutions. The main application lines (AL) and their relationship with the ODS (Sustainable Development Goals) are as follows:

AL.1: Biomedicine (Thematic area: Health and ODS3).

AL1.1. Oncology
Responsible and Team: PI2, CA, AM
Specific related objectives: L1.1.-OE2, L1.2.-OE3.
Interested companies and institutions: Gregorio Marañón Hospital.
Execution Period: T1-T16

AL1.2. Spectroscopy signals
Responsible and Team: CA, PI2, HH, PostDoc
Specific related objectives: L2.2.-OE2
Interested companies and institutions: Instituto de Investigación Sanitaria Fundación Jiménez Díaz, Madrid, and Servicio de Endocrinología y Nutrición, Hospital Ramón y Cajal, Madrid.
Execution Period: T1-T6

AL1.3. Neuroimages
Responsible and Team: PI2, CA, HH, BP, LS
Specific related objectives: L2.2.-OE1
Interested companies and institutions: Politecnico di Milano, Italy
Execution Period: T1-T3

AL1.4. Wearable devices and biomedical problems
Responsible and Team: PI1, AM, CA, PH, JG
Specific related objectives: L2.2.-OE3
Interested companies and institutions: Mailman School of Public Health, Galdakao University Hospital, Bilbao
Execution Period: T1-T12

AL.2: Industry and Finance (Thematic areas: Digital world, industry, space and defense, ODS9, ODS12).

AL2.1. Defense and aerospace applications
Responsible and Team: PI2, CY
Specific related objectives: L2.3.-OE1, L2.3.-OE2
Interested companies and institutions: Integral Innovation Solutions
Execution Period: T1-T4 and T13-T16

AL2.2. Financial decisions
Responsible and Team: PI2, PI1, SB, AM, VG, OO
Specific related objectives: L1.1.-OE2
Interested companies and institutions: Banco Santander, Universia, TCC, El Corte Inglés.
Execution Period: T1-T16

AL2.3. Electricity production
Responsible and Team: VG, PI1, CD, AD, PreDoc
Specific related objectives:
Interested companies and institutions: Electricité de France (EDF)
Execution Period: T11 – T12

AL.3: Social Change (Thematic areas: Culture, creativity and inclusive society. Civil security for society, ODS5, ODS10).

AL3.1. Using data to fight against inequalities
Responsible and Team: IU, EM
Specific related objectives: L3.2.-OE2.
Interested companies and institutions: Orange Innovation.
Execution Period: T1-T16

AL3.2. Understanding social problems (pandemics, cyberwarfare, disinformation campaigns, migrations) arising from the use of new technologies
Responsible and Team: MC, PI1, IU, RB
Specific related objectives: L3.2.-OE1.
Interested companies and institutions: UNICEF, WHO, the Spanish National Council for AI
Execution Period: T1-T16

AL3.3. Demographic challenge
Responsible and Team: PI1, VG, CC.
Specific related objectives: L3.1.-OE1, L3.1.-OE2.
Interested companies and institutions: Mutualidad de la Abogacía, Instituto de Actuarios de España, Institut National d’Etudes Démographiques
Execution Period: T3-T16

TRANSMODEL

Project PDC2022-133359-I00, funded by MICIU/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU/PRTR”.

Summary

In recent years we have seen an explosion in the amount of information available from all kinds of sources: weather data, sports data, economic indicators, medical data, genetic data, texts, and more. Advances in data collection technologies have made it a difficult challenge to extract information from increasingly large and complex datasets. Being able to analyze these data rigorously and draw useful conclusions has become one of the most sought-after goals of the moment, which highlights the great importance of research work in statistics and machine learning. In this sense, the reference project that gives rise to this Proof of Concept (PID2019-104901RB-I00: New strategies in penalized regression with applications in health, demography and economics), with acronym NEREPE, has produced relevant methodological advances across the seven general objectives described in the project report. However, the development of new statistical methodologies in line with current challenges is only a first step.

This proposal encompasses two main objectives that should evolve after this first step: (1) the development and improvement of free software packages that give access to all users interested in the state-of-the-art statistical methodologies developed throughout the NEREPE project, and (2) the work on different use cases that demonstrate the feasibility and usefulness of these methodologies.

Objectives

OG1 - Development of prototypes in the reference ecosystems for statistics and machine learning

OG1E1: Efficient and scalable implementation of the NN2Poly method.

OG1E2: Development of a public interface for the cpsplines library.

OG1E3: Generalization of the cpsplines library to the multidimensional case.

OG1E4: Scalable implementation of variable selection in penalized regression models.

OG1E5: Improved efficiency of the implementation of the asgl library.

OG1E6: Improved efficiency and portability of the implementation of the fpqr library.

OG2 - Knowledge transfer of a social and business nature

OG2E1: Application of the NN2Poly method to urban mobility data.

OG2E2: Application of the variable selection method in penalized regression models to sustainable fishing.

OG2E3: Application of the fpqr methodology in the financial domain.

OG2E4: Application of quantile regression techniques to the medical genetics domain.

OG2E5: Interpretability and transfer.

OG2E6: Predictive maintenance using multivariate data analysis.
