Research topics
Multi-truth data fusion
In the big data era, beyond the sheer volume of data, we can also rely on a large number of sources that provide information about overlapping data items. Data integration aims to provide uniform access to data by aligning different data sources and resolving schema and format heterogeneity, representational ambiguity, and value conflicts. Data fusion is a crucial step of this process: it resolves the value conflicts that arise when sources provide different values for the same data item. In the multi-truth scenario, a data item can have more than one correct value (e.g., the authors of a book), and the number of correct values varies from item to item. We introduce the concept of source authority to enhance algorithms that assess the trustworthiness of data sources. We also study how embeddings can improve the performance of data fusion algorithms on values composed of long texts.
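To make the multi-truth setting concrete, the following is a minimal sketch of a trust-weighted voting baseline. It is not the authority-based method studied in our work: the function name `fuse_multi_truth`, the fixed trust weights, the 0.5 threshold, and the toy claims are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fuse_multi_truth(claims, trust, threshold=0.5):
    """Trust-weighted voting baseline for multi-truth fusion (illustrative only).

    claims: {source: {item: set_of_values}}
    trust:  {source: weight}, assumed to be given (hypothetical values)
    A value is accepted when the trust-weighted share of the sources that
    cover the item and claim that value exceeds `threshold`.
    """
    support = defaultdict(lambda: defaultdict(float))  # item -> value -> weight
    total = defaultdict(float)                         # item -> total weight
    for source, items in claims.items():
        weight = trust.get(source, 1.0)
        for item, values in items.items():
            total[item] += weight
            for value in values:
                support[item][value] += weight
    return {
        item: {v for v, s in votes.items() if s / total[item] > threshold}
        for item, votes in support.items()
    }


# Toy example: three sources disagree on the authors of two books.
claims = {
    "s1": {"book1": {"Alice", "Bob"}, "book2": {"Carol"}},
    "s2": {"book1": {"Alice", "Bob"}, "book2": {"Carol", "Dan"}},
    "s3": {"book1": {"Alice", "Eve"}, "book2": {"Carol"}},
}
trust = {"s1": 0.9, "s2": 0.8, "s3": 0.4}  # hypothetical trust scores

fused = fuse_multi_truth(claims, trust)
print(fused)
```

In this toy run the accepted set has two values for `book1` and one for `book2`, which illustrates that the cardinality of the truth is not fixed in advance but depends on the item.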
Big data and Healthcare
The digitization of healthcare processes generates a very large amount of medical data, in both structured and unstructured formats. Extracting knowledge and wisdom from Electronic Health Records (EHR) is challenging in many respects and requires a multidisciplinary approach. From the data science point of view, we study how incident reports can be analyzed and compared with real-time sensor data to reduce injuries during working activities, how to extract medical concepts from Italian texts, and how to assess the performance of healthcare data lakes. We also investigate new methods to handle ECGs and DICOM medical images in data lakes.
Data lakes and Metadata
Data lakes can store and process raw data (without any preprocessing) in different formats. In this scenario, directly accessing raw datasets is expensive in terms of complexity and time. Therefore, extracting and exploiting metadata from differently structured datasets is paramount. We develop new techniques to extract metadata from unstructured data (e.g., texts and images) in specific domains and study how data lakes can benefit from this information.
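As a simple illustration of why metadata helps, the sketch below derives a uniform metadata record from two differently structured datasets without requiring users to open the raw data afterwards. It covers only basic structural metadata for CSV and JSON, not the domain-specific extraction from texts and images mentioned above; the function names, the record fields, and the sample data are assumptions made for the example.

```python
import csv
import io
import json


def csv_metadata(text):
    """Structural metadata for a CSV dataset whose first row is a header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {"format": "csv", "columns": header, "num_records": len(body)}


def json_metadata(text):
    """Structural metadata for a JSON dataset assumed to be a list of objects."""
    records = json.loads(text)
    columns = sorted({key for record in records for key in record})
    return {"format": "json", "columns": columns, "num_records": len(records)}


# Hypothetical raw datasets, as they might be stored in a data lake.
raw_csv = "patient_id,heart_rate\n1,72\n2,85\n"
raw_json = '[{"patient_id": 1, "note": "stable"}, {"patient_id": 2}]'

catalog = [csv_metadata(raw_csv), json_metadata(raw_json)]
for entry in catalog:
    print(entry)
```

With such records in a catalog, a query like "which datasets contain `patient_id`" can be answered from the metadata alone.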
Publications
* = authors listed in alphabetical order.
- [link] P. Reali, A. Carotenuto, D. Piantella, L. Tanca, P. Plebani, M. G. Signorini. Development of Data Ingestion Pipelines for the Federated Use of Biomedical Data in Research: The Health Big Data Project, MELECON 2024
- [link] D. Piantella, P. Reali, P. Kumar, L. Tanca. A Minimum Metadataset for Data Lakes Supporting Healthcare Research, SEBD 2024
- [link] F. Azzalini, D. Piantella, E. Rabosio, L. Tanca. Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal 32, 3 (2023) *
- [link] M. Bavaro, T. Dolci, D. Piantella. BioVec-Ita: biomedical word embeddings for the Italian language, SEBD 2023 *
- [link] T. Covioli, T. Dolci, F. Azzalini, D. Piantella, E. Barbierato, M. Gribaudo. Workflow characterization of a big data system model for healthcare through multiformalism, EPEW 2023
- [link] D. Piantella. A Research on Data Lakes and their Integration Challenges, Doctoral Consortium SEBD 2022
- [link] P. Agnello, S. M. Ansaldi, F. Azzalini, G. Gangemi, D. Piantella, E. Rabosio, L. Tanca. Extraction of Medical Concepts from Italian Natural Language Descriptions, SEBD 2021 *
- [link] P. Agnello, S. M. Ansaldi, E. Lenzi, A. Mongelluzzo, D. Piantella, M. Roveri, F. A. Schreiber, A. Scutti, M. Shekari, L. Tanca. RECKOn: a REal-world, Context-aware KnOwledge-based lab, SEBD 2021 *
- [link] F. Azzalini, D. Piantella, L. Tanca. Data fusion with source authority and multiple truth, SEBD 2019 *
Community service
Program Committees
- 2021 – present Italian Symposium on Advanced Database Systems (SEBD)
- 2024 International Conference on Information and Knowledge Management (CIKM)
Reviews
- 2024 Health Informatics Journal
- 2024 International Conference on Information and Knowledge Management (CIKM)
- 2024 Italian Symposium on Advanced Database Systems (SEBD)
- 2023 International Conference on Data Engineering (ICDE)
- 2023 Italian Symposium on Advanced Database Systems (SEBD)
- 2022 Italian Symposium on Advanced Database Systems (SEBD)
- 2021 International Conference on Very Large Data Bases (VLDB)
- 2021 International Conference on Data Engineering (ICDE)
- 2021 International Conference on Extending Database Technology (EDBT)
- 2021 Italian Symposium on Advanced Database Systems (SEBD)
- 2020 International Conference on Data Engineering (ICDE)