Cross-Lingual Transfer Learning for Biomedical Texts
Project reference: 2105
Biomedical and Life Sciences are two of the main areas that have seen considerable growth in literature over the last decade, as demonstrated by the increase in articles indexed in PubMed (a database of biomedical articles).
One BioNLP task that has received increasing attention is the BioASQ challenge, in which participants have to index abstracts with multiple labels (i.e., multilabel text classification). The performance of the proposed systems has increased considerably over the baselines and over the current system used by the National Library of Medicine (NLM).
However, this shared task considers only the abstracts of English articles and does not cover other languages with a considerable volume of academic writing, such as Portuguese, Spanish, and French.
The goal of this research project is to provide a classification prototype, in the form of a BioASQ submission, that can take Spanish-language text as input. Given the extensive use of deep learning architectures, the student will benefit from learning how to perform experiments and how to write code suitable for running on a single GPU and on multiple GPUs.
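As a rough illustration of the multilabel indexing task described above, the sketch below turns per-label scores into a label set and evaluates the predictions with micro-averaged F1, a metric commonly used for multilabel classification. The function names (`select_labels`, `micro_f1`), the 0.5 threshold, and the example labels and scores are assumptions made for illustration only; they are not taken from BioASQ or from the project itself.

```python
from typing import Dict, List, Set


def select_labels(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Keep every label whose score reaches the (assumed) threshold."""
    return {label for label, score in scores.items() if score >= threshold}


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives over all documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model scores for two abstracts.
gold_labels = [{"Humans", "Neoplasms"}, {"Mice"}]
model_scores = [
    {"Humans": 0.9, "Neoplasms": 0.4, "Mice": 0.1},
    {"Mice": 0.8, "Humans": 0.6},
]
predictions = [select_labels(s) for s in model_scores]
score = micro_f1(gold_labels, predictions)
```

In a real system the scores would come from a trained classifier, and the threshold would typically be tuned on held-out data rather than fixed.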
Project Mentor: Marta Villegas
Project Co-mentor: Maite Melero Nogues
Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate
The student will benefit from learning about deep learning architectures with the BSC text mining group. The student will also learn how to code and run experiments on GPUs, and how to coordinate jobs across multiple GPUs (parallel GPU training).
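The idea behind parallel GPU training can be sketched without any GPU at all. In data-parallel training, a batch is split into shards, each worker computes a gradient on its own shard, and the gradients are averaged before the model is updated. The pure-Python sketch below shows this for a one-parameter linear model; the function names, the toy data, and the choice of a squared-error loss are all assumptions for illustration, and real projects would normally rely on a framework's own facilities instead.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def gradient(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch: List[Example], n_workers: int) -> List[List[Example]]:
    """Split a batch into n_workers contiguous, nearly equal shards."""
    size, remainder = divmod(len(batch), n_workers)
    shards, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < remainder else 0)
        shards.append(batch[start:end])
        start = end
    return shards


def data_parallel_gradient(w: float, batch: List[Example], n_workers: int) -> float:
    """Average per-shard gradients, weighted by shard size.

    Each list element stands in for one worker (e.g. one GPU); the
    weighted sum plays the role of the gradient all-reduce step.
    """
    shards = [s for s in shard(batch, n_workers) if s]  # skip empty shards
    return sum(gradient(w, s) * len(s) for s in shards) / len(batch)


# Toy data, invented for this example.
toy_batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
full = gradient(0.5, toy_batch)
parallel = data_parallel_gradient(0.5, toy_batch, n_workers=2)
```

Because the shard gradients are weighted by shard size, the combined gradient matches the single-worker gradient up to floating-point rounding, which is why data parallelism can speed up training without changing the result of each update.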
Student Prerequisites (compulsory):
Knowledge about Python
Knowledge about text processing
Student Prerequisites (desirable):
Knowledge of any deep learning framework
Previous experience with data science projects
Week 2 – Research bibliography (student 1 monolingual, student 2 multilingual)
Week 3 – Plan/Schedule for the project with a conceptual model sketch for both students
Week 4 – Monolingual proof of concept (student 1) and multilingual proof of concept (student 2)
Week 5 – Benchmarking against the state of the art (students 1 and 2)
Week 6 – Improvement of the proofs of concept and large-scale experimentation (students 1 and 2)
Week 7 – Benchmarking against the state of the art (students 1 and 2)
Week 8 – Final Report (students 1 and 2)
Final Product Description:
The expected outcome is a proof of concept of a classifier for scientific articles written in English and, preferably, in another language. The second outcome is the tailoring of current solutions to an HPC environment.
Adapting the Project: Increasing the Difficulty:
The project could be adapted by expecting the student to identify sections in the scientific articles that are more relevant for the classification, thus providing explainability capabilities.
Adapting the Project: Decreasing the Difficulty:
The student could perform the experiments only on the English language, which already has a vast number of baselines and resources available.
GPUs, PyTorch, TensorFlow and MKL libraries on HPC clusters.
Training data, which can be supplied by the hosting group.
*Online only in any case
BSC – Barcelona Supercomputing Center