Assignment 2

Dates and rules.

Please read the following guidelines carefully:

  • You can do this assignment in groups of 2.
  • The assignment must be submitted as a .zip file, sent as an attachment by email to a.lamurias@fct.unl.pt from the official FCT email address of one of the group members, with "TIAB TP2" as the subject. Please do not use other compression methods or other email addresses.
  • The file name should be X_Y.zip, where X and Y are the student numbers of the group members, or X.zip if there is only one student.
  • The archive must contain the following:

    • TP2.txt: the questions file with your answers filled in.
    • tp2.py: the script to analyze the documents and obtain your results. You can also send a Python notebook file instead.
    • Any additional .py modules that you created, if you wish to split your code into different modules.
    • Any .png and .html files you may wish to include as reports on your results. These should be linked in the questions and answers file TP2.txt (see the instructions in the file).
    • The .txt files you used to obtain your annotations.
  • For each student, only the last email sent before the deadline will be counted. So you can change the version of the assignment submitted simply by resending your assignment before the deadline.
  • If, for some reason, you want to withdraw your assignment simply send an email with the word WITHDRAW (in all caps) before the deadline. This is necessary only if you do not want to submit your assignment. It is not necessary to do this if you just want to replace your previous submission with a more recent version.
  • The deadline for submitting assignment 2 is 10th May 23:59. There will be a tolerance period of 48h, ending on 12th May 23:59, but this period should be used only for correcting any problems with the submission.
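
If you prefer to build the archive from Python rather than by hand, the naming and packaging rules above can be sketched as follows. This is only an illustration: the function name build_submission, its parameters, and the student numbers in the usage comment are hypothetical, and you should still check the resulting archive before sending it.

```python
import zipfile
from pathlib import Path


def build_submission(student_numbers, base_dir, relative_files, out_dir="."):
    """Create X_Y.zip (or X.zip) from files listed relative to base_dir.

    Hypothetical helper, not part of the provided TP2.zip material.
    """
    # File name: student numbers joined by "_", e.g. 11111_22222.zip
    archive = Path(out_dir) / ("_".join(str(n) for n in student_numbers) + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for rel in relative_files:
            # arcname keeps the path relative to base_dir inside the archive
            zf.write(Path(base_dir) / rel, arcname=rel)
    return archive


# Example call (placeholder student numbers and file list):
# build_submission([11111, 22222], ".", ["TP2.txt", "tp2.py"])
```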

    Download this zip file: TP2.zip. Extract it to your working folder and do not change the directory structure.

    The TP2.zip archive contains the following files:

    TP2.txt
    This is the questions and answers file you must fill out before submitting your assignment.
    tp2.ipynb
    This is the Python 3.x notebook that can be used to run your assignment. It contains the code used in the last tutorial. You can write the necessary code in this file, as well as comments describing what each cell is doing.

Description

The goal of this assignment is to extract the entities of a corpus and analyze them using an ontology, with a pipeline developed in Python.

We recommend that you use documents from the CRAFT corpus, although you can also use a set of documents that is relevant to you.

Choose 5 to 10 documents to process with your pipeline.

In this assignment you will need to complete these tasks:

NER and Entity Linking on full-text scientific papers
Apply the pipeline shown at the end of Lecture 12 to a set of scientific papers, either from the proposed corpus or from some other collection, to extract entities and link them to Gene Ontology terms.
Analyze the results obtained
Calculate the most common entities in each document and across all documents, and the frequency of all entities found in the corpus.
Compare terms by embedding similarity
Calculate the cosine similarities between the identified entities and use them to analyze the texts. You can compare the average similarity of the entities across all documents and within each document, and look into the most similar entity pairs.
Calculate the semantic similarity between the entities found
Use a semantic similarity measure calculated on the Gene Ontology to compare the same entities, and identify cases where this similarity differs from the cosine similarity.
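
To make the three analysis tasks above concrete, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: the term identifiers (TOY:1 to TOY:5), the 3-dimensional embeddings, and the is_a hierarchy are hypothetical and do not come from the Gene Ontology or from the tutorial notebook. The ontology-based score used here is a simple ancestor-overlap (Jaccard) ratio, chosen only because it is short to write; it is not necessarily the measure you are expected to use.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical linked entities per document (output of NER + entity linking).
doc_entities = {
    "doc1": ["TOY:3", "TOY:4", "TOY:3"],
    "doc2": ["TOY:4", "TOY:5", "TOY:3"],
}

# Hypothetical embeddings, one vector per linked term.
embeddings = {
    "TOY:3": [1.0, 0.0, 0.0],
    "TOY:4": [1.0, 1.0, 0.0],
    "TOY:5": [0.0, 0.0, 1.0],
}

# Hypothetical is_a hierarchy: term -> list of direct parents.
parents = {
    "TOY:1": [],
    "TOY:2": ["TOY:1"],
    "TOY:3": ["TOY:2"],
    "TOY:4": ["TOY:2"],
    "TOY:5": ["TOY:1"],
}


def entity_counts(docs):
    """Return per-document counters and one counter over all documents."""
    per_doc = {doc: Counter(terms) for doc, terms in docs.items()}
    overall = sum(per_doc.values(), Counter())
    return per_doc, overall


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ancestors(term):
    """The term itself plus every term reachable through is_a parents."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ancestor_overlap(a, b):
    """Shared ancestors divided by all ancestors of either term."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


per_doc, overall = entity_counts(doc_entities)
print(overall.most_common(2))

# Print both scores side by side to spot pairs where they disagree.
for a, b in combinations(sorted(embeddings), 2):
    print(a, b,
          round(cosine(embeddings[a], embeddings[b]), 3),
          round(ancestor_overlap(a, b), 3))
```

In this toy setup TOY:3 and TOY:5 have orthogonal embeddings but still share the root term, which is the kind of disagreement between the two measures that the last task asks you to look for.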

Instructions

Answer the questions in the txt file, using results obtained with your code. You can use plots and tables to show your results.

You can add html files with tables. Use this website to generate the html code from an excel table, which you can paste into an html file and mention in the answers file.
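
As an alternative to converting a spreadsheet, a table can also be written directly from your Python results. The sketch below uses only the standard library; the function names rows_to_html and write_table and the example file name are hypothetical, and the output is a bare table with no styling.

```python
from html import escape


def rows_to_html(header, rows):
    """Build a minimal HTML table string from a header and a list of rows."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def write_table(path, header, rows):
    """Write the table to an .html file that can be named in TP2.txt."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(rows_to_html(header, rows))


# Example call (placeholder file name and values):
# write_table("entity_counts.html", ["Entity", "Count"], [["TOY:3", 3]])
```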

In your answers you can link .png and .html files by simply writing the name of the file on a separate line. See the instructions in the TP2.txt file.