# Research Activity Identifier (RAiD)
A Research Activity is intended as a group of research products belonging to the same scientific effort.
The workflow that drives the creation of RAiD entities from the OpenAIRE Graph consists of four stages. This project implements several alternatives for each stage and allows the inference to be configured through a JSON file, depending on the use-case scenario.
## Workflow stages

### 1. Documents creation (`SparkCreateDocuments.java`)
This job transforms the graph entities involved in the RAiDs into documents, using both entity attributes and graph properties (i.e. semantic relationships). Each document is composed of the following words (labels):
- The research product language
- The research product subjects (i.e. the FOS subjects, to have a curated view)
- The research product country
- The project producing the research product
- The organizations involved in the research product
- A label describing the graph clique of the projects (i.e. research products in the same clique obtained after the mesh closure of the relationships with projects share this label)
- A label describing the graph clique of the organizations (i.e. research products in the same clique obtained after the mesh closure of HasAuthorInstitutionOf relationships - or affiliations)
- A label describing the graph clique of the versions (i.e. research products in the same clique obtained after the mesh closure of isVersionOf relationships)
- A label describing the graph clique of the parts (i.e. research products in the same clique obtained after the mesh closure of HasPart relationships)
- A label describing the graph clique of the supplements (i.e. research products in the same clique obtained after the mesh closure of isSupplementedBy relationships)
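All of the clique labels above rely on the same operation: closing the mesh of a relationship (e.g. isVersionOf) so that every product in the resulting connected component receives the same label. A minimal sketch of that closure with a union-find structure in plain Java (this is not the project's Spark implementation; all identifiers are illustrative):

```java
import java.util.*;

// Union-find over product ids: after adding all relationship edges,
// products in the same connected component share one representative,
// which can serve as the clique label in the document.
class CliqueLabels {
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);       // path compression
            parent.put(x, p);
        }
        return p;
    }

    void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    String labelOf(String x) {
        return "clique::" + find(x);
    }

    public static void main(String[] args) {
        CliqueLabels cliques = new CliqueLabels();
        // hypothetical isVersionOf edges between product ids
        cliques.union("prod1", "prod2");
        cliques.union("prod2", "prod3");
        cliques.union("prod4", "prod5");
        System.out.println(cliques.labelOf("prod1").equals(cliques.labelOf("prod3"))); // true
        System.out.println(cliques.labelOf("prod1").equals(cliques.labelOf("prod4"))); // false
    }
}
```

The transitive step matters: `prod1` and `prod3` are never directly related, yet they end up with the same label because the mesh closure connects them through `prod2`.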
An alternative to this step is implemented in `SparkRandomWalks.java`, where no labels are used; instead, random walks are computed on the graph.
### 2. Embeddings creation (`SparkEmbeddings.java`)
This job transforms the documents into vectors. The vector generation is performed with the Word2Vec implementation of Apache Spark, configured to produce 128-dimensional embeddings. The more similar two documents are (i.e. the more labels they share), the closer (in Euclidean distance) their embeddings lie in the 128-dimensional space.
An alternative to this step is implemented in `SparkNodeEmbeddings.java`, which is meant to be used when the input consists of random walks.
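The closeness criterion can be illustrated with plain Euclidean distance (a toy example with 4-dimensional vectors standing in for the 128-dimensional embeddings; not the project's code):

```java
// Euclidean distance between two embedding vectors: the smaller the
// distance, the more labels the underlying documents share.
class EmbeddingDistance {
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // toy 4-dimensional embeddings (the real ones are 128-dimensional)
        double[] docA = {0.1, 0.2, 0.3, 0.4};
        double[] docB = {0.1, 0.2, 0.3, 0.5}; // shares almost all labels with docA
        double[] docC = {0.9, 0.8, 0.1, 0.0}; // very different labels
        System.out.println(euclidean(docA, docB) < euclidean(docA, docC)); // true
    }
}
```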
### 3. Proximity clustering (`SparkProximityClustering.java`)
This job performs an approximate similarity join of the embeddings to draw similarity relationships between embeddings whose Euclidean distance is lower than 0.1. These similarity relationships are then filtered, keeping only those between products that share at least 3 authors and whose dates fall within a span of 3 years. Once the final set of similarity relationships is available, the meshes are closed and the Raw RAiD entities are created. The idea behind the approximate similarity join is to replace the full cross-join with a hash-based join: only entities falling in the same hash bucket are compared, which avoids comparing every entity with every other.
The library also provides a set of clustering functions based on a partitioned DBSCAN algorithm.
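The bucketing idea behind the approximate join can be sketched in plain Java with a single random-hyperplane hash: vectors are bucketed by the sign of their projection, and only pairs sharing a bucket are compared against the 0.1 threshold. This is an illustrative simplification (one hyperplane instead of real hash tables, and without the author/date filtering), not the project's Spark code:

```java
import java.util.*;

// Approximate similarity join sketch: bucket vectors by the sign of their
// projection onto a fixed hyperplane, then compare only pairs that share
// a bucket, keeping those below the distance threshold.
class BucketedJoin {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // The sign of the projection decides the bucket.
    static int bucket(double[] v, double[] hyperplane) {
        double dot = 0.0;
        for (int i = 0; i < v.length; i++) dot += v[i] * hyperplane[i];
        return dot >= 0 ? 1 : 0;
    }

    static List<int[]> similarPairs(double[][] vectors, double threshold, double[] hyperplane) {
        Map<Integer, List<Integer>> buckets = new HashMap<>();
        for (int i = 0; i < vectors.length; i++)
            buckets.computeIfAbsent(bucket(vectors[i], hyperplane), k -> new ArrayList<>()).add(i);
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> members : buckets.values())
            for (int i = 0; i < members.size(); i++)
                for (int j = i + 1; j < members.size(); j++)
                    if (euclidean(vectors[members.get(i)], vectors[members.get(j)]) < threshold)
                        pairs.add(new int[]{members.get(i), members.get(j)});
        return pairs;
    }

    public static void main(String[] args) {
        // toy 2-dimensional embeddings; the 0.1 threshold mirrors the text
        double[][] vectors = {{0.10, 0.10}, {0.15, 0.10}, {-0.90, -0.80}};
        List<int[]> pairs = similarPairs(vectors, 0.1, new double[]{1.0, 1.0});
        for (int[] p : pairs) System.out.println(p[0] + " ~ " + p[1]); // prints "0 ~ 1"
    }
}
```

The third vector lands in a different bucket, so it is never compared with the first two: that is exactly the saving the hash-based join provides over an all-pairs comparison.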
### 4. AI generation (`SparkEntityInference.java`)
This job transforms the Raw RAiD entities into the final version of the RAiD entities. Once the Raw RAiD entities have been prepared, an LLM generates the title and the description of each entity, taking as input the list of titles and descriptions of the products in the Raw RAiD entity. The generation is performed with the following prompt:
> You are an assistant aimed at generating a JSON with title and description for a research activity based on the given article titles and descriptions. Format the response in JSON like this: {"title": "a title of max 10 words for the research activity", "description": "a brief abstract of min 40 and max 150 words that describes the research activity"} Make sure to fill the JSON fields with the required information, and not to exactly repeat the title in the description.
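The input that accompanies this prompt can be assembled by concatenating the titles and descriptions of the products in the Raw RAiD entity. A hypothetical sketch (the `build` helper and its field layout are illustrative, not the project's API):

```java
import java.util.*;

// Build the user input passed to the LLM alongside the system prompt:
// a plain listing of each product's title and description.
class PromptInput {
    // Each entry is a hypothetical {title, description} pair.
    static String build(List<String[]> products) {
        StringBuilder sb = new StringBuilder();
        for (String[] p : products) {
            sb.append("Title: ").append(p[0]).append('\n');
            sb.append("Description: ").append(p[1]).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> products = Arrays.asList(
                new String[]{"A study on X", "We analyse X in depth."},
                new String[]{"A dataset for X", "Supporting data for the study."});
        System.out.print(build(products));
    }
}
```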