Michele De Bonis b56066ea5a reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00
.idea reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00
dhp-build reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00
dhp-raid reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00
.gitignore initial commit 2024-11-05 10:35:15 +01:00
README.markdown reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00
pom.xml reimplementation and optimization of the procedure to create documents 2024-12-11 09:43:11 +01:00

RAiD Inference

The Research Activity ID (RAiD) is inferred by exploiting the relationships in the graph. The process is configured through a JSON configuration file (e.g., the file raid.conf.json in dhp-raid/test/resources).

The workflow consists of three steps, plus an optional fourth:

1. Document creation

The documents are created from graph nodes and relations. The goal is to associate each node with a list of labels inherited from the nodes linked to it.

(possible limitations:) An approximate cross join is used to create metapaths. Connected components are computed on each relationship type to create the lists of labels.
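As an illustration of the idea of label inheritance only, the following minimal sketch builds one document per node from the labels of its direct neighbours. It is not the project's implementation: the function name `build_documents`, the label strings, and the choice to treat relations as undirected are all assumptions made for the example, and the metapath and connected-component logic is left out.

```python
from collections import defaultdict


def build_documents(node_labels, relations):
    """Return {node_id: sorted list of labels inherited from linked nodes}.

    node_labels: {node_id: label} (hypothetical input shape)
    relations:   iterable of (source_id, target_id) pairs
    """
    neighbours = defaultdict(set)
    for source, target in relations:
        # Assumption: relations are treated as undirected.
        neighbours[source].add(target)
        neighbours[target].add(source)

    documents = {}
    for node in node_labels:
        # A node inherits the labels of the nodes linked to it.
        inherited = {node_labels[n] for n in neighbours[node] if n in node_labels}
        documents[node] = sorted(inherited)
    return documents


# Hypothetical toy graph: two results linked to one project, one also to an organization.
node_labels = {"p1": "project:A", "r1": "result:X", "r2": "result:Y", "o1": "org:Z"}
relations = [("r1", "p1"), ("r2", "p1"), ("r1", "o1")]
documents = build_documents(node_labels, relations)
```

In this toy graph, `r1` inherits the labels of both the project and the organization, while `r2` inherits only the project label.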

2. Embedding creation

The embeddings are created from the documents of the previous step. The implementation uses the Word2Vec algorithm and normalizes the resulting vectors to unit length (as required by the clustering step).

(possible limitations:) Word2Vec creates vectors using cosine similarity.
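One way to see why unit-length vectors suit a distance-based clustering is that, for normalized vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so the two measures rank pairs identically. The sketch below checks this on a small example; the helper names are illustrative and not taken from the project.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two arbitrary example vectors, normalized to length 1.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

cosine = cosine_similarity(a, b)
squared_distance = math.dist(a, b) ** 2

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = math.isclose(squared_distance, 2.0 - 2.0 * cosine)
```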

3. Clustering

The clustering runs in parallel on separate partitions obtained from a preliminary K-Means pass. Within each partition, the DBSCAN algorithm is applied.

(possible limitations:) DBSCAN does not scale well, and the result strongly depends on how the partitions are created.
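The partition-then-cluster scheme can be sketched as follows. This is a simplified, single-machine illustration under stated assumptions: the centroids are given rather than learned (only the assignment step of K-Means is shown), the DBSCAN is a plain quadratic-time version, the partitions are processed sequentially rather than in parallel, and the key format `partition_cluster` is hypothetical.

```python
import math
from collections import defaultdict


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


def cluster_by_partition(points, centroids, eps, min_pts):
    """Assign each point to its nearest centroid, then run DBSCAN per partition."""
    partitions = defaultdict(list)
    for index, point in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
        partitions[nearest].append(index)

    keys = {}
    for partition_id, indices in partitions.items():
        labels = dbscan([points[i] for i in indices], eps, min_pts)
        for i, label in zip(indices, labels):
            # Hypothetical key format; noise points get no clustering key.
            keys[i] = None if label == -1 else f"{partition_id}_{label}"
    return keys


# Toy 2-D data: two dense groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (9, 0)]
centroids = [(0, 0), (5, 5)]
keys = cluster_by_partition(points, centroids, eps=0.3, min_pts=2)
```

Because DBSCAN only sees the points of one partition at a time, two nearby points that fall on opposite sides of a partition boundary can never end up in the same cluster, which is the dependency on partitioning mentioned above.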

4. (optional) Disambiguation-like processing

The clustering keys created by the previous step can be used to group nodes and create similarity relationships between them following a JSON configuration (similar to FDup, engineered to group together nodes in the same Research Activity).
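A minimal sketch of this grouping step, assuming each node carries a single clustering key: nodes sharing a key are grouped, and every pair inside a group becomes a candidate similarity relation. The optional `is_similar` predicate is a hypothetical stand-in for the rules that the JSON configuration would express; its name and signature are not taken from the project.

```python
from collections import defaultdict
from itertools import combinations


def similarity_relations(cluster_keys, is_similar=lambda a, b: True):
    """Return sorted (node_a, node_b) pairs for nodes sharing a clustering key.

    cluster_keys: {node_id: key or None}; nodes without a key are skipped.
    is_similar:   hypothetical pairwise check standing in for the configured rules.
    """
    groups = defaultdict(list)
    for node, key in cluster_keys.items():
        if key is not None:
            groups[key].append(node)

    relations = []
    for key in sorted(groups):
        # Compare only nodes inside the same group, never across groups.
        for a, b in combinations(sorted(groups[key]), 2):
            if is_similar(a, b):
                relations.append((a, b))
    return relations


# Hypothetical clustering keys for five nodes.
cluster_keys = {"r1": "0_0", "r2": "0_0", "r3": "0_0", "p1": "1_0", "o1": None}
relations = similarity_relations(cluster_keys)
```

Restricting comparisons to nodes with the same key keeps the number of pairwise checks proportional to the group sizes rather than to the whole graph.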