# RAiD Inference

The Research Activity ID (RAiD) is inferred by exploiting the relationships in the graph.
The process is configured through a JSON configuration file (e.g. the file `raid.conf.json` in `dhp-raid/test/resources`).

The workflow is composed of three steps, plus an optional fourth:

### 1. Documents creation

The documents are created from the graph nodes and relations. The purpose is to associate with each node a list of labels inherited from the nodes linked to it.

*(possible shortcomings:)* Approximated cross join to create metapaths. Connected components on each relationship type to create the list of labels.
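
As a rough illustration of the idea in this step, the following Python sketch builds one "document" per node from a toy graph. The input shapes (`node_labels`, `relations`), the relation names, and the `relationType:label` token format are assumptions made for the example, not the structures used by the actual implementation; metapaths and connected components are not covered.

```python
from collections import defaultdict


def build_documents(node_labels, relations):
    """Return {node_id: sorted list of labels inherited from linked nodes}.

    node_labels: {node_id: label} (hypothetical input shape).
    relations: iterable of (source_id, relation_type, target_id) triples.
    """
    documents = defaultdict(list)
    for source, rel_type, target in relations:
        # Relations are treated as undirected in this sketch: each endpoint
        # inherits the label of the other, tagged with the relation type.
        documents[source].append(f"{rel_type}:{node_labels[target]}")
        documents[target].append(f"{rel_type}:{node_labels[source]}")
    # Nodes without relations still get an (empty) document.
    return {node: sorted(documents[node]) for node in node_labels}


# Toy graph with made-up identifiers and labels.
node_labels = {"pub1": "L_pub1", "pub2": "L_pub2", "proj1": "L_proj1"}
relations = [
    ("pub1", "isProducedBy", "proj1"),
    ("pub2", "isProducedBy", "proj1"),
    ("pub1", "cites", "pub2"),
]
docs = build_documents(node_labels, relations)
```

Each resulting list plays the role of a sentence of tokens, which is the kind of input a Word2Vec-style model expects.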
### 2. Embeddings creation

The embeddings are created from the documents of the previous step. The implementation uses a Word2Vec algorithm, and the resulting vectors are normalized to unit length (to fit the needs of the clustering step).

*(possible shortcomings:)* Word2Vec creates vectors using cosine similarity.
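
To see why unit-length vectors suit a distance-based clustering, the sketch below normalizes two made-up vectors and checks that, for unit vectors, the squared Euclidean distance equals `2 - 2 * cosine similarity`. It only illustrates the normalization; it does not train Word2Vec, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up "embeddings", normalized to unit length.
a = l2_normalize([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalize([1.0, 0.0])   # [1.0, 0.0]

# On unit vectors the two measures carry the same ordering information:
# ||a - b||^2 == 2 - 2 * cos(a, b)
lhs = squared_euclidean(a, b)
rhs = 2.0 - 2.0 * cosine_similarity(a, b)
```

Under this identity, a Euclidean radius used by the clustering step corresponds to a cosine-similarity threshold on the normalized embeddings.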
### 3. Clustering

The clustering is done in parallel on different partitions obtained via a preliminary K-Means algorithm. Within each partition, the DBSCAN algorithm is applied.

*(possible shortcomings:)* DBSCAN does not scale well, and its results strongly depend on how the partitions are created.
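
The two-level scheme can be sketched in plain Python as follows. This is a sequential, minimal illustration under stated assumptions: the starting centroids, `eps`, `min_pts`, the toy 2-D points, and the `partition-cluster` key format are all hypothetical, and the real workflow runs the per-partition step in parallel.

```python
import math
from collections import defaultdict


def kmeans_assign(points, centroids, iterations=10):
    """Lloyd's iterations from given starting centroids; returns a partition index per point."""
    centroids = [list(c) for c in centroids]

    def nearest(p):
        return min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))

    for _ in range(iterations):
        assignment = [nearest(p) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a partition is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p) for p in points]


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # The point itself counts as one of its own neighbours.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: expand the cluster
                queue.extend(reachable)
    return labels


def partitioned_dbscan(points, centroids, eps, min_pts):
    """Partition with K-Means, then run DBSCAN independently inside each partition."""
    partitions = defaultdict(list)
    for idx, part in enumerate(kmeans_assign(points, centroids)):
        partitions[part].append(idx)
    keys = [None] * len(points)
    for part, indices in partitions.items():
        local = dbscan([points[i] for i in indices], eps, min_pts)
        for i, label in zip(indices, local):
            keys[i] = None if label == -1 else f"{part}-{label}"
    return keys


points = [(0, 0), (0, 0.1), (0.1, 0), (2, 2), (5, 5), (5, 5.1), (5.1, 5)]
keys = partitioned_dbscan(points, centroids=[(0, 0), (5, 5)], eps=0.5, min_pts=3)
```

Because DBSCAN never sees points from another partition, two nearby points that K-Means happens to separate can never end up in the same cluster, which is the dependence on partition creation noted above.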
### 4. (optional) Disambiguation-like processing

The clustering keys created by the previous step can be used to group nodes and create similarity relationships between them, following a JSON configuration (similar to FDup, engineered to group together nodes in the same Research Activity).
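
A minimal sketch of the grouping idea, assuming a hypothetical mapping from node identifiers to clustering keys: nodes sharing a key form a group, and every pair inside a group becomes a candidate similarity relationship. The real step would additionally apply the comparisons defined in the JSON configuration, which this example omits.

```python
from collections import defaultdict
from itertools import combinations


def similarity_relations(cluster_keys):
    """Return sorted (node_a, node_b) pairs of nodes that share a clustering key.

    cluster_keys: {node_id: key or None}; None means the node has no cluster.
    """
    groups = defaultdict(list)
    for node, key in cluster_keys.items():
        if key is not None:
            groups[key].append(node)
    relations = []
    for members in groups.values():
        # Every unordered pair within a group is a candidate relationship.
        relations.extend(combinations(sorted(members), 2))
    return sorted(relations)


# Made-up node identifiers and keys.
cluster_keys = {"pub1": "0-0", "pub2": "0-0", "ds1": "0-0", "pub3": "1-0", "pub4": None}
pairs = similarity_relations(cluster_keys)
```

Emitting all pairs is quadratic in the group size, so large groups would need a cheaper candidate selection in practice.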