RAiD Inference
The Research Activity ID is inferred by taking advantage of relationships in the graph.
The process is configured through a JSON configuration file (e.g. the file raid.conf.json in dhp-raid/test/resources).
The workflow is composed of three steps, plus an optional fourth:
1. Documents creation
The documents are created from graph nodes and relations. The purpose is to associate with each node a list of labels inherited from the nodes linked to it.
(possible gaps:) Approximated Cross Join to create metapaths. Connected Components on each relationship type to create the list of labels.
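The idea of a per-node document can be illustrated with a minimal sketch. The toy graph, the relation names, and the choice of using neighbour identifiers as labels are all assumptions made for illustration; the actual label scheme used by the workflow is not described here.

```python
from collections import defaultdict

# Hypothetical toy graph: typed relations between node identifiers.
relations = [
    ("pub1", "isProducedBy", "proj1"),
    ("pub2", "isProducedBy", "proj1"),
    ("proj1", "hasParticipant", "org1"),
]


def build_documents(relations):
    """Return node id -> sorted list of labels inherited from linked nodes.

    Assumption: the label contributed by a neighbour is simply its identifier,
    and relations are treated as undirected.
    """
    neighbours = defaultdict(set)
    for source, _relation_type, target in relations:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: sorted(linked) for node, linked in neighbours.items()}


documents = build_documents(relations)
for node in sorted(documents):
    print(node, documents[node])
```

In this toy case pub1 and pub2 end up with identical documents because they share the same neighbour, which is the kind of signal the later steps can exploit.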
2. Embeddings creation
The embeddings are created from the documents of the previous step. The implementation uses a Word2Vec algorithm and normalizes each resulting vector to unit length (norm equal to 1), to fit the needs of the clustering step.
(possible gaps:) Word2Vec creates vectors using cosine similarity.
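A short sketch of the normalization step follows. It also checks a general property of unit vectors: the squared Euclidean distance between two of them equals 2 minus twice their cosine similarity, so a distance-based clustering on normalized vectors orders pairs the same way cosine similarity does. Whether this is the exact motivation in the project is an assumption; the function names and sample vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings standing in for Word2Vec output.
a = [3.0, 4.0]
b = [1.0, 0.0]

unit_a, unit_b = l2_normalize(a), l2_normalize(b)
squared_distance = euclidean_distance(unit_a, unit_b) ** 2

# On unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(squared_distance, 2 - 2 * cosine_similarity(a, b))
```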
3. Clustering
The clustering runs in parallel on different partitions obtained via a preliminary K-Means algorithm. Within each partition, the DBSCAN algorithm is applied.
(possible gaps:) DBSCAN does not scale well, and its results depend strongly on how the partitions are created.
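The two-level scheme can be sketched in plain Python as follows. This is a single-process illustration under stated assumptions, not the project's implementation: the sample points, initial centroids, `eps`, `min_pts`, and the `partition-cluster` key format are all hypothetical, and the partitions are processed sequentially rather than in parallel.

```python
import math
from collections import defaultdict


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda k: distance(point, centroids[k]))


def kmeans_partition(points, initial_centroids, iterations=10):
    """Plain Lloyd iterations; returns the partition index of each point."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index, or -1 for noise.

    min_pts counts the point itself among its neighbours.
    """
    labels = [None] * len(points)

    def region(i):
        return [j for j in range(len(points)) if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels


def partitioned_dbscan(points, initial_centroids, eps, min_pts):
    """Partition with K-Means, then run DBSCAN independently in each partition."""
    partitions = defaultdict(list)
    for index, part in enumerate(kmeans_partition(points, initial_centroids)):
        partitions[part].append(index)
    keys = {}
    for part, indices in partitions.items():
        labels = dbscan([points[i] for i in indices], eps, min_pts)
        for i, label in zip(indices, labels):
            keys[i] = None if label == -1 else f"{part}-{label}"
    return keys


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
keys = partitioned_dbscan(points, [(0.0, 0.0), (5.0, 5.0)], eps=0.3, min_pts=2)
print(keys)
```

Because DBSCAN never looks across partition boundaries, two nearby points that K-Means places in different partitions cannot share a clustering key, which is one way the result depends on how the partitions are created.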
4. (optional) Disambiguation-like processing
The clustering keys created by the previous step can be used to group nodes and create similarity relationships between them, following a JSON configuration (similar to FDup, but engineered to group together nodes in the same Research Activity).
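A minimal sketch of the grouping idea: nodes sharing a clustering key are paired and a relation is emitted for each pair. The key values, node identifiers, and the `isSimilarTo` relation name are hypothetical, and the sketch omits the configurable comparison logic that a JSON configuration would drive.

```python
from collections import defaultdict
from itertools import combinations


def similarity_relations(cluster_keys, relation_name="isSimilarTo"):
    """Emit one relation per unordered pair of nodes sharing a clustering key.

    Nodes whose key is None (treated here as noise) are skipped.
    """
    groups = defaultdict(list)
    for node, key in cluster_keys.items():
        if key is not None:
            groups[key].append(node)
    relations = []
    for key in sorted(groups):
        for left, right in combinations(sorted(groups[key]), 2):
            relations.append((left, relation_name, right))
    return relations


# Hypothetical clustering keys, in the "partition-cluster" form.
cluster_keys = {
    "pub1": "0-0",
    "pub2": "0-0",
    "proj1": "0-0",
    "org1": "1-0",
    "pub3": None,
}

relations = similarity_relations(cluster_keys)
for relation in relations:
    print(relation)
```

Pairing every member of a group grows quadratically with group size, so a real pipeline would likely bound the group size or filter pairs before comparing them.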