# RAiD Inference

The Research Activity ID (RAiD) is inferred by exploiting the relationships in the graph. The process is configured through a JSON configuration file (e.g. the file `raid.conf.json` in `dhp-raid/test/resources`). The workflow consists of three steps, plus an optional fourth:

### 1. Documents creation

The documents are created from the graph nodes and relations. The purpose is to associate with each node a list of labels inherited from the nodes linked to it.

*(possible limitations:)* An approximated Cross Join is used to create the metapaths. Connected Components are computed on each relationship type to create the lists of labels.

### 2. Embeddings creation

The embeddings are created from the documents of the previous step. The implementation uses a Word2Vec algorithm whose output vectors are normalized to length 1 (to fit the needs of the clustering step).

*(possible limitations:)* Word2Vec creates vectors based on cosine similarity.

### 3. Clustering

The clustering is done in parallel on separate partitions obtained via a preliminary K-Means algorithm. Each partition is then clustered with the DBSCAN algorithm.

*(possible limitations:)* DBSCAN does not scale well, and its results depend strongly on how the partitions are created.

### 4. (optional) Disambiguation-like processing

The clustering keys created by the previous step can be used to group nodes and to create similarity relationships between them, following a JSON configuration (similar to FDup, engineered to group together nodes that belong to the same Research Activity).
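
The interplay between steps 2 and 3 can be illustrated with a minimal sketch in plain Python. It is not the project's implementation: the function names, the in-memory partition layout, and the `partition:cluster` key format are all assumptions made for illustration, and the K-Means partitioning is taken as already done. The sketch shows why unit-length vectors suit the clustering: for unit vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so a Euclidean `eps` in DBSCAN corresponds to a cosine-similarity threshold.

```python
import math
from collections import deque


def normalize(vec):
    """Scale a vector to unit length (L2 norm equal to 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # the zero vector cannot be normalized; keep it as-is
    return [x / norm for x in vec]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same size."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN. Returns one label per point; -1 means noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # A point counts as its own neighbour (distance 0 <= eps).
        return [j for j in range(len(points))
                if euclidean(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def cluster_partitions(partitions, eps, min_pts):
    """Run DBSCAN independently on each partition.

    `partitions` maps a partition id to a list of (node_id, vector) pairs
    (hypothetical layout). Returns node_id -> clustering key, or None for
    noise. The "partition:cluster" key format is an assumption.
    """
    keys = {}
    for pid, members in partitions.items():
        vectors = [normalize(vec) for _, vec in members]
        labels = dbscan(vectors, eps, min_pts)
        for (node_id, _), label in zip(members, labels):
            keys[node_id] = None if label == -1 else f"{pid}:{label}"
    return keys
```

For example, `cluster_partitions({0: [("a", [1.0, 0.0]), ("b", [2.0, 0.1])]}, eps=0.2, min_pts=2)` assigns the same key to `a` and `b`, because after normalization the two vectors point in nearly the same direction even though their original lengths differ. Since each partition is clustered in isolation, two similar nodes that K-Means places in different partitions can never share a key, which is the dependency on partition creation noted in step 3.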