
# RAiD Inference

The Research Activity ID (RAiD) is inferred by exploiting the relationships in the graph.
The process is configured through a JSON configuration file (e.g., the file `raid.conf.json` in `dhp-raid/test/resources`).

The workflow consists of three steps, plus an optional fourth:

### 1. Documents creation
The documents are created from graph nodes and relations. The purpose is to associate with each node a list of labels inherited from the nodes linked to it.

*(possible shortcomings:)* Approximated cross join to create metapaths. Connected components on each relationship type to create the list of labels.
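
As an illustration of the idea only, the following minimal sketch builds one document per node from a list of relations. The triple layout `(source, relation_type, target)`, the `type:neighbour` label scheme, the undirected treatment of relations, and the function name `build_documents` are all assumptions made for this example, not the actual implementation.

```python
from collections import defaultdict


def build_documents(relations):
    """Map each node to the sorted labels inherited from its linked nodes.

    `relations` is an iterable of (source, relation_type, target) triples.
    Relations are treated as undirected here, which is an assumption.
    """
    docs = defaultdict(set)
    for source, rel_type, target in relations:
        # Hypothetical label scheme: relation type plus the neighbour's id.
        docs[source].add(f"{rel_type}:{target}")
        docs[target].add(f"{rel_type}:{source}")
    return {node: sorted(labels) for node, labels in docs.items()}


# Toy graph with made-up identifiers.
relations = [
    ("pub1", "producedBy", "proj1"),
    ("pub2", "producedBy", "proj1"),
    ("pub1", "hasAuthor", "auth1"),
]
documents = build_documents(relations)
```

In this toy graph, `pub1` and `pub2` share the label `producedBy:proj1`, so their documents overlap and they are likely to end up with similar embeddings in the next step.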

### 2. Embeddings creation
The embeddings are created from the documents of the previous step. The implementation uses a Word2Vec algorithm and normalizes the resulting vectors to unit length (as required by the clustering step).

*(possible shortcomings:)* Word2Vec creates vectors using cosine similarity.
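
The normalization step can be sketched in plain Python as follows. The helper names `l2_normalize` and `cosine_similarity` are hypothetical, and the sketch covers only the unit-length scaling, not Word2Vec training itself.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity.
squared_distance = math.dist(a, b) ** 2
expected = 2.0 - 2.0 * cosine_similarity(a, b)
```

Once vectors have unit length, Euclidean distance is a monotone function of cosine similarity. This is one plausible reason why normalization suits a distance-based clustering step.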

### 3. Clustering
The clustering runs in parallel on separate partitions obtained via a preliminary K-Means step. Within each partition, the DBSCAN algorithm is applied.

*(possible shortcomings:)* DBSCAN does not scale well, and its results strongly depend on how the partitions are created.
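
A minimal, single-machine sketch of the partition-then-cluster idea is shown below. It assumes the K-Means centroids are already given (the fitting step is omitted), uses a textbook DBSCAN written in plain Python, and invents the key format `partition_cluster`. The names `dbscan` and `cluster_by_partition` are illustrative and do not come from the project.

```python
import math
from collections import defaultdict


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep expanding the cluster
    return labels


def cluster_by_partition(vectors, centroids, eps, min_pts):
    """Assign vectors to the nearest centroid, then run DBSCAN per partition."""
    partitions = defaultdict(list)
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        partitions[nearest].append(idx)

    keys = {}
    for part, members in partitions.items():
        labels = dbscan([vectors[i] for i in members], eps, min_pts)
        for i, label in zip(members, labels):
            # Hypothetical clustering key: "<partition>_<cluster>", None for noise.
            keys[i] = None if label == -1 else f"{part}_{label}"
    return keys


# Toy data: two dense groups and one isolated point.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
keys = cluster_by_partition(vectors, centroids, eps=0.2, min_pts=2)
```

Because DBSCAN only sees the points inside one partition, two nearby vectors that fall on opposite sides of a partition boundary can never share a clustering key. This illustrates the dependence on partition creation.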

### 4. (optional) Disambiguation-like processing
The clustering keys created in the previous step can be used to group nodes and create similarity relationships between them according to a JSON configuration (similar to FDup, but engineered to group together nodes in the same Research Activity).
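
The grouping part of this step can be sketched as below. The sketch only emits candidate pairs for nodes that share a clustering key. The configuration-driven comparison is omitted, and the relation name `isSimilarTo` and the function name `similarity_relations` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def similarity_relations(cluster_keys):
    """Emit one candidate relation per pair of nodes sharing a clustering key."""
    groups = defaultdict(list)
    for node, key in cluster_keys.items():
        if key is not None:  # nodes without a key (noise) are skipped
            groups[key].append(node)

    relations = []
    for nodes in groups.values():
        # Sorting makes the output deterministic within each group.
        for a, b in combinations(sorted(nodes), 2):
            relations.append((a, "isSimilarTo", b))
    return relations


# Made-up node identifiers and clustering keys.
candidate_relations = similarity_relations(
    {"a": "0_0", "b": "0_0", "c": "1_0", "d": None, "e": "0_0"}
)
```

In a real pipeline, each candidate pair would presumably still be checked against the rules of the JSON configuration before a similarity relationship is created.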