
# RAiD Inference 

The Research Activity ID (RAiD) is inferred by taking advantage of relationships in the graph.
The process is configured through a JSON configuration file (e.g., the file `raid.conf.json` in `dhp-raid/test/resources`).

The workflow consists of three steps, plus an optional fourth:

### 1. Documents creation
The documents are created from graph nodes and relations. The purpose is to associate each node with a list of labels inherited from the nodes linked to it.

*(possible limitations:)* Approximated cross join to create metapaths. Connected components on each relationship type to create the list of labels.
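
As an illustration of the idea behind this step, here is a minimal single-machine sketch in Python. It is not the project's implementation: the node identifiers, the relation names, the choice of one label per node, and the decision to treat relations as undirected are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy graph: node id -> label, plus (source, relation type, target) triples.
node_labels = {
    "pub1": "publication_1",
    "pub2": "publication_2",
    "proj1": "project_1",
    "org1": "organization_1",
}
relations = [
    ("pub1", "isProducedBy", "proj1"),
    ("pub2", "isProducedBy", "proj1"),
    ("pub1", "hasAuthorInstitution", "org1"),
]


def build_documents(node_labels, relations):
    """Return node id -> sorted list of labels inherited from its linked nodes."""
    inherited = defaultdict(set)
    for source, _rel_type, target in relations:
        # Assumption: relations are undirected, so each endpoint inherits the other's label.
        inherited[source].add(node_labels[target])
        inherited[target].add(node_labels[source])
    # Nodes without any relation get an empty document.
    return {node: sorted(inherited[node]) for node in node_labels}


documents = build_documents(node_labels, relations)
for node, labels in documents.items():
    print(node, labels)
```

Each resulting list plays the role of a "document" whose "words" are labels, which is the shape a word-embedding algorithm expects as input.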

### 2. Embeddings creation
The embeddings are created from the documents of the previous step. The implementation uses a Word2Vec algorithm and normalizes the resulting vectors to unit length (to fit the needs of the clustering step).

*(possible limitations:)* Word2Vec creates vectors using cosine similarity.
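
One plausible reading of why unit-length vectors suit the clustering step, assuming the clustering relies on Euclidean distance (the text above does not say so explicitly): for unit vectors, squared Euclidean distance and cosine similarity are tied by `||a - b||^2 = 2 - 2 * cos(a, b)`, so a distance threshold behaves like a cosine-similarity threshold. The sketch below checks that identity on two made-up vectors standing in for Word2Vec output.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical embeddings, not produced by any real model.
u = [3.0, 4.0]
v = [1.0, 0.0]
u_hat, v_hat = normalize(u), normalize(v)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
lhs = squared_euclidean(u_hat, v_hat)
rhs = 2 - 2 * cosine_similarity(u, v)
print(math.isclose(lhs, rhs))
```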

### 3. Clustering
The clustering runs in parallel on separate partitions obtained via a preliminary K-Means algorithm. Within each partition, the DBSCAN algorithm is applied.

*(possible limitations:)* DBSCAN does not scale well, and its results depend strongly on how the partitions are created.
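
The partition-then-cluster scheme can be sketched as follows. This is a sequential, pure-Python approximation and not the project's code: the centroids are assumed to come from an earlier K-Means run, `eps` and `min_pts` are illustrative values, and the `partition_cluster` key format is hypothetical.

```python
import math
from collections import defaultdict

NOISE = -1


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one cluster id per point (NOISE for outliers)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


def assign_partitions(points, centroids):
    """K-Means assignment step: map each point index to its nearest centroid."""
    partitions = defaultdict(list)
    for idx, point in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
        partitions[nearest].append(idx)
    return partitions


def cluster_by_partition(points, centroids, eps, min_pts):
    """Run DBSCAN independently inside each partition and build clustering keys."""
    keys = {}
    for part_id, indices in assign_partitions(points, centroids).items():
        local_labels = dbscan([points[i] for i in indices], eps, min_pts)
        for i, label in zip(indices, local_labels):
            keys[i] = None if label == NOISE else f"{part_id}_{label}"
    return keys


points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (9, 0)]
centroids = [(0, 0), (5, 5)]  # assumed output of a preliminary K-Means run
keys = cluster_by_partition(points, centroids, eps=0.3, min_pts=2)
print(keys)
```

Because each partition is clustered independently, two nearby points that fall on opposite sides of a partition boundary can never end up in the same cluster, which is one concrete way the result depends on how the partitions are created.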

### 4. (optional) Disambiguation-like processing
The clustering keys created by the previous step can be used to group nodes and create similarity relationships between them following a JSON configuration (similar to FDup, engineered to group together nodes in the same Research Activity).
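
A minimal sketch of the grouping part of this step, under stated assumptions: the node identifiers and key values are invented, and every pair of nodes sharing a key is emitted as a candidate relation. The configuration-driven comparison mentioned above is not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def similarity_relations(clustering_keys):
    """Return (node_a, node_b) pairs for nodes that share a clustering key."""
    groups = defaultdict(list)
    for node, key in clustering_keys.items():
        if key is not None:  # nodes without a key (e.g. noise) join no group
            groups[key].append(node)
    relations = []
    for members in groups.values():
        # Sorting makes the pair order deterministic; singletons yield no pairs.
        relations.extend(combinations(sorted(members), 2))
    return relations


# Hypothetical clustering keys, in the spirit of the output of the clustering step.
clustering_keys = {
    "pub1": "0_0",
    "pub2": "0_0",
    "dataset1": "0_0",
    "pub3": "1_0",
    "pub4": None,
}
candidate_relations = similarity_relations(clustering_keys)
print(candidate_relations)
```

Emitting all pairs is quadratic in the group size, so in practice a filtering or comparison stage would likely narrow the candidates before relations are written back to the graph.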