added Read Me file

Miriam Baglioni 2023-02-13 13:11:35 +01:00
parent 39b5d4abc4
commit 6d10cdcd44
1 changed files with 28 additions and 0 deletions

Reade Me.md Normal file

@ -0,0 +1,28 @@
# Registries Overlap
This folder contains the input data and code needed to reproduce the findings reported in the IRCDL paper *(Semi)automated disambiguation of scholarly repositories*.
The data/in folder contains the bulk downloads of the four registries and the output of a run of the dedup algorithm.
The code folder contains the two Python scripts used to create the groups and to extend them with the output of the dedup.
To reproduce the creation of the groups, run
python code/crossreferencegroups.py
To extend the groups with the output of the dedup, run
python code/crossreferencededup.py
The two runs produce the following output files:
data/out/ErrorsCrossRefs.txt contains the cross-references that clash
data/out/report_registry_groups.txt contains the number of cross-references among the registries
data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references)
data/out/registryGroups.txt contains the groups obtained from cross-referencing
data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line
data/out/onlyRegistry.txt contains the groups that come from the registries only
data/out/onlyDedup.txt contains the groups that come from the dedup only
data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters
data/out/intersection.txt contains the groups and clusters that intersect, together with their intersection
data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters
data/out/report.txt contains a summary of the groups and the list for the various options
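
The two steps and the output files listed above can be driven and checked with a short script. This is a minimal sketch and not part of the repository: the names `run_pipeline`, `missing_outputs`, `SCRIPTS` and `EXPECTED_OUTPUTS` are hypothetical, and it assumes that both scripts are launched from the repository root, take no arguments, and write all their results to data/out.

```python
import subprocess
import sys
from pathlib import Path

# Scripts in the order given in this README.
SCRIPTS = ["code/crossreferencegroups.py", "code/crossreferencededup.py"]

# Output file names as listed in this README.
# Assumption: report.txt sits in data/out with the other files.
EXPECTED_OUTPUTS = [
    "ErrorsCrossRefs.txt",
    "report_registry_groups.txt",
    "toCheckGroups.txt",
    "registryGroups.txt",
    "allDuplicateSets.txt",
    "onlyRegistry.txt",
    "onlyDedup.txt",
    "completeOverlap.txt",
    "intersection.txt",
    "newGroups.txt",
    "report.txt",
]


def missing_outputs(out_dir="data/out"):
    """Return the expected output files that are not present in out_dir."""
    out = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]


def run_pipeline(repo_root="."):
    """Run both scripts in order, then report which outputs are missing."""
    for script in SCRIPTS:
        # check=True stops at the first script that exits with an error.
        subprocess.run([sys.executable, script], cwd=repo_root, check=True)
    return missing_outputs(Path(repo_root) / "data" / "out")
```

Calling `run_pipeline()` from the repository root returns an empty list when every file listed above has been written. `missing_outputs()` can also be called on its own to inspect an existing data/out folder without rerunning the scripts.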