Registries Overlap

This folder contains the input data and code needed to reproduce the findings reported in the IRCDL paper "(Semi)automated disambiguation of scholarly repositories". The data/in folder contains the bulk downloads of the four registries and the output of the run of the dedup algorithm. The code folder contains the two Python scripts used to create the groups and to extend them with the output of the dedup. To reproduce the creation of the groups, run:

python code/crossreferencegroups.py

To extend the groups with the dedup output, run:

python code/crossreferencededup.py

The two runs produce the following output files:

data/out/ErrorsCrossRefs.txt contains the cross-references that clash.
data/out/report_registry_groups.txt contains the number of cross-references among the registries.
data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references).
data/out/registryGroups.txt contains the groups obtained from cross-referencing.

data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line.
data/out/onlyRegistry.txt contains the groups that come from the registries only.
data/out/onlyDedup.txt contains the groups that come from the dedup only.
data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters.
data/out/intersection.txt contains the groups and clusters that intersect, together with their intersection.
data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters.
data/out/report.txt contains a summary of the groups and the lists for the various options.
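The relationship between the last set of files can be illustrated with a minimal sketch. It is not the code in code/crossreferencededup.py. It assumes that each group or cluster is a set of identifiers, that "registries only" and "dedup only" mean sharing no member with the other side, and that new groups are formed by merging every group and cluster that intersect. The function names, dictionary keys, and sample identifiers are all hypothetical.

```python
def merge_overlapping(sets):
    """Merge sets that share at least one member (assumed merge rule)."""
    merged = []
    for current in map(set, sets):
        # Absorb every already-merged set that overlaps with the current one.
        for other in [m for m in merged if m & current]:
            current |= other
            merged.remove(other)
        merged.append(current)
    return merged


def compare_groups(registry_groups, dedup_clusters):
    """Classify registry groups against dedup clusters (illustrative only)."""
    groups = [frozenset(g) for g in registry_groups]
    clusters = [frozenset(c) for c in dedup_clusters]

    # Groups that are identical to some dedup cluster.
    complete_overlap = [g for g in groups if g in clusters]
    # Groups and clusters that share no member with the other side.
    only_registry = [g for g in groups if all(g.isdisjoint(c) for c in clusters)]
    only_dedup = [c for c in clusters if all(c.isdisjoint(g) for g in groups)]
    # Partially overlapping pairs, with the members they have in common.
    intersection = [
        (g, c, g & c) for g in groups for c in clusters if g != c and g & c
    ]
    # Candidate new groups: the union of every intersecting group and cluster.
    new_groups = merge_overlapping(g | c for g, c, _ in intersection)

    return {
        "only_registry": only_registry,
        "only_dedup": only_dedup,
        "complete_overlap": complete_overlap,
        "intersection": intersection,
        "new_groups": new_groups,
    }


# Hypothetical identifiers, not taken from the actual registries.
example = compare_groups(
    registry_groups=[{"a", "b"}, {"c", "d"}, {"e", "f"}],
    dedup_clusters=[{"a", "b"}, {"d", "x"}, {"y", "z"}],
)
```

In this example, the group {"a", "b"} overlaps completely with a cluster, {"e", "f"} appears only on the registry side, {"y", "z"} only on the dedup side, and {"c", "d"} intersects {"d", "x"}, so the two would be merged into a single new group.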