# Registries Overlap

This folder contains the input data and code needed to reproduce the findings reported in the IRCDL paper *(Semi)automated disambiguation of scholarly repositories*.

The data/in folder contains the bulk downloads of the four registries and the output of a run of the dedup algorithm.

The code folder contains the two Python scripts used to create the groups and to extend them with the output of the dedup.

To reproduce the creation of the groups, run:

    python code/crossreferencegroups.py

To extend the groups with the dedup output, run:

    python code/crossreferencededup.py

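The two commands above can also be chained from a small driver script. The following is only a minimal sketch, not part of the repository: it assumes it is launched from the repository root, that the groups must be created before they are extended, and that neither script needs extra arguments. The `run_pipeline` helper and its `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess
import sys

# Script paths as given in this README, relative to the repository root.
SCRIPTS = [
    "code/crossreferencegroups.py",  # creates the groups
    "code/crossreferencededup.py",   # extends the groups with the dedup output
]


def run_pipeline(run=subprocess.run):
    """Run both scripts in order and return the commands that were issued.

    `run` is injectable only so the function can be exercised without
    launching the real scripts; it defaults to subprocess.run.
    """
    commands = []
    for script in SCRIPTS:
        command = [sys.executable, script]
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so the second script is skipped if the
        # first one fails.
        run(command, check=True)
        commands.append(command)
    return commands
```

Calling `run_pipeline()` from the repository root would then run both steps with the same interpreter that runs the driver.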
The two runs produce the following output files:

- data/out/ErrorsCrossRefs.txt contains the cross-references that clash
- data/out/report_registry_groups.txt contains the number of cross-references among the registries
- data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references)
- data/out/registryGroups.txt contains the groups obtained from cross-referencing

- data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line
- data/out/onlyRegistry.txt contains the groups coming from the registries only
- data/out/onlyDedup.txt contains the groups coming from the dedup only
- data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters
- data/out/intersection.txt contains the groups and clusters that intersect, together with their intersection
- data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters
- data/out/report.txt contains a summary of the groups and the list for the various options
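To make the relationship between the dedup-related output files more concrete, here is an illustrative sketch of how registry groups could be compared with dedup clusters. It is not the code in `code/crossreferencededup.py`, and several points are assumptions: groups and clusters are modelled as Python sets of identifiers, "complete overlap" is read as set equality, and a new group is taken to be the plain union of one intersecting group and cluster. The `compare` function and the identifiers in the example are hypothetical.

```python
def compare(registry_groups, dedup_clusters):
    """Classify registry groups against dedup clusters (illustrative only)."""
    result = {
        "complete_overlap": [],  # group identical to a cluster (assumed meaning)
        "intersection": [],      # (group, cluster, shared identifiers)
        "only_registry": [],     # groups sharing nothing with any cluster
        "only_dedup": [],        # clusters sharing nothing with any group
        "new_groups": [],        # union of an intersecting group and cluster
    }
    matched_clusters = set()
    for group in registry_groups:
        touched = False
        for index, cluster in enumerate(dedup_clusters):
            common = group & cluster
            if not common:
                continue
            touched = True
            matched_clusters.add(index)
            if group == cluster:
                result["complete_overlap"].append(group)
            else:
                result["intersection"].append((group, cluster, common))
                result["new_groups"].append(group | cluster)
        if not touched:
            result["only_registry"].append(group)
    result["only_dedup"] = [
        cluster
        for index, cluster in enumerate(dedup_clusters)
        if index not in matched_clusters
    ]
    return result


# Hypothetical identifiers, made up for this example.
groups = [{"A:1", "B:7"}, {"A:2", "C:4"}, {"B:9", "D:3"}]
clusters = [{"A:1", "B:7"}, {"C:4", "D:8"}, {"C:5", "D:6"}]
summary = compare(groups, clusters)
```

In this example the first group matches a cluster exactly, the second shares only `C:4` with a cluster and yields one new merged group, the third appears in the registries only, and the last cluster appears in the dedup output only.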