added Read Me file
parent
39b5d4abc4
commit
6d10cdcd44
@@ -0,0 +1,28 @@
# Registries Overlap

This folder contains the input data and code needed to reproduce the findings reported in the IRCDL paper *(Semi)automated disambiguation of scholarly repositories*.

The data/in folder holds the bulk downloads of the four registries and the output of a run of the dedup algorithm.

The code folder holds the two Python scripts: one creates the groups, and the other extends them using the output of the dedup.

To reproduce the creation of the groups, run

    python code/crossreferencegroups.py

To extend the groups with the output of the dedup, run

    python code/crossreferencededup.py

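The two commands above can also be chained from a small driver. The following is a minimal sketch and not part of the repository: it assumes it is called from the repository root, that the groups script should finish before the dedup script starts (inferred from the order of the instructions), and that neither script takes command-line arguments. The names `SCRIPTS`, `build_commands`, and `run_all` are hypothetical.

```python
import subprocess
import sys

# Script paths as given in this README. Running them in this order is an
# assumption based on the order of the instructions.
SCRIPTS = ["code/crossreferencegroups.py", "code/crossreferencededup.py"]


def build_commands():
    """Return one command per script, using the current Python interpreter."""
    return [[sys.executable, script] for script in SCRIPTS]


def run_all(root="."):
    """Run the scripts in order from `root`, stopping at the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the dedup step is skipped if the groups step fails.
        subprocess.run(command, cwd=root, check=True)
```

Calling `run_all()` from the repository root would then perform both steps in sequence.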
The two runs produce the following output files:

- data/out/ErrorsCrossRefs.txt contains the cross-references that clash
- data/out/report_registry_groups.txt contains the number of cross-references among the registries
- data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references)
- data/out/registryGroups.txt contains the groups obtained from cross-referencing

- data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line
- data/out/onlyRegistry.txt contains the groups that come from the registries only
- data/out/onlyDedup.txt contains the groups that come from the dedup only
- data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters
- data/out/intersection.txt contains the groups and clusters that intersect, together with their intersection
- data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters
- data/out/report.txt contains a summary of the groups and the lists for the various options
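
After both runs, a quick check can confirm that every file listed above was written. This is a minimal sketch under stated assumptions, not code from the repository: the file names are copied from the list above, report.txt is assumed to live directly in data/out, and the files are assumed to be UTF-8 text. The names `EXPECTED_OUTPUTS`, `missing_outputs`, and `count_groups` are hypothetical. `count_groups` relies only on the stated one-group-per-line layout of allDuplicateSets.txt; skipping blank lines is an assumption.

```python
from pathlib import Path

# File names taken from the list in this README.
# Assumption: report.txt sits directly in data/out like the others.
EXPECTED_OUTPUTS = [
    "ErrorsCrossRefs.txt",
    "report_registry_groups.txt",
    "toCheckGroups.txt",
    "registryGroups.txt",
    "allDuplicateSets.txt",
    "onlyRegistry.txt",
    "onlyDedup.txt",
    "completeOverlap.txt",
    "intersection.txt",
    "newGroups.txt",
    "report.txt",
]


def missing_outputs(out_dir="data/out"):
    """Return the expected output files that are not present in out_dir."""
    out_path = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out_path / name).is_file()]


def count_groups(path):
    """Count groups in a one-group-per-line file, ignoring blank lines."""
    # Assumption: the file is UTF-8 encoded text.
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

An empty list from `missing_outputs()` suggests that both scripts completed; `count_groups("data/out/allDuplicateSets.txt")` then gives the number of duplicate groups.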