added Read Me file
This commit is contained in:
parent
39b5d4abc4
commit
6d10cdcd44
@ -0,0 +1,28 @@
# Registries Overlap

This folder contains the input data and code to reproduce the findings reported in the IRCDL paper *(Semi)automated disambiguation of scholarly repositories*.

The data/in folder contains the bulk downloads of the four registries and the output of a run of the dedup algorithm.
The code folder contains the two Python scripts used to create the groups and to extend them with the output of the dedup.

To reproduce the creation of the groups, run

    python code/crossreferencegroups.py

To extend the groups with the dedup output, run

    python code/crossreferencededup.py
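
The two commands above can also be chained from a small Python helper. This is only a sketch, not part of the repository: the names `build_commands` and `run_all` are hypothetical, and it assumes that the second script should run only after the first one has finished successfully and that the helper is started from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as given in this README, relative to the repository root.
SCRIPTS = [
    Path("code/crossreferencegroups.py"),
    Path("code/crossreferencededup.py"),
]


def build_commands(scripts=SCRIPTS, python=sys.executable):
    """Return one argument list per script, in the order given."""
    return [[python, str(script)] for script in scripts]


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed first step prevents the second one from running.
        subprocess.run(command, check=True)
```

Calling `run_all(build_commands())` from the repository root would then run both steps in sequence.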

The two runs produce the following output files:

data/out/ErrorsCrossRefs.txt contains the cross-references that clash
data/out/report_registry_groups.txt contains the number of cross-references among the registries
data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references)
data/out/registryGroups.txt contains the groups obtained from cross-referencing

data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line
data/out/onlyRegistry.txt contains the groups that come from the registries only
data/out/onlyDedup.txt contains the groups that come from the dedup only
data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters
data/out/intersection.txt contains the groups and clusters that intersect, together with their intersection
data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters
data/out/report.txt contains a summary of the groups and the list for the various options
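
After both runs, a quick check that every listed file was actually written can be sketched as follows. The helper name `missing_outputs` is hypothetical and not part of the repository; the file names are copied from the list above, and the last entry assumes that the report is written to data/out/report.txt.

```python
from pathlib import Path

# Output file names as listed in this README.
EXPECTED_OUTPUTS = [
    "ErrorsCrossRefs.txt",
    "report_registry_groups.txt",
    "toCheckGroups.txt",
    "registryGroups.txt",
    "allDuplicateSets.txt",
    "onlyRegistry.txt",
    "onlyDedup.txt",
    "completeOverlap.txt",
    "intersection.txt",
    "newGroups.txt",
    "report.txt",  # assumption: the report lives directly in data/out
]


def missing_outputs(out_dir="data/out", expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in out_dir."""
    out_dir = Path(out_dir)
    return [name for name in expected if not (out_dir / name).is_file()]
```

An empty list returned by `missing_outputs()` means that all eleven files are present; any names it returns point to a step that did not complete.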