diff --git a/Reade Me.md b/Reade Me.md
new file mode 100644
index 0000000..9da36b5
--- /dev/null
+++ b/Reade Me.md
@@ -0,0 +1,28 @@
+# Registries Overlap
+This folder contains the input data and code needed to reproduce the findings reported in the IRCDL paper *(Semi)automated disambiguation of scholarly repositories*.
+The data/in folder contains the bulk downloads of the four registries and the output of a run of the dedup algorithm.
+The code folder contains the two Python scripts used to create the groups and to extend them with the dedup output.
+To reproduce the creation of the groups, run
+
+python code/crossreferencegroups.py
+
+To extend the groups with the dedup output, run
+
+python code/crossreferencededup.py
+
+The two runs produce the following output files:
+
+
+data/out/ErrorsCrossRefs.txt contains the cross-references that clash
+data/out/report_registry_groups.txt contains the number of cross-references among the registries
+data/out/toCheckGroups.txt contains the groups for which there is an error (a clash in the cross-references)
+data/out/registryGroups.txt contains the groups obtained from cross-referencing
+
+data/out/allDuplicateSets.txt contains the groups of all the duplicates, one group per line
+data/out/onlyRegistry.txt contains the groups from the registries only
+data/out/onlyDedup.txt contains the groups coming from the dedup only
+data/out/completeOverlap.txt contains the groups that completely overlap with the dedup clusters
+data/out/intersection.txt contains the groups and clusters that intersect, and their intersection
+data/out/newGroups.txt contains the new groups built by combining registry cross-references and dedup clusters
+data/out/report.txt contains a summary of the groups and the list for the various options
+
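
The reproduction steps described in this README could be wrapped in a small Python helper that runs the two scripts in order and then reports which of the listed output files are still missing. This is only a sketch and not part of the repository: the names `run_pipeline`, `missing_outputs`, `SCRIPTS` and `EXPECTED_OUTPUTS` are hypothetical, and it assumes that the scripts are launched from the repository root and that every listed file lands directly in `data/out/`.

```python
import subprocess
import sys
from pathlib import Path

# Scripts in the order the README runs them (paths relative to the repo root).
SCRIPTS = ["code/crossreferencegroups.py", "code/crossreferencededup.py"]

# Output files named in the README. Assumption: all of them are written
# directly under data/out/.
EXPECTED_OUTPUTS = [
    "ErrorsCrossRefs.txt",
    "report_registry_groups.txt",
    "toCheckGroups.txt",
    "registryGroups.txt",
    "allDuplicateSets.txt",
    "onlyRegistry.txt",
    "onlyDedup.txt",
    "completeOverlap.txt",
    "intersection.txt",
    "newGroups.txt",
    "report.txt",
]


def run_pipeline(repo_root="."):
    """Run both scripts from the repository root, stopping at the first failure."""
    for script in SCRIPTS:
        # check=True raises CalledProcessError if a script exits with a non-zero code.
        subprocess.run([sys.executable, script], cwd=repo_root, check=True)


def missing_outputs(repo_root="."):
    """Return the expected output file names that are absent from data/out/."""
    out_dir = Path(repo_root) / "data" / "out"
    return [name for name in EXPECTED_OUTPUTS if not (out_dir / name).is_file()]
```

Calling `run_pipeline()` and then `missing_outputs()` from the repository root gives a quick check that a run produced everything the README lists. The helper only checks that the files exist, not what they contain.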