Organizations from DB to graph raw: map the PIDs #52
Reference: D-Net/dnet-hadoop#52
Recently, I found out that we have PIDs of organizations provided by re3data. I therefore modified the transformation scripts so that those PIDs end up in the proper tables of the psql database. I was able to map ROR, VIAF and RRID identifiers (as noted in https://issue.openaire.research-infrastructures.eu/issues/6000#note-6).
However, it looks like the mapper from psql to the raw graph does not fetch information from those tables (I checked the query available at dhp-workflows/dhp-graph-mapper/src/main/resources/eu/dnetlib/dhp/oa/graph/sql/queryOrganizations.sql), and the mapper sets a default empty list (dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/raw/MigrateDbEntitiesApplication.java).
The postgres tables to include in the queries that build the array of pids are:
dsm_organizationpids:
- pid: contains the PID (e.g. 'ROR:057g20z61')
- organization: contains the id of the organization (e.g. 're3data::a14b447509065fc5e8b513645ae23a9a')
dsm_identities:
- pid: contains the PID (e.g. 'ROR:057g20z61')
- issuertype: contains the issuer of the pid (e.g. 'ROR')
I already added them on the beta database and I am going to do the same in production.
In the model, organizations' pids are a list of StructuredProperty, so I believe that each pid should be created as:
pid.value = dsm_identities.pid = dsm_organizationpids.pid
pid.qualifier.classid = dsm_identities.issuertype
pid.qualifier.classname = dsm_identities.issuertype
pid.qualifier.schemeid = dnet:pid_types
pid.qualifier.schemename = dnet:pid_types
datainfo should contain the usual values for harvested information.
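To make the proposed mapping concrete, here is a minimal sketch of how one joined row from dsm_organizationpids and dsm_identities could become a pid entry. The classes Qualifier, StructuredProperty and PidRow below are simplified, hypothetical stand-ins written for this example, not the actual model classes, and the helper names toPid and pidsFor are assumptions as well. datainfo is left out on purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrgPidMappingSketch {

    // Hypothetical, simplified stand-in for the model's qualifier.
    static class Qualifier {
        final String classid, classname, schemeid, schemename;
        Qualifier(String classid, String classname, String schemeid, String schemename) {
            this.classid = classid;
            this.classname = classname;
            this.schemeid = schemeid;
            this.schemename = schemename;
        }
    }

    // Hypothetical, simplified stand-in for the model's StructuredProperty (no datainfo here).
    static class StructuredProperty {
        final String value;
        final Qualifier qualifier;
        StructuredProperty(String value, Qualifier qualifier) {
            this.value = value;
            this.qualifier = qualifier;
        }
    }

    // One row of the join between dsm_organizationpids and dsm_identities on pid.
    static class PidRow {
        final String organization, pid, issuertype;
        PidRow(String organization, String pid, String issuertype) {
            this.organization = organization;
            this.pid = pid;
            this.issuertype = issuertype;
        }
    }

    static final String PID_TYPES = "dnet:pid_types";

    // Applies the mapping proposed in this issue to a single row.
    static StructuredProperty toPid(PidRow row) {
        Qualifier q = new Qualifier(row.issuertype, row.issuertype, PID_TYPES, PID_TYPES);
        return new StructuredProperty(row.pid, q);
    }

    // Collects the pids of one organization; an organization without rows gets an empty list.
    static List<StructuredProperty> pidsFor(String organization, List<PidRow> rows) {
        List<StructuredProperty> pids = new ArrayList<>();
        for (PidRow row : rows) {
            if (row.organization.equals(organization)) {
                pids.add(toPid(row));
            }
        }
        return pids;
    }

    public static void main(String[] args) {
        String org = "re3data::a14b447509065fc5e8b513645ae23a9a";
        List<PidRow> rows = Arrays.asList(new PidRow(org, "ROR:057g20z61", "ROR"));
        for (StructuredProperty p : pidsFor(org, rows)) {
            System.out.println(p.value + " [" + p.qualifier.classid + ", " + p.qualifier.schemeid + "]");
        }
    }
}
```

In the real mapper the grouping would more likely happen in the SQL query itself (aggregating the pids per organization), with the Java side only splitting the aggregated values; the in-memory filter above is just for illustration.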
I already ensured that the vocabulary dnet:pid_types contains the entries for ROR, VIAF, and RRID, so the proper pid.qualifier.classname can be obtained via the vocabulary (not sure if you check this kind of thing when importing or if the cleaning is completely left to the final cleaning step).
I made the following PR: #53
PR integrated