Organizations from DB to graph raw: map the PIDs #52

Closed
opened 2020-11-09 12:39:09 +01:00 by alessia.bardi · 2 comments

Recently, I found out that we have PIDs of organizations provided by re3data. I therefore modified the transformation scripts so that those PIDs end up in the proper table of the psql database. I was able to map ROR, VIAF and RRID identifiers (as noted in https://issue.openaire.research-infrastructures.eu/issues/6000#note-6).

However, it looks like the mapper from psql to the raw graph does not fetch information from those tables (I checked the query available at dhp-workflows/dhp-graph-mapper/src/main/resources/eu/dnetlib/dhp/oa/graph/sql/queryOrganizations.sql) and the mapper sets a default empty list (dhp-graph-mapper/src/main/java/eu/dnetlib/dhp/oa/graph/raw/MigrateDbEntitiesApplication.java).

The Postgres tables to include in the queries to create the array of pids are:

  • dsm_organizationpids: pid contains the PID (e.g. 'ROR:057g20z61'), organization contains the id of the organization (e.g. 're3data::a14b447509065fc5e8b513645ae23a9a')
  • dsm_identities: pid contains the PID (e.g. 'ROR:057g20z61'), issuertype contains the issuer of the pid (e.g. 'ROR')
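A minimal sketch of how the two tables could be joined and the result read back, assuming the pids are aggregated into one array column per organization. The table and column names are the ones listed above; the `###` separator, the `pids` alias, and the class and method names are assumptions made for illustration and are not taken from queryOrganizations.sql.

```java
import java.util.ArrayList;
import java.util.List;

class OrganizationPidQuerySketch {

    // Hypothetical sub-query: one array of 'pid###issuertype' strings per organization.
    // Table and column names come from the issue; the alias and separator are assumptions.
    static final String PIDS_BY_ORGANIZATION_SQL =
        "SELECT op.organization AS organization, "
            + "array_agg(DISTINCT i.pid || '###' || i.issuertype) AS pids "
            + "FROM dsm_organizationpids op "
            + "JOIN dsm_identities i ON (i.pid = op.pid) "
            + "GROUP BY op.organization";

    // Splits one aggregated element into {pid, issuertype}; returns null if malformed.
    static String[] splitPid(String element) {
        if (element == null) {
            return null;
        }
        String[] parts = element.split("###");
        if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
            return null;
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    // Parses the whole array column, skipping malformed elements.
    static List<String[]> parsePids(String[] aggregated) {
        List<String[]> result = new ArrayList<>();
        if (aggregated == null) {
            return result;
        }
        for (String element : aggregated) {
            String[] pair = splitPid(element);
            if (pair != null) {
                result.add(pair);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : parsePids(new String[] { "ROR:057g20z61###ROR" })) {
            System.out.println(pair[0] + " issued by " + pair[1]);
        }
    }
}
```

Aggregating on the database side keeps one row per organization, so the mapper would not need a second query per record; whether that fits the existing query is for the assignee to judge.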

I already added them on the beta database and I am going to do the same also in production.

In the model, organizations' pids are a list of StructuredProperty, so I believe that each pid should be created as:

pid.value = dsm_identities.pid = dsm_organizationpids.pid
pid.qualifier.classid = dsm_identities.issuertype
pid.qualifier.classname = dsm_identities.issuertype
pid.qualifier.schemeid = dnet:pid_types
pid.qualifier.schemename = dnet:pid_types
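The mapping above could be sketched roughly as follows. The `Qualifier` and `StructuredProperty` classes here are minimal stand-ins written only to keep the example self-contained; they are not the model classes of the project, and `toPid` is a hypothetical helper name. The datainfo part is left out.

```java
class OrganizationPidMappingSketch {

    static final String PID_TYPES = "dnet:pid_types";

    // Stand-in for the model's qualifier; only the four fields named in the issue.
    static class Qualifier {
        String classid;
        String classname;
        String schemeid;
        String schemename;
    }

    // Stand-in for the model's StructuredProperty, without datainfo.
    static class StructuredProperty {
        String value;
        Qualifier qualifier;
    }

    // Builds one pid from a (pid, issuertype) pair, following the field mapping proposed above.
    static StructuredProperty toPid(String pid, String issuerType) {
        Qualifier q = new Qualifier();
        q.classid = issuerType;
        q.classname = issuerType;
        q.schemeid = PID_TYPES;
        q.schemename = PID_TYPES;

        StructuredProperty sp = new StructuredProperty();
        sp.value = pid;
        sp.qualifier = q;
        return sp;
    }

    public static void main(String[] args) {
        StructuredProperty sp = toPid("ROR:057g20z61", "ROR");
        System.out.println(sp.value + " [" + sp.qualifier.classid + ", " + sp.qualifier.schemeid + "]");
    }
}
```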

datainfo should contain the usual information for harvested data.

I already ensured that the vocabulary dnet:pid_types contains the entries for ROR, VIAF, and RRID, so the proper pid.qualifier.classname can be obtained via the vocabulary (I am not sure whether you check this kind of thing when importing or whether the cleaning is left entirely to the final cleaning step).
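If the classname were resolved at import time, the lookup could look like the sketch below. The vocabulary is modelled as a plain map, and the labels in it are placeholders: the actual names stored in dnet:pid_types are not stated in this issue. `classnameFor` is a hypothetical helper that falls back to the issuer type when the vocabulary has no entry.

```java
import java.util.HashMap;
import java.util.Map;

class PidTypeVocabularySketch {

    // Placeholder content for dnet:pid_types; the real labels may differ.
    static Map<String, String> samplePidTypes() {
        Map<String, String> terms = new HashMap<>();
        terms.put("ROR", "Research Organization Registry");
        terms.put("VIAF", "Virtual International Authority File");
        terms.put("RRID", "Research Resource Identifier");
        return terms;
    }

    // Returns the vocabulary label for the issuer type, or the issuer type itself if unknown.
    static String classnameFor(Map<String, String> vocabulary, String issuerType) {
        String label = vocabulary.get(issuerType);
        return label != null ? label : issuerType;
    }

    public static void main(String[] args) {
        Map<String, String> vocabulary = samplePidTypes();
        System.out.println(classnameFor(vocabulary, "ROR"));
        System.out.println(classnameFor(vocabulary, "unknown-issuer"));
    }
}
```

The fallback keeps the import from failing on issuer types that are missing from the vocabulary, leaving them for the final cleaning step.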

alessia.bardi added the enhancement label 2020-11-09 12:39:09 +01:00
michele.artini was assigned by alessia.bardi 2020-11-09 12:39:09 +01:00

I made the following PR: #53 (https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/53)

PR integrated

Reference: D-Net/dnet-hadoop#52