WIP: 8549_affiliation_extraction #50

Draft
schatz wants to merge 6 commits from 8549_affiliation_extraction into main
Member

This PR introduces the following changes:

  • Restructures the "PIDs and identifiers" page to include PID authorities for all graph entities.
    • file: data-model/pids-and-identifiers.md
  • Adds a dedicated page with information on the "hasAuthorInstitution" relation
    • files: data-model/relationships.md and data-model/relationships/hasAuthorInstitution.md

@claudio.atzori @miriam.baglioni as you can see, I have highlighted some parts of the pages above that need your feedback.
Please provide your input, and let me know if you need anything from me.

claudio.atzori was assigned by schatz 2023-04-04 14:39:09 +02:00
miriam.baglioni was assigned by schatz 2023-04-04 14:39:09 +02:00
thanasis.vergoulis was assigned by schatz 2023-04-04 14:39:09 +02:00
schatz added 3 commits 2023-04-04 14:39:09 +02:00
claudio.atzori reviewed 2023-04-11 15:19:18 +02:00
@ -82,0 +67,4 @@
| uniprot | [Protein Data Bank](http://www.pdb.org/) <span className="todo">[ or EMBL-EBI ?]</span> | `uniprot_____`
| ena | [Protein Data Bank](http://www.pdb.org/) <span className="todo">[ or EMBL-EBI ?]</span> | `ena_________`
| pdb | [Protein Data Bank](http://www.pdb.org/) <span className="todo">[ or EMBL-EBI ?]</span> | `pdb_________`
| handle | Any repository | <span className="todo">`handle______`</span>

This isn't correct. Handles bypass the rule: they are integrated as legitimate PIDs (i.e. set in the fields `result.pid` and `result.instance.pid`), but they do not take part in the creation of the internal identifier, which is built as `datasource specific prefix + md5(localid)`, where the local id is typically the OAI identifier.
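To make the rule above concrete, here is a minimal sketch of how an internal identifier could be derived from a datasource prefix and a local id. The function names, the `::` separator, the sample OAI identifier, and the sample handle are all assumptions made for illustration; only the `prefix + md5(localid)` rule and the 12-character underscore-padded prefixes come from this thread.

```python
import hashlib

PREFIX_WIDTH = 12  # matches the 12-character prefixes in the table, e.g. "opendoar____"


def pad_prefix(name: str) -> str:
    """Right-pad a datasource prefix with underscores to 12 characters."""
    if len(name) > PREFIX_WIDTH:
        raise ValueError(f"prefix longer than {PREFIX_WIDTH} characters: {name!r}")
    return name.ljust(PREFIX_WIDTH, "_")


def internal_id(datasource_prefix: str, local_id: str, separator: str = "::") -> str:
    """Build prefix + md5(local_id). The separator is an assumption, not taken from the thread."""
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{pad_prefix(datasource_prefix)}{separator}{digest}"


# Hypothetical record: the handle is stored as a PID but is never passed to internal_id().
record = {
    "local_id": "oai:example.org:1234",  # hypothetical OAI identifier
    "pid": [{"type": "handle", "value": "20.500.12345/678"}],  # hypothetical handle
}
record["id"] = internal_id("opendoar", record["local_id"])
print(record["id"])
```

In this sketch the handle stays available under `pid`, while the identifier depends only on the datasource prefix and the local id, which is the behaviour described in the comment.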
claudio.atzori reviewed 2023-04-11 15:28:37 +02:00
@ -82,0 +71,4 @@
#### Delegated authorities
<span className="todo">[TODO: the problem that this solves is that we can get a specific PID from more than one auhtoritative sources right ? For example, if we get DOIs from Crossref, Datacite, and Zenodo (btw Zenodo was not mentioned in the first table).

Does your comment refer to the concept of PID authorities, or rather to the concept of delegated authorities? Zenodo is not listed in the table above because it is not a PID authority, whereas it is a delegated authority (for DOIs).

Anyway, the problem that delegated authorities aim to solve, and the solution (which is still in place), are explained in D-Net/dnet-hadoop#187 (https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/187). I hope it helps.
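The distinction between the two concepts could be sketched as two separate lookups, as below. This is only an illustration: the dictionary contents are assumptions based on the sources named in the TODO (Crossref and Datacite) and on the statement that Zenodo is a delegated authority for DOIs, and `authority_role` is a hypothetical helper, not part of dnet-hadoop.

```python
# Illustrative only: the actual lists are defined in the documentation table and in dnet-hadoop.
PID_AUTHORITIES = {"doi": {"crossref", "datacite"}}  # assumed from the TODO text
DELEGATED_AUTHORITIES = {"doi": {"zenodo"}}          # stated in the comment above


def authority_role(pid_type: str, source: str) -> str:
    """Classify a source for a given PID type as PID authority, delegated authority, or neither."""
    pid_type, source = pid_type.lower(), source.lower()
    if source in PID_AUTHORITIES.get(pid_type, set()):
        return "pid_authority"
    if source in DELEGATED_AUTHORITIES.get(pid_type, set()):
        return "delegated_authority"
    return "none"


print(authority_role("doi", "Zenodo"))    # delegated_authority
print(authority_role("doi", "Crossref"))  # pid_authority
```

Keeping the two mappings apart mirrors the point made above: a source can be trusted for a PID type without being the authority that issues it.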
claudio.atzori reviewed 2023-04-11 15:30:52 +02:00
@ -82,0 +72,4 @@
#### Delegated authorities
<span className="todo">[TODO: the problem that this solves is that we can get a specific PID from more than one auhtoritative sources right ? For example, if we get DOIs from Crossref, Datacite, and Zenodo (btw Zenodo was not mentioned in the first table).
Can't we mention those sources by priority in the first table and simply mention in the text that we prefer to collect those PIDs starting from the first till the last one? Is this the problem or I am missing something else here?]</span>

We could list all the prefixes in a single, comprehensive table covering both the prefixes (and sources) considered PID authorities and those considered delegated authorities. However, if we do so, we should explain the two concepts before the table.

I am including a few, but we should decide which datasource registries (prefixes) should be included in the table describing their PIDs.

The frequency of each prefix can be extracted with the following query, which returns 139 entries; the top 10 are shown below:

select count(distinct id) as count, substr(id, 4, 12) as prefix
from openaire_prod_20230410.datasource
where 
    datainfo.deletedbyinference = false and
    datainfo.invisible = false 
group by substr(id, 4, 12)
order by count desc;
count	prefix	
60723	issn___print	
25641	doajarticles	
21660	issn__online	
6219	opendoar____	
2799	tubitakulakb	
2274	re3data_____	
1905	fairsharing_	
1636	openaire____	
1311	eurocrisdris	
305	eosc________	
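For readers less familiar with the identifier layout, the following sketch mirrors what `substr(id, 4, 12)` and the `group by` do in the query above. The sample identifiers, including their leading `10|` and the `::` separator, are hypothetical and only shaped so that the prefix starts at the fourth character.

```python
from collections import Counter


def id_prefix(entity_id: str) -> str:
    """Equivalent of SQL substr(id, 4, 12): skip 3 characters, take the next 12."""
    return entity_id[3:15]


def prefix_frequencies(ids):
    """Count distinct ids per prefix, most frequent first."""
    counts = Counter(id_prefix(i) for i in set(ids))
    return counts.most_common()


# Hypothetical identifiers; the duplicate is counted once, like count(distinct id).
sample_ids = [
    "10|opendoar____::aaa",
    "10|opendoar____::bbb",
    "10|re3data_____::ccc",
    "10|opendoar____::aaa",
]
print(prefix_frequencies(sample_ids))
```

The sketch does not apply the `deletedbyinference` and `invisible` filters, which in the query restrict the count to visible, non-deduplicated records.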

Similarly, for organizations we should describe the most important prefixes (indicating the registry/source from which the organization comes):

select count(distinct id) as count, substr(id, 4, 12) as prefix
from openaire_prod_20230410.organization
where 
    datainfo.deletedbyinference = false and
    datainfo.invisible = false 
group by substr(id, 4, 12)
order by count desc;
count	prefix	
168641	pending_org_	
101359	openorgs____	
8828	anr_________	
6772	microsoft___	
5060	ukri________	
3867	nih_________	
3831	nsf_________	
3619	corda_____he	
2264	corda__h2020	
2242	ror_________	
1708	snsf________	
1667	corda_______	
1662	doajarticles	
968	inca________	
547	re3data_____	
375	orgreg______	
288	fwf_________	
256	fct_________	
243	opendoar____	
233	eurocrisdris	
180	fairsharing_	
159	mestd_______	
78	aka_________	
66	sfi_________	
55	wt__________	
35	openaire____	
21	chistera____	
21	irb_hr______	
10	eosc________	
1	5b15537e653f	
1	a76a5db6484e	
1	asap________	
1	bd3a95190769	
1	euenvagency_	
1	issn02140284	
1	issn09240608	
1	issn10451064	
1	issn1111111x	
1	issn13063839	
1	issn16000870	
1	issn16000889	
1	issn16468287	
1	issn17271584	
1	issn1799649X	
1	issn1871515X	
1	issn19476108	
1	issn20030401	
1	issn20038046	
1	issn20074530	
1	issn2044852X	
1	issn20863225	
1	issn21572100	
1	issn21608288	
1	issn21836914	
1	issn21845417	
1	issn21849927	
1	issn22042482	
1	issn22118179	
1	issn22235604	
1	issn2236210X	
1	issn22500758	
1	issn23064412	
1	issn23190639	
1	issn23190884	
1	issn23195177	
1	issn2359313X	
1	issn23638761	
1	issn23796227	
1	issn23946962	
1	issn24070505	
1	issn24077100	
1	issn24556580	
1	issn24576794	
1	issn25268716	
1	issn25377043	
1	issn25418475	
1	issn25710915	
1	issn25790153	
1	issn25811614	
1	issn25824155	
1	issn25830074	
1	issn25831747	
1	issn25832468	
1	issn25835238	
1	issn2583553X	
1	issn25963856	
1	issn26028085	
1	issn26333716	
1	issn26338815	
1	issn26344580	
1	issn26548143	
1	issn26859556	
1	issn26875675	
1	issn26903865	
1	issn26905450	
1	issn2698217X	
1	issn27069346	
1	issn27087166	
1	issn27094502	
1	issn2709894X	

And last, but not least, for projects:

select count(distinct id) as count, substr(id, 4, 12) as prefix
from openaire_prod_20230410.project
where 
    datainfo.deletedbyinference = false and
    datainfo.invisible = false 
group by substr(id, 4, 12)
order by count desc;
count	prefix	
2135688	nih_________	
567250	nsf_________	
144000	ukri________	
84174	snsf________	
76210	fct_________	
38462	nwo_________	
35434	corda__h2020	
33217	nhmrc_______	
29472	arc_________	
28416	aka_________	
27702	anr_________	
25891	corda_______	
18058	wt__________	
17567	fwf_________	
16609	tubitakf____	
6391	sfi_________	
6242	corda_____he	
4071	irb_hr______	
2206	inca________	
936	mestd_______	
108	chistera____	
35	asap________	
7	euenvagency_	
5	taraexp_____	
1	cihr________	
1	nserc_______	
1	sshrc_______	

claudio.atzori added 2 commits 2023-04-11 16:53:52 +02:00
thanasis.vergoulis added 1 commit 2023-04-21 22:35:44 +02:00
This pull request has changes conflicting with the target branch.
  • docs/data-model/pids-and-identifiers.md
  • docs/data-model/relationships/relationship-types.md
  • docs/graph-production-workflow/enrichment-by-mining/affiliation_matching.md
Reference: D-Net/openaire-graph-docs#50