Claudio Atzori
|
aff3ddc8d2
|
added cleaning for the format field, removing carriage return and tab characters
|
2021-12-14 11:41:46 +01:00 |
Miriam Baglioni
|
936578aaf1
|
Merge branch 'beta' of https://code-repo.d4science.org/D-Net/dnet-hadoop into beta
|
2021-12-13 15:01:47 +01:00 |
Claudio Atzori
|
41c70c607d
|
cleaning workflow assigns the proper default instance type when a value could not be cleaned using the vocabularies
|
2021-12-09 16:44:28 +01:00 |
Claudio Atzori
|
e6e177dda0
|
vocabulary-based cleaning also considers the term label when looking up a synonym
|
2021-12-09 13:57:53 +01:00 |
Miriam Baglioni
|
b113586207
|
resolved conflicts
|
2021-12-07 10:16:14 +01:00 |
Sandro La Bruzzo
|
5d51b3dd4a
|
Merge pull request 'scala_refactor' (#169) from scala_refactor into beta
Reviewed-on: #169
|
2021-12-06 15:33:44 +01:00 |
Miriam Baglioni
|
96a7d46278
|
[Graph Dump] fixed tests
|
2021-12-06 15:06:32 +01:00 |
Sandro La Bruzzo
|
81bf604059
|
[scala-refactor] Module dhp-common:
Moved all scala source into src/main/scala and src/test/scala
|
2021-12-06 11:29:24 +01:00 |
Claudio Atzori
|
9132727793
|
fixed date cleaning test
|
2021-12-06 10:54:05 +01:00 |
Claudio Atzori
|
863a2f9db3
|
avoid filtering OAF records defined as invisible = true
|
2021-12-03 09:08:12 +01:00 |
Miriam Baglioni
|
8905a39bf3
|
merging with branch beta
|
2021-12-02 13:17:29 +01:00 |
Sandro La Bruzzo
|
1e1f5e4fe0
|
minor fix
|
2021-11-25 13:03:17 +01:00 |
Sandro La Bruzzo
|
2164a2a889
|
Datacite: code refactor, introduced a general SparkApplication in Scala from which all the Spark Scala applications have to inherit
Added a few comments to the Datacite transformation code
|
2021-11-25 10:54:13 +01:00 |
Sandro La Bruzzo
|
4542a2338b
|
updated site configuration to deploy on website
|
2021-11-19 13:44:08 +01:00 |
Miriam Baglioni
|
9fae872181
|
[Graph Dump] changed to mirror the changes in the model
|
2021-11-19 11:25:50 +01:00 |
Claudio Atzori
|
62fa61f3cf
|
merge from beta
|
2021-11-19 09:23:42 +01:00 |
Claudio Atzori
|
bd9a43cefd
|
Revert to 4094f2bb9a
|
2021-11-19 09:20:43 +01:00 |
Claudio Atzori
|
82a4e4efae
|
[cleaning wf] fixed methodology to rule out invalid result titles, based on https://support.openaire.eu/issues/7206
|
2021-11-17 14:17:22 +01:00 |
Sandro La Bruzzo
|
60ae874dcb
|
Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into mvn_site_documentation
|
2021-11-17 11:08:34 +01:00 |
Claudio Atzori
|
49f897ef29
|
[cleaning wf] fixed regex used to spot garbage in result titles; adjusted threshold for filtering titles
|
2021-11-16 15:24:23 +01:00 |
Sandro La Bruzzo
|
a1cafaf2e3
|
added mvn site for dnet-hadoop project
|
2021-11-16 15:16:28 +01:00 |
Sandro La Bruzzo
|
aafdffa6b3
|
resolved conflict
|
2021-10-26 09:45:46 +02:00 |
Sandro La Bruzzo
|
034304b33a
|
conflict resolved on merge
|
2021-10-26 09:40:47 +02:00 |
Claudio Atzori
|
6b34ba737e
|
minor
|
2021-10-21 14:16:18 +02:00 |
Sandro La Bruzzo
|
ae4e99a471
|
Adapted workflow of resolution of PID to work into OpenAIRE data workflow
- Added relations in both directions on all Scholexplorer datasources
|
2021-10-20 17:12:16 +02:00 |
Miriam Baglioni
|
c8321ad31a
|
merge with branch beta
|
2021-10-01 12:59:08 +02:00 |
Claudio Atzori
|
663b1556d7
|
manually integrating PR#140 #140
|
2021-09-15 16:40:25 +02:00 |
Claudio Atzori
|
baed5e3337
|
test classes moved in specific components
|
2021-08-13 12:14:47 +02:00 |
Claudio Atzori
|
3359f73fcf
|
cleanup & best practices
|
2021-08-13 12:00:42 +02:00 |
Miriam Baglioni
|
58f241f4a2
|
GetCSV refactoring - changed due to change of input resource
|
2021-08-13 10:04:44 +02:00 |
Miriam Baglioni
|
f3d575f749
|
GetCSV refactoring - changed due to changes in input resource
|
2021-08-13 10:03:57 +02:00 |
Miriam Baglioni
|
a5f6edfa6c
|
GetCSV refactoring - changed to mirror the original model class
|
2021-08-13 09:30:03 +02:00 |
Miriam Baglioni
|
7402daf51a
|
GetCSV refactoring - added dependency to open-csv lib
|
2021-08-12 17:59:19 +02:00 |
Miriam Baglioni
|
733bcaecf6
|
GetCSV refactoring - added test class (all the tests are disabled since they refer to remote resource)
|
2021-08-12 17:58:52 +02:00 |
Miriam Baglioni
|
bfe8f5335c
|
GetCSV refactoring - copied model classes in test path
|
2021-08-12 17:58:14 +02:00 |
Miriam Baglioni
|
6e84b3951f
|
GetCSV refactoring - moving classes to dhp-common that have dependency with GetCSV class (that was located in graph-mapper)
|
2021-08-12 17:57:41 +02:00 |
Miriam Baglioni
|
9650eea497
|
reverting
|
2021-08-11 17:45:48 +02:00 |
Miriam Baglioni
|
cc3d72df0e
|
removing not needed dependency
|
2021-08-11 17:42:01 +02:00 |
Miriam Baglioni
|
f9b6b45d85
|
reverting
|
2021-08-11 17:04:48 +02:00 |
Miriam Baglioni
|
8da3a25cf6
|
merging with branch beta
|
2021-08-11 15:55:34 +02:00 |
Claudio Atzori
|
2ee21da43b
|
suggestions from SonarLint
|
2021-08-11 12:13:22 +02:00 |
Miriam Baglioni
|
6bd1eca7e0
|
merge branch with beta
|
2021-08-05 15:23:32 +02:00 |
Miriam Baglioni
|
ee13da9258
|
merge branch with master
|
2021-08-05 11:34:20 +02:00 |
Miriam Baglioni
|
1d6ac3715b
|
merge branch with beta
|
2021-07-30 11:58:29 +02:00 |
Claudio Atzori
|
a9961a1835
|
[cleaning] title cleaning based on the me.xuender:unidecode library
|
2021-07-28 16:36:33 +02:00 |
Claudio Atzori
|
6dddad86ee
|
[cleaning] title cleaning based on the me.xuender:unidecode library
|
2021-07-28 16:21:29 +02:00 |
Miriam Baglioni
|
74f801b689
|
merging with branch beta
|
2021-07-27 13:18:31 +02:00 |
Miriam Baglioni
|
35e395eae8
|
merge with master
|
2021-07-27 12:34:59 +02:00 |
Miriam Baglioni
|
eb07f7f40f
|
Hosted By Map
|
2021-07-27 12:27:26 +02:00 |
Claudio Atzori
|
bc835d2024
|
[cleaning] fixed filtering function for missing titles
|
2021-07-23 11:56:13 +02:00 |
Claudio Atzori
|
ffdb2a3ea3
|
[cleaning] fixed filtering function for missing titles
|
2021-07-23 11:55:55 +02:00 |
Sandro La Bruzzo
|
62ae36a3d2
|
fixed NPE
|
2021-07-22 15:41:38 +02:00 |
Miriam Baglioni
|
63553a76b3
|
added code to download gold issn list from unibi
|
2021-07-22 12:01:48 +02:00 |
Sandro La Bruzzo
|
d94565862a
|
fixed NPE
|
2021-07-21 21:23:11 +02:00 |
Sandro La Bruzzo
|
31d2d6d41e
|
Scholexplorer: introduction of dedup openaire
|
2021-07-21 18:09:32 +02:00 |
Miriam Baglioni
|
d418c309f5
|
removed the part after part-x- in the file name generated by spark. It was too long and created problems while creating the tar entries
|
2021-07-13 17:11:49 +02:00 |
Sandro La Bruzzo
|
ad50415167
|
Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer
|
2021-06-24 17:20:50 +02:00 |
Claudio Atzori
|
67afd06cd1
|
[cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid
|
2021-06-24 12:10:17 +02:00 |
Sandro La Bruzzo
|
cc0f2b11fb
|
Implemented mapping from pubmed baseline to OAF
|
2021-06-16 14:56:24 +02:00 |
Claudio Atzori
|
2039bb9f5f
|
orcid / orcid_pending cleaning backported from master branch
|
2021-06-14 09:40:50 +02:00 |
Claudio Atzori
|
a900bfb874
|
delegating the date parsing to https://github.com/sisyphsu/dateparser
|
2021-06-11 16:53:01 +02:00 |
Claudio Atzori
|
eb6acfbabc
|
[cleaning] removing non parsable relation.validationDate(s)
|
2021-05-28 10:50:44 +02:00 |
Claudio Atzori
|
9d725efdc1
|
reverted implementation of the mdstore client
|
2021-05-20 18:26:09 +02:00 |
Claudio Atzori
|
23b8883ab1
|
applied intellij code cleanup
|
2021-05-14 10:58:12 +02:00 |
Claudio Atzori
|
d4c3476152
|
mapping datasource.journal only when an issn is available, null otherwise
|
2021-05-11 11:08:54 +02:00 |
Claudio Atzori
|
d1cbee8413
|
imported methods from CleaningFunctions, defined in GraphCleaningFunctions
|
2021-05-10 16:43:39 +02:00 |
Claudio Atzori
|
3797543600
|
MDStoreManager model classes moved in dhp-schemas
|
2021-05-10 14:32:05 +02:00 |
Claudio Atzori
|
b1785ba77c
|
alternative way to set timeouts for the ISLookup client
|
2021-05-05 11:23:46 +02:00 |
Claudio Atzori
|
923d19ea8e
|
mdstore read lock/unlock when bulk copying records from mongodb to hdfs
|
2021-05-04 18:06:21 +02:00 |
Claudio Atzori
|
91e7220f20
|
cleaned up workflow for actionset migration, adjusted dnet|cnr* dependency versions
|
2021-04-29 10:09:52 +02:00 |
Claudio Atzori
|
5afa7d3e0c
|
core utilities in dhp-common moved in external module dhp-schemas
|
2021-04-27 15:44:01 +02:00 |
Claudio Atzori
|
f783e60ff7
|
cleanup
|
2021-04-27 14:04:50 +02:00 |
Claudio Atzori
|
27ab8a704d
|
adjusted poms to align with the external dhp-schema module
|
2021-04-27 10:12:27 +02:00 |
Claudio Atzori
|
c2bb03c8b5
|
depending on external dhp-schemas module
|
2021-04-23 17:57:35 +02:00 |
Claudio Atzori
|
8704d32780
|
code formatting
|
2021-04-15 16:52:58 +02:00 |
Claudio Atzori
|
ba4b4c74d8
|
do not make the identifier prefix depend on the Handle
|
2021-04-15 16:48:26 +02:00 |
Claudio Atzori
|
710cd1e8f2
|
Merge pull request 'add xslt, personname cleaner' (#104) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: #104
LGTM
|
2021-04-13 14:43:05 +02:00 |
Claudio Atzori
|
d1ca025b0b
|
[cleaning] removing authors without fullname or providing 'deactivated' keyword. Removing test test titles
|
2021-04-13 14:32:41 +02:00 |
Andreas Czerniak
|
d7614c1f85
|
introduce new const
|
2021-04-13 07:04:27 +02:00 |
Claudio Atzori
|
902d05f548
|
[cleaning] avoiding NPEs handling null author PIDs
|
2021-04-12 17:31:40 +02:00 |
Claudio Atzori
|
72ce741ea6
|
WIP: using common definitions from ModelConstants
|
2021-03-31 17:07:13 +02:00 |
Claudio Atzori
|
27681b876c
|
code formatting
|
2021-03-29 17:47:11 +02:00 |
miconis
|
2709d08fc2
|
Merge branch 'stable_ids' into openorgswf
|
2021-03-29 16:39:07 +02:00 |
Claudio Atzori
|
3becaa5539
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-29 16:01:35 +02:00 |
Claudio Atzori
|
48f2b6127e
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-29 14:23:18 +02:00 |
miconis
|
2355cc4e9b
|
minor changes and bug fix
|
2021-03-29 10:07:12 +02:00 |
Claudio Atzori
|
b5b7dc2104
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-26 12:30:00 +01:00 |
Claudio Atzori
|
827e7e37db
|
[Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid
|
2021-03-25 11:07:59 +01:00 |
Claudio Atzori
|
431cbe9955
|
handle missing instance.pid during bulk cleaning
|
2021-03-23 09:28:58 +01:00 |
Sandro La Bruzzo
|
c73072079d
|
fix conflicts
|
2021-03-22 16:36:31 +01:00 |
Claudio Atzori
|
3256b9c836
|
code formatting
|
2021-03-19 09:36:12 +01:00 |
Claudio Atzori
|
75144dacb3
|
Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids
|
2021-03-19 09:07:40 +01:00 |
Claudio Atzori
|
9588bfba81
|
[cleaning] entries available as PIDs must not appear as alternateIdentifier
|
2021-03-19 09:07:30 +01:00 |
Sandro La Bruzzo
|
25d5663d97
|
added filter
|
2021-03-18 10:24:42 +01:00 |
Sandro La Bruzzo
|
5f98ea74a9
|
Added fix for pid generation in stableIds
|
2021-03-17 15:53:24 +01:00 |
Claudio Atzori
|
734232d3b9
|
identifier factory doesn't depend on pre-existing entity.id
|
2021-03-17 15:14:53 +01:00 |
Claudio Atzori
|
a3dac32f16
|
pidFilter a bit more permissive
|
2021-03-17 15:06:05 +01:00 |
Claudio Atzori
|
8257f9a2bc
|
result.pid: adjusted the mapping applied to the contents from the aggregator
|
2021-03-17 12:45:38 +01:00 |
Claudio Atzori
|
3b2da86f0a
|
added precondition on IdentifierFactory to check the presence of entity.id
|
2021-03-16 17:05:38 +01:00 |
Claudio Atzori
|
640b885706
|
added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator
|
2021-03-16 14:19:32 +01:00 |