Claudio Atzori
|
bd9a43cefd
|
Revert to 4094f2bb9a
|
2021-11-19 09:20:43 +01:00 |
Claudio Atzori
|
82a4e4efae
|
[cleaning wf] fixed methodology to rule out invalid result titles, based on https://support.openaire.eu/issues/7206
|
2021-11-17 14:17:22 +01:00 |
Sandro La Bruzzo
|
60ae874dcb
|
Merge branch 'beta' of code-repo.d4science.org:D-Net/dnet-hadoop into mvn_site_documentation
|
2021-11-17 11:08:34 +01:00 |
Claudio Atzori
|
49f897ef29
|
[cleaning wf] fixed regex used to spot garbage in result titles; adjusted threshold for filtering titles
|
2021-11-16 15:24:23 +01:00 |
Sandro La Bruzzo
|
a1cafaf2e3
|
added mvn site for dnet-hadoop project
|
2021-11-16 15:16:28 +01:00 |
Sandro La Bruzzo
|
aafdffa6b3
|
resolved conflict
|
2021-10-26 09:45:46 +02:00 |
Sandro La Bruzzo
|
034304b33a
|
conflict resolved on merge
|
2021-10-26 09:40:47 +02:00 |
Claudio Atzori
|
6b34ba737e
|
minor
|
2021-10-21 14:16:18 +02:00 |
Sandro La Bruzzo
|
ae4e99a471
|
Adapted workflow of resolution of PID to work into OpenAIRE data workflow
- Added relations in both directions on all Scholexplorer datasources
|
2021-10-20 17:12:16 +02:00 |
Miriam Baglioni
|
c8321ad31a
|
merge with branch beta
|
2021-10-01 12:59:08 +02:00 |
Claudio Atzori
|
663b1556d7
|
manually integrating PR#140 D-Net/dnet-hadoop#140
|
2021-09-15 16:40:25 +02:00 |
Claudio Atzori
|
baed5e3337
|
test classes moved in specific components
|
2021-08-13 12:14:47 +02:00 |
Claudio Atzori
|
3359f73fcf
|
cleanup & best practices
|
2021-08-13 12:00:42 +02:00 |
Miriam Baglioni
|
58f241f4a2
|
GetCSV refactoring - changed due to change of input resource
|
2021-08-13 10:04:44 +02:00 |
Miriam Baglioni
|
f3d575f749
|
GetCSV refactoring - changed due to changes in input resource
|
2021-08-13 10:03:57 +02:00 |
Miriam Baglioni
|
a5f6edfa6c
|
GetCSV refactoring - changed to mirror the original model class
|
2021-08-13 09:30:03 +02:00 |
Miriam Baglioni
|
7402daf51a
|
GetCSV refactoring - added dependency to open-csv lib
|
2021-08-12 17:59:19 +02:00 |
Miriam Baglioni
|
733bcaecf6
|
GetCSV refactoring - added test class (all the tests are disabled since they refer to remote resource)
|
2021-08-12 17:58:52 +02:00 |
Miriam Baglioni
|
bfe8f5335c
|
GetCSV refactoring - copied model classes in test path
|
2021-08-12 17:58:14 +02:00 |
Miriam Baglioni
|
6e84b3951f
|
GetCSV refactoring - moving classes to dhp-common that have dependency with GetCSV class (that was located in graph-mapper)
|
2021-08-12 17:57:41 +02:00 |
Miriam Baglioni
|
9650eea497
|
reverting
|
2021-08-11 17:45:48 +02:00 |
Miriam Baglioni
|
cc3d72df0e
|
removing not needed dependency
|
2021-08-11 17:42:01 +02:00 |
Miriam Baglioni
|
f9b6b45d85
|
reverting
|
2021-08-11 17:04:48 +02:00 |
Miriam Baglioni
|
8da3a25cf6
|
merging with branch beta
|
2021-08-11 15:55:34 +02:00 |
Claudio Atzori
|
2ee21da43b
|
suggestions from SonarLint
|
2021-08-11 12:13:22 +02:00 |
Miriam Baglioni
|
6bd1eca7e0
|
merge branch with beta
|
2021-08-05 15:23:32 +02:00 |
Miriam Baglioni
|
ee13da9258
|
merge branch with master
|
2021-08-05 11:34:20 +02:00 |
Miriam Baglioni
|
1d6ac3715b
|
merge branch with beta
|
2021-07-30 11:58:29 +02:00 |
Claudio Atzori
|
a9961a1835
|
[cleaning] title cleaning based on the me.xuender:unidecode library
|
2021-07-28 16:36:33 +02:00 |
Claudio Atzori
|
6dddad86ee
|
[cleaning] title cleaning based on the me.xuender:unidecode library
|
2021-07-28 16:21:29 +02:00 |
Miriam Baglioni
|
74f801b689
|
merging with branch beta
|
2021-07-27 13:18:31 +02:00 |
Miriam Baglioni
|
35e395eae8
|
merge with master
|
2021-07-27 12:34:59 +02:00 |
Miriam Baglioni
|
eb07f7f40f
|
Hosted By Map
|
2021-07-27 12:27:26 +02:00 |
Claudio Atzori
|
bc835d2024
|
[cleaning] fixed filtering function for missing titles
|
2021-07-23 11:56:13 +02:00 |
Claudio Atzori
|
ffdb2a3ea3
|
[cleaning] fixed filtering function for missing titles
|
2021-07-23 11:55:55 +02:00 |
Sandro La Bruzzo
|
62ae36a3d2
|
fixed NPE
|
2021-07-22 15:41:38 +02:00 |
Miriam Baglioni
|
63553a76b3
|
added code to download gold issn list from unibi
|
2021-07-22 12:01:48 +02:00 |
Sandro La Bruzzo
|
d94565862a
|
fixed NPE
|
2021-07-21 21:23:11 +02:00 |
Sandro La Bruzzo
|
31d2d6d41e
|
Scholexplorer: introduction of dedup openaire
|
2021-07-21 18:09:32 +02:00 |
Miriam Baglioni
|
d418c309f5
|
removed the part after part-x- in the file name generated by spark. It was too long and created problems while creating the tar entries
|
2021-07-13 17:11:49 +02:00 |
Sandro La Bruzzo
|
ad50415167
|
Merge remote-tracking branch 'origin/stable_ids' into stable_id_scholexplorer
|
2021-06-24 17:20:50 +02:00 |
Claudio Atzori
|
67afd06cd1
|
[cleaning] cleaning instance.pid and instance.alternateidentifier using the same procedure used to clean result.pid
|
2021-06-24 12:10:17 +02:00 |
Sandro La Bruzzo
|
cc0f2b11fb
|
Implemented mapping from pubmed baseline to OAF
|
2021-06-16 14:56:24 +02:00 |
Claudio Atzori
|
2039bb9f5f
|
orcid / orcid_pending cleaning backported from master branch
|
2021-06-14 09:40:50 +02:00 |
Claudio Atzori
|
a900bfb874
|
delegating the date parsing to https://github.com/sisyphsu/dateparser
|
2021-06-11 16:53:01 +02:00 |
Claudio Atzori
|
eb6acfbabc
|
[cleaning] removing non parsable relation.validationDate(s)
|
2021-05-28 10:50:44 +02:00 |
Claudio Atzori
|
9d725efdc1
|
reverted implementation of the mdstore client
|
2021-05-20 18:26:09 +02:00 |
Claudio Atzori
|
23b8883ab1
|
applied intellij code cleanup
|
2021-05-14 10:58:12 +02:00 |
Claudio Atzori
|
d4c3476152
|
mapping datasource.journal only when an issn is available, null otherwise
|
2021-05-11 11:08:54 +02:00 |
Claudio Atzori
|
d1cbee8413
|
imported methods from CleaningFunctions, defined in GraphCleaningFunctions
|
2021-05-10 16:43:39 +02:00 |
Claudio Atzori
|
3797543600
|
MDStoreManager model classes moved in dhp-schemas
|
2021-05-10 14:32:05 +02:00 |
Claudio Atzori
|
b1785ba77c
|
alternative way to set timeouts for the ISLookup client
|
2021-05-05 11:23:46 +02:00 |
Claudio Atzori
|
923d19ea8e
|
mdstore read lock/unlock when bulk copying records from mongodb to hdfs
|
2021-05-04 18:06:21 +02:00 |
Claudio Atzori
|
91e7220f20
|
cleaned up workflow for actionset migration, adjusted dnet|cnr* dependency versions
|
2021-04-29 10:09:52 +02:00 |
Claudio Atzori
|
5afa7d3e0c
|
core utilities in dhp-common moved in external module dhp-schemas
|
2021-04-27 15:44:01 +02:00 |
Claudio Atzori
|
f783e60ff7
|
cleanup
|
2021-04-27 14:04:50 +02:00 |
Claudio Atzori
|
27ab8a704d
|
adjusted poms to align with the external dhp-schema module
|
2021-04-27 10:12:27 +02:00 |
Claudio Atzori
|
c2bb03c8b5
|
depending on external dhp-schemas module
|
2021-04-23 17:57:35 +02:00 |
Claudio Atzori
|
8704d32780
|
code formatting
|
2021-04-15 16:52:58 +02:00 |
Claudio Atzori
|
ba4b4c74d8
|
do not make the identifier prefix depend on the Handle
|
2021-04-15 16:48:26 +02:00 |
Claudio Atzori
|
710cd1e8f2
|
Merge pull request 'add xslt, personname cleaner' (#104) from andreas.czerniak/BrStableId_dnet-hadoop:stable_ids into stable_ids
Reviewed-on: D-Net/dnet-hadoop#104
LGTM
|
2021-04-13 14:43:05 +02:00 |
Claudio Atzori
|
d1ca025b0b
|
[cleaning] removing authors without fullname or providing 'deactivated' keyword. Removing test test titles
|
2021-04-13 14:32:41 +02:00 |
Andreas Czerniak
|
d7614c1f85
|
introduce new const
|
2021-04-13 07:04:27 +02:00 |
Claudio Atzori
|
902d05f548
|
[cleaning] avoiding NPEs handling null author PIDs
|
2021-04-12 17:31:40 +02:00 |
Claudio Atzori
|
72ce741ea6
|
WIP: using common definitions from ModelConstants
|
2021-03-31 17:07:13 +02:00 |
Claudio Atzori
|
27681b876c
|
code formatting
|
2021-03-29 17:47:11 +02:00 |
miconis
|
2709d08fc2
|
Merge branch 'stable_ids' into openorgswf
|
2021-03-29 16:39:07 +02:00 |
Claudio Atzori
|
3becaa5539
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-29 16:01:35 +02:00 |
Claudio Atzori
|
48f2b6127e
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-29 14:23:18 +02:00 |
miconis
|
2355cc4e9b
|
minor changes and bug fix
|
2021-03-29 10:07:12 +02:00 |
Claudio Atzori
|
b5b7dc2104
|
[Cleaning] drop alternate identifiers with empty values
|
2021-03-26 12:30:00 +01:00 |
Claudio Atzori
|
827e7e37db
|
[Cleaning] drop instance.alternateIdentifier elements when they are available among instance.pid
|
2021-03-25 11:07:59 +01:00 |
Claudio Atzori
|
431cbe9955
|
handle missing instance.pid during bulk cleaning
|
2021-03-23 09:28:58 +01:00 |
Sandro La Bruzzo
|
c73072079d
|
fix conflicts
|
2021-03-22 16:36:31 +01:00 |
Claudio Atzori
|
3256b9c836
|
code formatting
|
2021-03-19 09:36:12 +01:00 |
Claudio Atzori
|
75144dacb3
|
Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids
|
2021-03-19 09:07:40 +01:00 |
Claudio Atzori
|
9588bfba81
|
[cleaning] entries available as PIDs must not appear as alternateIdentifier
|
2021-03-19 09:07:30 +01:00 |
Sandro La Bruzzo
|
25d5663d97
|
added filter
|
2021-03-18 10:24:42 +01:00 |
Sandro La Bruzzo
|
5f98ea74a9
|
Added fix for pid generation in stableIds
|
2021-03-17 15:53:24 +01:00 |
Claudio Atzori
|
734232d3b9
|
identifier factory doesn't depend on pre-existing entity.id
|
2021-03-17 15:14:53 +01:00 |
Claudio Atzori
|
a3dac32f16
|
pidFilter a bit more permissive
|
2021-03-17 15:06:05 +01:00 |
Claudio Atzori
|
8257f9a2bc
|
result.pid: adjusted the mapping applied to the contents from the aggregator
|
2021-03-17 12:45:38 +01:00 |
Claudio Atzori
|
3b2da86f0a
|
added precondition on IdentifierFactory to check the presence of entity.id
|
2021-03-16 17:05:38 +01:00 |
Claudio Atzori
|
640b885706
|
added instance.alternativeIdentifiers to the graph model, adjusted the mapping applied to the contents from the aggregator
|
2021-03-16 14:19:32 +01:00 |
Claudio Atzori
|
f74e464942
|
create bestaccessright as Qualifier
|
2021-03-10 15:40:05 +01:00 |
Claudio Atzori
|
c801ab6c1d
|
minor
|
2021-03-09 17:22:31 +01:00 |
Claudio Atzori
|
9917d7e01c
|
PID authorities include ArXiv
|
2021-03-09 17:12:52 +01:00 |
Claudio Atzori
|
01630f638d
|
IdentifierFactory implementation based on the list of datasources authoritative for a given pid type
|
2021-03-09 17:11:50 +01:00 |
Claudio Atzori
|
b3f3b895e5
|
[#6282 open access status in the Graph] OAStatus renamed as openAccessRoute
|
2021-03-09 11:41:11 +01:00 |
Claudio Atzori
|
765f9bdee7
|
merged from dhp_oaf_model
|
2021-03-09 11:37:41 +01:00 |
Claudio Atzori
|
d525785497
|
[#6282 open access status in the Graph] Result.Instance.accessRight defined with dedicated data type that includes the open access color.
|
2021-03-09 11:12:55 +01:00 |
Claudio Atzori
|
8d2bb24512
|
merged from master
|
2021-03-08 15:44:34 +01:00 |
Claudio Atzori
|
fa7930d2e2
|
merging contributions from PR#97
|
2021-03-05 15:45:28 +01:00 |
Claudio Atzori
|
ec80b7ade3
|
code formatting
|
2021-03-03 10:22:53 +01:00 |
Claudio Atzori
|
b73dce3e3a
|
more logging on the MDStore mongodb client. Forcing UTF_8 encoding on the content
|
2021-03-03 10:17:16 +01:00 |
Claudio Atzori
|
e76c4f62c1
|
MetadataRecord moved in dhp-schemas
|
2021-02-26 10:58:48 +01:00 |
Claudio Atzori
|
b830e33392
|
mdstore collector plugin
|
2021-02-25 12:30:30 +01:00 |
Claudio Atzori
|
dc98c39500
|
more logging
|
2021-02-25 12:29:18 +01:00 |
Claudio Atzori
|
fc3fa5e343
|
implemented mdstore collector plugin
|
2021-02-24 15:07:24 +01:00 |
Claudio Atzori
|
cf27905a71
|
WIP: collectorWorker error reporting, added report messages
|
2021-02-16 16:53:14 +01:00 |
Claudio Atzori
|
58288a95b8
|
WIP: collectorWorker error reporting, added report messages
|
2021-02-15 15:28:53 +01:00 |
Claudio Atzori
|
1abe6d1ad7
|
WIP: collectorWorker error reporting, added report messages
|
2021-02-15 15:08:59 +01:00 |
Claudio Atzori
|
29c6f7e255
|
classes related to the collection workflow moved into common package; implemented MongoDB collection plugins
|
2021-02-12 12:31:02 +01:00 |
Claudio Atzori
|
50add4c61b
|
added requestDelay to HttpConnector2 configuration; Aggregation workflow constants moved in dhp-common
|
2021-02-08 12:19:38 +01:00 |
Claudio Atzori
|
40df0f987d
|
better logging, WIP: collectorWorker error reporting; common functions moved in DHPUtils
|
2021-02-06 20:12:00 +01:00 |
Claudio Atzori
|
a8a758925e
|
better logging, WIP: collectorWorker error reporting
|
2021-02-05 19:18:05 +01:00 |
Michele Artini
|
2ee0c3e47e
|
http entity as json string
|
2021-02-05 09:45:39 +01:00 |
Claudio Atzori
|
730973679a
|
Merge branch 'hadoop_aggregator' of https://code-repo.d4science.org/D-Net/dnet-hadoop into hadoop_aggregator
|
2021-02-04 17:25:00 +01:00 |
Claudio Atzori
|
deb85706db
|
imported HttpConnector from https://svn.driver.research-infrastructures.eu/driver/dnet45/modules/dnet-modular-collector-service/trunk/src/main/java/eu/dnetlib/data/collector/plugins/HttpConnector.java as HttpConnector2
|
2021-02-04 17:24:52 +01:00 |
Sandro La Bruzzo
|
4dae5e605d
|
implemented messaging between collection worker and Dnet
|
2021-02-04 15:51:15 +01:00 |
Claudio Atzori
|
72c57b28fa
|
switched project version to 1.2.4-branch_hadoop_aggregator-SNAPSHOT
|
2021-02-04 14:08:18 +01:00 |
Claudio Atzori
|
40764cf626
|
better logging, WIP: collectorWorker error reporting
|
2021-02-04 14:06:02 +01:00 |
Michele Artini
|
26d2eb946f
|
messages sender
|
2021-02-04 09:45:46 +01:00 |
Michele Artini
|
1b9731632b
|
Message Sender
|
2021-02-03 16:42:36 +01:00 |
Michele Artini
|
820d729e99
|
recover of Message and MessageType class
|
2021-02-03 16:20:34 +01:00 |
Claudio Atzori
|
0e8a4f9f1a
|
better logging, WIP: collectorWorker error reporting
|
2021-02-03 12:33:41 +01:00 |
Claudio Atzori
|
d62ea1490d
|
cleaned up RabbitMQ stuff
|
2021-02-02 10:53:19 +01:00 |
Claudio Atzori
|
73d772a4b4
|
added method to list the known vocabulary names
|
2021-02-02 10:39:47 +01:00 |
Claudio Atzori
|
8eaa1fd4b4
|
WIP: metadata collection in INCREMENTAL mode and relative test
|
2021-02-01 19:29:10 +01:00 |
Sandro La Bruzzo
|
6ff234d81b
|
Implemented a first prototype of incremental harvesting and transformation using readlock
|
2021-02-01 13:56:05 +01:00 |
Sandro La Bruzzo
|
0276180039
|
WIP mdstore
transaction implemented on hadoop side
|
2021-01-29 16:42:41 +01:00 |
Michele Artini
|
d942d0c77d
|
methods toString(), hashCode() and equals()
|
2021-01-29 13:16:48 +01:00 |
Michele Artini
|
38f2508c87
|
new fields in mdstore beans
|
2021-01-28 08:24:45 +01:00 |
Sandro La Bruzzo
|
a54848a59c
|
Moved Vocabulary stuff to common module
|
2021-01-25 15:43:04 +01:00 |
Claudio Atzori
|
28460c2cd1
|
using com.fasterxml.jackson.databind.ObjectMapper instead of org.codehaus.jackson.map.ObjectMapper
|
2020-12-23 16:59:52 +01:00 |
Claudio Atzori
|
6848d0c3d7
|
trivial: avoid duplicated code
|
2020-12-23 12:21:58 +01:00 |
Claudio Atzori
|
d8b5f43a7e
|
code formatting
|
2020-12-22 14:59:03 +01:00 |
miconis
|
794e22b09c
|
bug fix in the authormerge: now authors with higher size have priority, normalization of author name fixed
|
2020-12-21 17:51:42 +01:00 |
Claudio Atzori
|
12e2f930c8
|
resolved conflicts
|
2020-12-10 10:57:39 +01:00 |
Alessia Bardi
|
112da6d76a
|
in theory, just auto-formatting after mvn compile
|
2020-12-09 20:00:27 +01:00 |
Miriam Baglioni
|
6fbc67a959
|
using ModelConstant.ORCID and removing not used constants
|
2020-12-09 17:10:20 +01:00 |
Claudio Atzori
|
3c5ce1dada
|
code formatting
|
2020-12-09 17:07:20 +01:00 |
Miriam Baglioni
|
212b52614f
|
added graph mapper versus community result without context and project in common to be used for the doiboost mapping
|
2020-12-09 16:59:02 +01:00 |
Claudio Atzori
|
491ad24750
|
introduced filtering for DOIs in graph cleaning workflow
|
2020-12-09 09:10:33 +01:00 |
Claudio Atzori
|
943b961cf6
|
introduced PidBlacklist
|
2020-12-02 09:30:34 +01:00 |
Claudio Atzori
|
893ac4a77b
|
GenerateEntitiesApplication can be configured to hash the id value or not
|
2020-12-02 09:30:06 +01:00 |
Claudio Atzori
|
349e7246aa
|
do not consider NCID, GBIF as PIDs candidate for the ID creation
|
2020-11-30 16:52:40 +01:00 |
Claudio Atzori
|
2c407e775e
|
GenerateEntitiesApplication can be configured to hash the id value or not
|
2020-11-30 12:00:38 +01:00 |
Claudio Atzori
|
758d27745d
|
cleaning tab characters from text fields
|
2020-11-27 16:07:24 +01:00 |
Claudio Atzori
|
596a2a459d
|
added testing class for OafMapperUtils
|
2020-11-27 12:01:11 +01:00 |
Claudio Atzori
|
fa66e5b6b8
|
ResultTypeComparator gives priority to Records collectedfrom Crossref
|
2020-11-26 13:09:19 +01:00 |
Claudio Atzori
|
d0d5525d40
|
minor changes
|
2020-11-26 11:04:17 +01:00 |
Miriam Baglioni
|
66c0e3e574
|
changed because of D-Net/dnet-hadoop#61 (comment)
|
2020-11-25 17:52:17 +01:00 |
Claudio Atzori
|
1372a4d1bf
|
fixed merging method
|
2020-11-25 16:05:51 +01:00 |
Claudio Atzori
|
dfd6205b95
|
Consistency graph workflow merges all the entities by ID
|
2020-11-25 14:55:32 +01:00 |
Claudio Atzori
|
e1a1bb3ee4
|
moved class CleaningFunctions in the correct package. Remove newlines from titles, descriptions, subjects
|
2020-11-24 18:34:03 +01:00 |
Claudio Atzori
|
e43ab07af6
|
code formatting
|
2020-11-24 14:41:39 +01:00 |
Miriam Baglioni
|
73dbb79602
|
removed the check for the community name in the common version on MakeTar
|
2020-11-24 14:36:15 +01:00 |
Claudio Atzori
|
c016cc050a
|
IdentifierFactory: in case a record provides more than one pid of the same type, the lexicographically lower value is chosen as best pick
|
2020-11-23 19:16:40 +01:00 |
Claudio Atzori
|
3f34757c63
|
merged from master
|
2020-11-19 14:34:54 +01:00 |