SWH integration produced no data #377

Open
opened 2024-01-16 12:43:32 +01:00 by claudio.atzori · 0 comments

The latest executions of the SWH integration workflow on BETA and PROD produced empty actionsets.

Note that the workflow currently expects two redundant input parameters:

  • hiveDbName
  • graphPath

Both contain the same content, but the first (hiveDbName) should be preferred: the HDFS graph path might no longer exist upon consecutive content update cycles, while the Hive databases have a longer retention timeframe.
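
To make the intended precedence concrete, below is a minimal sketch of how the input could be resolved. The helper name resolveSoftwareInput and the software table/subpath are hypothetical illustrations, not taken from the workflow code:

```
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SoftwareInputResolver {

    // Hypothetical helper: prefer the Hive DB over the HDFS graph path,
    // since the graph path may have been cleaned up between update cycles.
    public static Dataset<Row> resolveSoftwareInput(
            SparkSession spark, String hiveDbName, String graphPath) throws Exception {
        if (hiveDbName != null && !hiveDbName.isEmpty()) {
            // e.g. openaire_beta_20231208.software ("software" table name is assumed)
            return spark.table(hiveDbName + ".software");
        }
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        if (fs.exists(new Path(graphPath + "/software"))) {
            return spark.read().text(graphPath + "/software");
        }
        throw new IllegalStateException(
            "Neither hiveDbName nor graphPath provided a usable software input");
    }
}
```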

For the sake of reproducibility, the input parameters used in the last BETA execution are listed below:

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>yarnRM</value>
  </property>
  <property>
    <name>apiAccessToken</name>
    <value>xxxx</value>
  </property>
  <property>
    <name>sparkExecutorCores</name>
    <value>4</value>
  </property>
  <property>
    <name>sparkExtraOPT</name>
    <value>--conf spark.extraListeners="com.cloudera.spark.lineage.NavigatorAppListener" --conf spark.sql.queryExecutionListeners="com.cloudera.spark.lineage.NavigatorQueryListener"</value>
  </property>
  <property>
    <name>archiveRequestsPath</name>
    <value>/tmp/beta_provision/working_dir/swh/3_archive_requests.seq</value>
  </property>
  <property>
    <name>retryDelay</name>
    <value>1</value>
  </property>
  <property>
    <name>graphPath</name>
    <value>/tmp/beta_provision/graph/21_graph_blacklisted</value>
  </property>
  <property>
    <name>requestDelay</name>
    <value>100</value>
  </property>
  <property>
    <name>hiveDbName</name>
    <value>openaire_beta_20231208</value>
  </property>
  <property>
    <name>workingDir</name>
    <value>/tmp/beta_provision/working_dir/swh</value>
  </property>
  <property>
    <name>resumeFrom</name>
    <value>collect-software-repository-urls</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>/lib/dnet/BETA/actionmanager/swh/oozie_app</value>
  </property>
  <property>
    <name>sparkSqlWarehouseDir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>sparkDriverMemory</name>
    <value>4G</value>
  </property>
  <property>
    <name>user.name</name>
    <value>dnet.beta</value>
  </property>
  <property>
    <name>maxNumberOfRetry</name>
    <value>2</value>
  </property>
  <property>
    <name>actionsetsPath</name>
    <value>/var/lib/dnet/actionManager_BETA/swh-entities-software/rawset_832434e0-019c-48da-9892-54022518bdcb_1705401222733</value>
  </property>
  <property>
    <name>lastVisitsPath</name>
    <value>/tmp/beta_provision/working_dir/swh/2_last_visits.seq</value>
  </property>
  <property>
    <name>sparkExecutorMemory</name>
    <value>7G</value>
  </property>
  <property>
    <name>queueName</name>
    <value>default</value>
  </property>
  <property>
    <name>softwareCodeRepositoryURLs</name>
    <value>/tmp/beta_provision/working_dir/swh/1_code_repo_urls.csv</value>
  </property>
  <property>
    <name>softwareLimit</name>
    <value>5000</value>
  </property>
</configuration>
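
As a sanity check on the output, one could count the records in the produced rawset at actionsetsPath. This is a hedged sketch that assumes the actionset is stored as a SequenceFile of Text key/value pairs; that storage format is an assumption, not verified against the actionmanager code:

```
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ActionsetCounter {

    public static void main(String[] args) {
        // Path taken from the actionsetsPath parameter above.
        String actionsetsPath =
            "/var/lib/dnet/actionManager_BETA/swh-entities-software/"
                + "rawset_832434e0-019c-48da-9892-54022518bdcb_1705401222733";

        SparkSession spark = SparkSession.builder()
            .appName("swh-actionset-count")
            .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Assumption: actionsets are SequenceFiles with Text keys and values.
        long records = sc.sequenceFile(actionsetsPath, Text.class, Text.class).count();
        System.out.println("Records in actionset: " + records);

        spark.stop();
    }
}
```

A count of 0 here would confirm the empty-actionset symptom reported above.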

claudio.atzori added the
bug
label 2024-01-16 12:43:32 +01:00
schatz was assigned by claudio.atzori 2024-01-16 12:43:32 +01:00
claudio.atzori added this to the FAIRCORE4EOSC project 2024-01-16 12:43:32 +01:00
Reference: D-Net/dnet-hadoop#377