Research Database ETL

Project One Liner
ETL
BI Dashboard
Date
Dec 31, 2024
Featured
Yes
Type
Projects
Year
2025
URL

đź”™

PolyU Research Landscape & Collaboration Analysis (2020-2025)

Project Overview

  • Objective: Analyze PolyU's research performance and global collaboration patterns (2020-2025) using Scopus and OpenAlex data.
  • Deliverable: An interactive Power BI dashboard providing strategic insights into publication trends, citation impact, and key research areas etc.

Key Questions Addressed

  • What are the trends in PolyU's publication volume and output types (articles, reviews, etc.) over the last 5 years?
  • How has the citation impact (average citations, FWCI) of PolyU's research evolved? Which papers are most influential?
  • What is the Open Access publishing rate, and how does it correlate with impact?
  • What are the main research topics PolyU focuses on, and have these shifted?
  • What are the patterns of research collaboration (internal, domestic HK, international)?

Data Sources

  • Scopus API: Used initially to extract a comprehensive list of PolyU publications (DOIs, basic metadata) within the 2020-2025 timeframe and specific document types as baseline dataset.
  • OpenAlex API: Used extensively via Python (pyalex library) to retrieve rich, detailed metadata for each publication identified via its DOI (Digital Object Identifier) extracted from SCOPUS

Implementation (ETL Process)

  • Extract: Retrieved publication, citation count, author, institution, funder data from SCOPUS and OpenAlex Database for the past 5 years, 2020-2025, using SCOPUS and OpenAlex API through their Python library (pybliometrics and pyAlex).
  • Transform: Processed complex API data using Python (pandas), normalizing it into a structured, relational format suitable for database upload, efficient analysis and large data volume management.
  • Load: Loaded the normalized tables into a local Postgres Dtabase and connected Power BI directly for interactive reporting and visualization.

Database Schema (’Publications’ Table as Fact Table)

Remarks:

  1. Schema might take 10 seconds to load
  1. Hover over Tables to explore their relations. It’s interactive!

Data Visualization (2-page Power BI Report)

  • Page 1: PolyU Research Profile
  • Page 2: Research Output Analysis

Try interacting with the dashboard! Click directly on the data fields👆

 

Future Enhancements

  1. Cloud Migration: Migrate database to the cloud (e.g. AWS Redshift Datawarehouse) for improved scalability, accessibility, and integration.
  1. Automated ETL: Implement an automated pipeline for periodic data updates (e.g., weekly or monthly) and auto-refreshed dashboards.
  1. Performance Tracking: Integrate Key Performance Indicators (KPIs) into the dashboard for strategic progress monitoring against goals and historical trends.

Github Repository