Load to a data warehouse
In many data warehouse and document store applications, you can load the OpenAlex entities as-is and query them directly. We'll use BigQuery as an example here. (Elasticsearch docs coming soon.) To follow along you'll need the Google Cloud SDK. You'll also need a Google account that can make BigQuery tables that are, well… big. Which means it probably won't be free.
We'll show you how to do this in 4 steps:
Create a BigQuery Project and Dataset to hold your tables
Create the tables that will hold your entity JSON records
Copy the data files to the tables you created
Run some queries on the data you loaded
Step 1: Create a BigQuery Project and Dataset
In BigQuery, you need a Project and Dataset to hold your tables. We'll call the project "openalex-demo" and the dataset "openalex". Follow the linked instructions to create the Project, then create the dataset inside it:
bq mk openalex-demo:openalex
Dataset 'openalex-demo:openalex' successfully created
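If you want to confirm the dataset exists before moving on, bq ls lists the datasets in a project (an optional sanity check):
bq ls openalex-demo: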
Step 2: Create tables for each entity type
Now, we'll create tables inside the dataset. There will be six tables, one for each entity type. Since we're using JSON, each table will have just one text column, named after the entity it holds.
bq mk --table openalex-demo:openalex.works work:string
Table 'openalex-demo:openalex.works' successfully created.
bq mk --table openalex-demo:openalex.authors author:string
Table 'openalex-demo:openalex.authors' successfully created.
and so on for sources, institutions, concepts, and publishers.
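If you'd rather not type six nearly identical commands, a short shell loop can create all the tables at once. This is just a sketch; it assumes the naming pattern above holds for every entity (plural table name, singular column name):
for entity in work author source institution concept publisher;
do
    # table is the plural form (works), column is the singular (work)
    bq mk --table "openalex-demo:openalex.${entity}s" "${entity}:string";
done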
Step 3: Load the data files
We'll load each table's data from the JSON Lines files we downloaded earlier. For works, the files were:
openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz
openalex-snapshot/data/works/updated_date=2021-12-28/0001_part_00.gz
Here's a command to load one works file (don't run it yet):
bq load \
--project_id openalex-demo \
--source_format=CSV -F '\t' \
--schema 'work:string' \
openalex.works \
'openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz'
This part of the command may need some explanation:
--source_format=CSV -F '\t' --schema 'work:string'
BigQuery expects multiple columns with predefined datatypes (a "schema"). We're tricking it into accepting a single text column (--schema 'work:string') by specifying CSV format (--source_format=CSV) with a column delimiter that isn't present in the file (-F '\t'; \t means "tab").
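To see why this works, peek at the first record of one file (an optional check, using the example file path above): each line is a single JSON object, with no unescaped tab characters.
zcat openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz | head -n 1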
bq load can only handle one file at a time, so you must run this command once per file. But remember that the real dataset will have many more files than this example does, so it's impractical to copy, edit, and rerun the command each time. It's easier to handle all the files in a loop, like this:
for data_file in openalex-snapshot/data/works/*/*.gz;
do
    bq load --source_format=CSV -F '\t' \
        --schema 'work:string' \
        --project_id openalex-demo \
        openalex.works "$data_file";
done
Do this once per entity type, substituting each entity name for work/works as needed (or script the whole thing, as in the sketch below). When you're finished, you'll have six tables: works, authors, sources, institutions, concepts, and publishers.
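Here's a sketch of a nested loop that loads every entity type in one pass, again assuming the plural-table/singular-column naming used above:
for entity in work author source institution concept publisher;
do
    for data_file in openalex-snapshot/data/${entity}s/*/*.gz;
    do
        bq load --source_format=CSV -F '\t' \
            --schema "${entity}:string" \
            --project_id openalex-demo \
            openalex.${entity}s "$data_file";
    done
done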
Step 4: Run your queries!
Now you have all the OpenAlex data in a place where you can do anything you want with it, using BigQuery JSON functions through bq query or the BigQuery console.
Here's a simple one, extracting the OpenAlex ID and OA status for each work:
select
json_value(work, '$.id') as work_id,
json_value(work, '$.open_access.is_oa') as is_oa
from
`openalex-demo.openalex.works`;
It will give you one row per work, with the work's OpenAlex ID and its OA status (the full result will run to millions of rows).
You can run queries like this directly in your shell:
bq query \
--project_id=openalex-demo \
--use_legacy_sql=false \
"select json_value(work, '$.id') as work_id, json_value(work, '$.open_access.is_oa') as is_oa from openalex.works;"
But even simple queries are hard to read and edit this way. It's better to write them in a file than directly on the command line. Here's an example of a slightly more complex query, finding the author with the most open access works of all time:
with work_authorships_oa as (
select
json_value(work, '$.id') as work_id,
json_query_array(work, '$.authorships') as authorships,
cast(json_value(work, '$.open_access.is_oa') as BOOL) as is_oa
from `openalex-demo.openalex.works`
), flat_authorships as (
select work_id, authorship, is_oa
from work_authorships_oa,
unnest(authorships) as authorship
)
select
json_value(authorship, '$.author.id') as author_id,
count(distinct work_id) as num_oa_works
from flat_authorships
where is_oa
group by author_id
order by num_oa_works desc
limit 1;
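To run it, save the query to a file (oa_author.sql is just a name we made up here) and pass its contents to bq query:
bq query \
    --project_id=openalex-demo \
    --use_legacy_sql=false \
    "$(cat oa_author.sql)"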
We get one result:
author_id: https://openalex.org/A2798520857
num_oa_works: 3297
Checking out https://api.openalex.org/authors/A2798520857, we see that this is Ashok Kumar at Manipal University Jaipur.
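You can make the same check from the shell, if you have curl and jq installed (a quick sketch; jq is only there for readability):
curl -s 'https://api.openalex.org/authors/A2798520857' | jq '.display_name'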