Load to a data warehouse

In many data warehouse and document store applications, you can load the OpenAlex entities as-is and query them directly. We’ll use BigQuery as an example here. (Elasticsearch docs coming soon). To follow along you’ll need the Google Cloud SDK. You’ll also need a Google account that can make BigQuery tables that are, well… big. Which means it probably won’t be free.

We'll show you how to do this in 4 steps:

  1. Create a BigQuery Project and Dataset to hold your tables

  2. Create the tables that will hold your entity JSON records

  3. Copy the data files to the tables you created

  4. Run some queries on the data you loaded

This guide will have you load each entity to a single text column, then use BigQuery's JSON functions to parse them when you run your queries. This is convenient but inefficient since each object has to be parsed every time you run a query.

This project, kindly shared by @DShvadron, takes a more efficient approach: https://github.com/DrorSh/openalex_to_gbq

Separating the entity data into multiple columns takes more work up front but lets you write queries that are faster, simpler, and often cheaper.
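For a sense of what that looks like in practice, here is a minimal sketch. The works_structured table name and the handful of extracted fields are illustrative choices, not part of the shared project; it assumes the raw works table from Step 3 below already exists, and parses the JSON a single time into ordinary columns:

bq query \
--project_id=openalex-demo \
--use_legacy_sql=false \
"create table openalex.works_structured as
select
    json_value(work, '$.id') as id,
    json_value(work, '$.display_name') as display_name,
    cast(json_value(work, '$.publication_year') as INT64) as publication_year,
    cast(json_value(work, '$.open_access.is_oa') as BOOL) as is_oa
from openalex.works;"

Later queries can then read works_structured directly instead of re-parsing the JSON text on every run.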

Snowflake users can connect to a ready-to-query data set on the marketplace, helpfully maintained by Util: https://app.snowflake.com/marketplace/listing/GZT0ZOMX4O7

Step 1: Create a BigQuery Project and Dataset

In BigQuery, you need a Project and a Dataset to hold your tables. We’ll call the project “openalex-demo” and the dataset “openalex”. Follow the BigQuery documentation to create the Project, then create the Dataset inside it:

bq mk openalex-demo:openalex

Dataset 'openalex-demo:openalex' successfully created

Step 2: Create tables for each entity type

Now we’ll create tables inside the dataset. There will be six tables, one for each entity type. Since we’re using JSON, each table will have just one text column, named after the entity it holds.

bq mk --table openalex-demo:openalex.works work:string

Table 'openalex-demo:openalex.works' successfully created.

bq mk --table openalex-demo:openalex.authors author:string

Table 'openalex-demo:openalex.authors' successfully created

and so on for sources, institutions, concepts, and publishers.
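If you’d rather script it, a small sketch of the same bq mk pattern for the remaining entity types (table names use the plural entity name, columns the singular form, matching works/work and authors/author above) is:

for entity in sources institutions concepts publishers;
do
    # strip the trailing "s" so the column name matches the convention above
    bq mk --table "openalex-demo:openalex.${entity}" "${entity%s}:string";
done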

Step 3: Load the data files

We’ll load each table’s data from the JSON Lines files we downloaded earlier. For works, the files were:

  • openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz

  • openalex-snapshot/data/works/updated_date=2021-12-28/0001_part_00.gz

Here’s a command to load one works file (don’t run it yet):

bq load \
--project_id openalex-demo \
--source_format=CSV -F '\t' \
--schema 'work:string' \
openalex.works \
'openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz'

See the full documentation for the bq load command here: https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load

This part of the command may need some explanation:

--source_format=CSV -F '\t' --schema 'work:string'

BigQuery is expecting multiple columns with predefined datatypes (a “schema”). We’re tricking it into accepting a single text column (--schema 'work:string') by specifying CSV format (--source_format=CSV) with a column delimiter that isn’t present in the file (-F '\t') (\t means “tab”).
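If you want to double-check that assumption on your copy of the snapshot, a quick sanity check against a single file (expecting a count of 0) might look like:

# count lines containing a literal tab; grep exits non-zero on zero matches, hence "|| true"
gunzip -c openalex-snapshot/data/works/updated_date=2021-12-28/0000_part_00.gz | grep -c $'\t' || true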

bq load can only handle one file at a time, so you must run this command once per file. But remember that the real dataset will have many more files than this example does, so it's impractical to copy, edit, and rerun the command each time. It's easier to handle all the files in a loop, like this:

for data_file in openalex-snapshot/data/works/*/*.gz;
do
    bq load --source_format=CSV -F '\t' \
        --schema 'work:string' \
        --project_id openalex-demo \
        openalex.works $data_file;
done
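The same loop extends to the other entity types. Assuming the snapshot directory names match the plural entity names (as they do for works above) and the table and column naming from Step 2, a sketch that loads everything in one pass is:

for entity in works authors sources institutions concepts publishers;
do
    for data_file in openalex-snapshot/data/${entity}/*/*.gz;
    do
        # strip the trailing "s" so the schema column matches the entity name
        bq load --source_format=CSV -F '\t' \
            --schema "${entity%s}:string" \
            --project_id openalex-demo \
            "openalex.${entity}" "$data_file";
    done
done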

This step is slow. How slow depends on your upload speed, but for Author and Work we're talking hours, not minutes.

You can speed this up by using parallel or other tools to run multiple upload commands at once. If you do, watch out for errors caused by hitting BigQuery quota limits.

Do this once per entity type, substituting each entity name for work/works as needed. When you’re finished, you’ll have six tables that look like this:

[Screenshot: two rows of the works table in the BigQuery console]

Now you have all the OpenAlex data in a place where you can do anything you want with it, using BigQuery JSON functions through bq query or the BigQuery console.

Step 4: Run your queries!

Here’s a simple one, extracting the OpenAlex ID and OA status for each work:

select 
    json_value(work, '$.id') as work_id, 
    json_value(work, '$.open_access.is_oa') as is_oa
from
    `openalex-demo.openalex.works`;

It will give you a list of work IDs and their open access status (this is a truncated sample; the real result will be millions of rows):

work_id                              is_oa
https://openalex.org/W2741809807     TRUE
https://openalex.org/W1491283979     FALSE
https://openalex.org/W1491315632     FALSE

You can run queries like this directly in your shell:

bq query \
--project_id=openalex-demo \
--use_legacy_sql=false \
"select json_value(work, '$.id') as work_id, json_value(work, '$.open_access.is_oa') as is_oa from openalex.works;"

But even simple queries are hard to read and edit this way. It’s better to write them in a file than directly on the command line. Here’s an example of a slightly more complex query - finding the author with the most open access works of all time:

with work_authorships_oa as (
   select
       json_value(work, '$.id') as work_id,
       json_query_array(work, '$.authorships') as authorships,
       cast(json_value(work, '$.open_access.is_oa') as BOOL) as is_oa
   from `openalex-demo.openalex.works`
), flat_authorships as (
   select work_id, authorship, is_oa
   from work_authorships_oa,
   unnest(authorships) as authorship
)
select 
    json_value(authorship, '$.author.id') as author_id,
    count(distinct work_id) as num_oa_works
from flat_authorships
where is_oa
group by author_id
order by num_oa_works desc
limit 1;

We get one result:

author_id                          num_oa_works
https://openalex.org/A2798520857   3297

Checking out https://api.openalex.org/authors/A2798520857, we see that this is Ashok Kumar at Manipal University Jaipur.
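Finally, a note on longer queries: rather than pasting SQL into the shell as above, you can keep it in a file. Assuming the query above is saved as top_oa_author.sql (an illustrative name), a sketch of running it is as follows; bq query reads the statement from standard input when no query is given as an argument:

bq query \
--project_id=openalex-demo \
--use_legacy_sql=false \
< top_oa_author.sql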
