Example Dataset

This user guide series uses an example dataset describing more than 5,000 Medium articles, published in notable publications between January 2020 and August 2020.

Acquire the dataset

The dataset is stored in a publicly accessible S3 bucket and is available in both CSV and JSON formats. To download it, run the following commands:

# Get a CSV version of the dataset
curl \
--output medium_articles_2020_dpr.csv

# Get a JSON version of the dataset
curl \
--output medium_articles_2020_dpr.json

Expected output

# Output
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100 60.4M  100 60.4M    0     0  12.8M      0  0:00:04  0:00:04 --:--:-- 12.9M

For more details about the dataset, see its introduction page on Kaggle.

The JSON version of the dataset has a structure resembling:

{
    "root": [
        {
            "id": ...,
            "title_vector": ...,
            "title": ...,
            "link": ...,
            "reading_time": ...,
            "publication": ...,
            "claps": ...,
            "responses": ...
        },
        ...
    ]
}
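
Before defining a collection schema, it can help to confirm that the downloaded file matches this structure. The following is a minimal Python sketch, not part of the original guide. It assumes the records sit in a list under the top-level key "root" and that each record carries the eight field names listed above. The helper names load_records and missing_fields are hypothetical; adjust the key and field names if your copy of the file differs.

```python
import json
from pathlib import Path

# Field names taken from the structure shown above (assumed, not verified
# against the actual file).
EXPECTED_FIELDS = {
    "id", "title_vector", "title", "link",
    "reading_time", "publication", "claps", "responses",
}


def load_records(path, root_key="root"):
    """Load the JSON file and return the list of records under root_key."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    records = data[root_key]
    if not isinstance(records, list):
        raise ValueError(f"expected a list under {root_key!r}")
    return records


def missing_fields(record):
    """Return the expected field names that a record lacks, sorted."""
    return sorted(EXPECTED_FIELDS - set(record))
```

For example, calling load_records("medium_articles_2020_dpr.json") and passing the first record to missing_fields should return an empty list if the file matches the structure shown.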

Dataset schema

Each record in the dataset has eight fields. Use the following table as a reference when defining the schema for your collection.

Field Name      Type            Attributes
title_vector    FLOAT_VECTOR    Dimension: 768
title           VARCHAR         Max length: 512
link            VARCHAR         Max length: 512
publication     VARCHAR         Max length: 512