Example Dataset
In this user guide series, we'll explore an example dataset containing details of over 5,000 Medium articles. The articles were published between January and August 2020 in notable publications.
Acquire the dataset
The dataset resides in a publicly accessible S3 storage bucket and is available in both CSV and JSON formats. To fetch it, run the following commands:
# Get a CSV version of the dataset
curl https://assets.zilliz.com/medium_articles_2020_dpr_a13e0377ae.csv \
--output medium_articles_2020_dpr.csv
# Get a JSON version of the dataset
curl https://assets.zilliz.com/medium_articles_2020_dpr_a13e0377ae.json \
--output medium_articles_2020_dpr.json
Expected output
# Output
#
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100 60.4M  100 60.4M    0     0  12.8M      0  0:00:04  0:00:04 --:--:-- 12.9M
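If curl is not available, the same download can be scripted. The following is a minimal sketch using only the Python standard library; the `download` helper and the `BASE_URL` constant are illustrative names, not part of any official tooling, and the URLs are the ones from the curl commands above.

```python
import shutil
import urllib.request
from pathlib import Path

# Common prefix of the two dataset URLs used in the curl commands.
BASE_URL = "https://assets.zilliz.com/medium_articles_2020_dpr_a13e0377ae"


def download(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # copyfileobj streams in chunks, so the file is never held fully in memory.
    with urllib.request.urlopen(url) as response, dest_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example calls (left commented out because the files are large):
# download(BASE_URL + ".csv", "medium_articles_2020_dpr.csv")
# download(BASE_URL + ".json", "medium_articles_2020_dpr.json")
```

Streaming the response to disk keeps memory use flat regardless of file size, which matters for files in the tens of megabytes.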
For a comprehensive understanding of the dataset, refer to its introduction page on Kaggle.
The JSON version of the dataset has a structure resembling:
{
    "root": [
        {
            "id": ...
            "title_vector": ...
            "title": ...
            "link": ...
            "reading_time": ...
            "publication": ...
            "claps": ...
            "responses": ...
        },
        ...
    ]
}
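To make the structure concrete, here is a minimal sketch of reading the JSON file in Python. It assumes the file was saved as `medium_articles_2020_dpr.json` and that all records sit in the list under the top-level `"root"` key, as outlined above; `load_records` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_records(path: str) -> list:
    """Return the list of article records stored under the "root" key."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    return data["root"]


# Example usage once the file has been downloaded:
# records = load_records("medium_articles_2020_dpr.json")
# print(len(records))        # number of articles
# print(sorted(records[0]))  # field names of the first record
```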
Dataset schema
Each record in the dataset has eight attributes. Familiarize yourself with this structure, as it will guide you when you define the schema for your collection.
Field Name | Type | Attributes |
---|---|---|
id | INT64 | N/A |
title_vector | FLOAT_VECTOR | Dimension: 768 |
title | VARCHAR | Max length: 512 |
link | VARCHAR | Max length: 512 |
reading_time | INT64 | N/A |
publication | VARCHAR | Max length: 512 |
claps | INT64 | N/A |
responses | INT64 | N/A |
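The table can also be written down as data and used to sanity-check records before loading them anywhere. The sketch below is plain Python and does not use any database client; the `SCHEMA` dictionary and `check_record` helper are illustrative names. As a conservative assumption, VARCHAR lengths are measured in UTF-8 bytes rather than characters.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# The schema table above, expressed as a dictionary.
SCHEMA = {
    "id": {"type": "INT64"},
    "title_vector": {"type": "FLOAT_VECTOR", "dim": 768},
    "title": {"type": "VARCHAR", "max_length": 512},
    "link": {"type": "VARCHAR", "max_length": 512},
    "reading_time": {"type": "INT64"},
    "publication": {"type": "VARCHAR", "max_length": 512},
    "claps": {"type": "INT64"},
    "responses": {"type": "INT64"},
}


def _is_number(x) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record fits the table."""
    problems = []
    for name, spec in SCHEMA.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if spec["type"] == "INT64":
            ok = (isinstance(value, int) and not isinstance(value, bool)
                  and INT64_MIN <= value <= INT64_MAX)
        elif spec["type"] == "FLOAT_VECTOR":
            ok = (isinstance(value, list) and len(value) == spec["dim"]
                  and all(_is_number(x) for x in value))
        else:  # VARCHAR: assumed limit in UTF-8 bytes
            ok = (isinstance(value, str)
                  and len(value.encode("utf-8")) <= spec["max_length"])
        if not ok:
            problems.append(f"{name}: does not match {spec}")
    return problems
```

Running `check_record` over every loaded record before building a collection surfaces mismatches, such as a vector with the wrong dimension, early and with a readable message.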