Import from a Parquet file
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It offers high-performance compression and encoding schemes to manage complex data in bulk and is supported in various programming languages and analytics tools.
You are advised to use the BulkWriter tool to prepare your raw data into Parquet files. The following figure demonstrates how your raw data can be mapped into a Parquet file.
- Whether to enable AutoID
The id field serves as the primary field of the collection. To make the primary field automatically increment, you can enable AutoID in the schema. In this case, you should exclude the id field from each row in the source data.
- Whether to enable dynamic fields
When the target collection enables dynamic fields, if you need to store fields that are not included in the pre-defined schema, you can specify the $meta column during the write operation and provide the corresponding key-value data.
- Case-sensitive
Dictionary keys and collection field names are case-sensitive. Ensure that the dictionary keys in your data exactly match the field names in the target collection. If there is a field named id in the target collection, each entity dictionary should have a key named id. Using ID or Id results in errors.
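The three considerations above can be checked on each entity dictionary before any Parquet file is written. The following is a minimal sketch in plain Python; it is not part of BulkWriter or any Zilliz SDK. The helper name `check_row` and its parameters are hypothetical, and it assumes that fields outside the schema belong under the `$meta` key, as described above.

```python
def check_row(row, schema_fields, primary_field="id", auto_id=False, dynamic=False):
    """Return a list of problems found in one entity dictionary.

    Hypothetical helper for illustration only; not a BulkWriter API.
    It does not check for missing fields or validate data types.
    """
    problems = []
    expected = set(schema_fields)
    if auto_id:
        # With AutoID enabled, the primary field must be left out of the row.
        expected.discard(primary_field)
        if primary_field in row:
            problems.append(f"drop '{primary_field}': AutoID generates it")

    # Keys are matched exactly; the lowercase map only helps suggest a fix.
    by_lower = {name.lower(): name for name in expected}
    for key in row:
        if key in expected or (auto_id and key == primary_field):
            continue
        if key == "$meta":
            if not dynamic:
                problems.append("'$meta' requires dynamic fields to be enabled")
        elif key.lower() in by_lower:
            problems.append(f"rename '{key}' to '{by_lower[key.lower()]}'")
        elif dynamic:
            problems.append(f"move '{key}' under '$meta'")
        else:
            problems.append(f"'{key}' is not in the schema")
    return problems


# 'ID' does not match the schema field 'id' because keys are case-sensitive.
print(check_row({"ID": 1, "vector": [0.1, 0.2]}, ["id", "vector"]))
# ["rename 'ID' to 'id'"]
```

An empty list means the row passed these three checks.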
Directory structure
If you prefer to prepare your data into Parquet files, place all Parquet files directly into the source data folder as shown in the tree diagram below.
├── parquet-folder
│ ├── 1.parquet
│ └── 2.parquet
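Before uploading, it can help to confirm that a local staging folder has the flat layout shown above. The sketch below assumes the files are staged in a local directory first. The function name `non_parquet_entries` is hypothetical, and the demo folder is a throwaway created only for illustration.

```python
import tempfile
from pathlib import Path


def non_parquet_entries(folder):
    """List entries in `folder` that are not plain .parquet files.

    Subfolders are reported too, because the Parquet files are expected
    to sit directly in the source data folder.
    """
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_dir() or entry.suffix != ".parquet"
    )


# Usage: build a throwaway folder that mirrors the tree above, plus two strays.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "parquet-folder"
    root.mkdir()
    for name in ("1.parquet", "2.parquet", "notes.txt"):
        (root / name).touch()
    (root / "nested").mkdir()
    leftovers = non_parquet_entries(root)

print(leftovers)  # ['nested', 'notes.txt']
```

An empty result means the folder contains only Parquet files at the top level.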
Import data
Once your data is ready, you can use any of the following methods to import it into your Zilliz Cloud collection.
If your files are relatively small, it is recommended to use the folder or multiple-path method to import them all at once. This approach allows for internal optimizations during the import process, which helps reduce resource consumption later.
You can also import your data on the Zilliz Cloud console or using Milvus SDKs. For details, refer to Import Data (Console) and Import Data (SDK).
Import files from multiple paths (Recommended)
When importing files from multiple paths, include each Parquet file path in a separate list, then group all the lists into a higher-level list as in the following code example.
curl --request POST \
    --url "https://api.cloud.zilliz.com/v2/vectordb/jobs/import/create" \
    --header "Authorization: Bearer ${TOKEN}" \
    --header "Accept: application/json" \
    --header "Content-Type: application/json" \
    -d '{
        "clusterId": "inxx-xxxxxxxxxxxxxxx",
        "collectionName": "medium_articles",
        "partitionName": "",
        "objectUrls": [
            ["s3://bucket-name/parquet-folder-1/1.parquet"],
            ["s3://bucket-name/parquet-folder-2/1.parquet"],
            ["s3://bucket-name/parquet-folder-3/"]
        ],
        "accessKey": "",
        "secretKey": ""
    }'
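The nested `objectUrls` structure is easy to get wrong when the path list is built programmatically. The following sketch assembles the same request body in Python and only prints it. It does not send the request, and the helper name `build_import_body` is hypothetical rather than part of any SDK. The field names are taken from the curl example above.

```python
import json


def build_import_body(cluster_id, collection_name, paths,
                      partition_name="", access_key="", secret_key=""):
    """Build the JSON body for an import request (illustrative helper)."""
    return {
        "clusterId": cluster_id,
        "collectionName": collection_name,
        "partitionName": partition_name,
        # Each path goes into its own inner list; the outer list groups them.
        "objectUrls": [[path] for path in paths],
        "accessKey": access_key,
        "secretKey": secret_key,
    }


body = build_import_body(
    "inxx-xxxxxxxxxxxxxxx",
    "medium_articles",
    [
        "s3://bucket-name/parquet-folder-1/1.parquet",
        "s3://bucket-name/parquet-folder-2/1.parquet",
        "s3://bucket-name/parquet-folder-3/",
    ],
)
print(json.dumps(body, indent=2))
```

Passing a single file path or a single folder path yields the one-element form of `objectUrls`.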
Import files from a folder
If the source folder contains only the Parquet files to import, you can simply include the source folder in the request as follows:
curl --request POST \
    --url "https://api.cloud.zilliz.com/v2/vectordb/jobs/import/create" \
    --header "Authorization: Bearer ${TOKEN}" \
    --header "Accept: application/json" \
    --header "Content-Type: application/json" \
    -d '{
        "clusterId": "inxx-xxxxxxxxxxxxxxx",
        "collectionName": "medium_articles",
        "partitionName": "",
        "objectUrls": [
            ["s3://bucket-name/parquet-folder/"]
        ],
        "accessKey": "",
        "secretKey": ""
    }'
Import a single file
If your prepared data file is a single Parquet file, import it as demonstrated in the following code example.
curl --request POST \
    --url "https://api.cloud.zilliz.com/v2/vectordb/jobs/import/create" \
    --header "Authorization: Bearer ${TOKEN}" \
    --header "Accept: application/json" \
    --header "Content-Type: application/json" \
    -d '{
        "clusterId": "inxx-xxxxxxxxxxxxxxx",
        "collectionName": "medium_articles",
        "partitionName": "",
        "objectUrls": [
            ["s3://bucket-name/parquet-folder/1.parquet"]
        ],
        "accessKey": "",
        "secretKey": ""
    }'
Storage paths
Zilliz Cloud supports data import from your cloud storage. The table below lists the possible storage paths for your data files.
Cloud | Quick Examples |
---|---|
AWS S3 | |
Google Cloud Storage | |
Azure Blob | |
Limits
There are some limits you need to observe when you import data in the Parquet format from your cloud storage.
Item | Description |
---|---|
Multiple files per import | Yes |
Maximum file size per import | Free cluster: 512 MB in total; Serverless & Dedicated cluster: |
Applicable data file locations | Remote files only |
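The free-cluster limit in the table applies to the combined size of all files in one import, so a quick pre-check can sum the file sizes before a job is submitted. This is a rough sketch only. It assumes 1 MB means 1024 * 1024 bytes, which the table does not specify, and the function name `fits_free_cluster` is hypothetical. Limits for Serverless and Dedicated clusters are not covered.

```python
# Assumption: 1 MB = 1024 * 1024 bytes; the limits table does not say which unit is meant.
FREE_CLUSTER_LIMIT_BYTES = 512 * 1024 * 1024


def fits_free_cluster(file_sizes):
    """Return True if the combined size (in bytes) is within 512 MB."""
    return sum(file_sizes) <= FREE_CLUSTER_LIMIT_BYTES


MB = 1024 * 1024
print(fits_free_cluster([300 * MB, 200 * MB]))           # True  (500 MB in total)
print(fits_free_cluster([300 * MB, 200 * MB, 13 * MB]))  # False (513 MB in total)
```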
Prepared sample data based on the schema in the above diagram is available for download.