Our Archive API allows you to access and retrieve Data Snapshots from Bright Data’s cached data collections in a seamless and efficient method.
To access this API, you will need a Bright Data API key
Run a Search
To initiate a search of our Archive, use the following /search
endpoint.
Endpoint : POST api.brightdata.com/webarchive/search
Request Response Code Example Dictionary POST api . brightdata . com / webarchive / search
{
filters : {
max_age ?: Duration ,
min_date ?: yyyy - mm - dd ,
max_date ?: yyyy - mm - dd ,
domain_whitelist ?: [ 'example.com' ],
domain_blacklist ?: [ 'example.com' ],
domain_regex_whitelist ?: [ '.*example..*' ],
domain_regex_blacklist ?: [ '.*example..*' ],
category_whitelist ?: [ 'Automotive' ],
category_blacklist ?: [ 'Automotive' ],
path_regex_whitelist ?: [ '.*/products/.*' ],
path_regex_blacklist ?: [ '.*/products/.*' ],
language_whitelist ?: [ 'eng' ], // ISO 639-3 letter language codes
language_blacklist ?: [ 'eng' ],
ip_country_whitelist ?: [ 'us' , 'ie' , 'in' ],
ip_country_blacklist ?: [ 'mx' , 'ae' , 'ca' ],
captcha ?: true ,
robots_block ?: true ,
}
}
POST api . brightdata . com / webarchive / search
{
filters : {
max_age ?: Duration ,
min_date ?: yyyy - mm - dd ,
max_date ?: yyyy - mm - dd ,
domain_whitelist ?: [ 'example.com' ],
domain_blacklist ?: [ 'example.com' ],
domain_regex_whitelist ?: [ '.*example..*' ],
domain_regex_blacklist ?: [ '.*example..*' ],
category_whitelist ?: [ 'Automotive' ],
category_blacklist ?: [ 'Automotive' ],
path_regex_whitelist ?: [ '.*/products/.*' ],
path_regex_blacklist ?: [ '.*/products/.*' ],
language_whitelist ?: [ 'eng' ], // ISO 639-3 letter language codes
language_blacklist ?: [ 'eng' ],
ip_country_whitelist ?: [ 'us' , 'ie' , 'in' ],
ip_country_blacklist ?: [ 'mx' , 'ae' , 'ca' ],
captcha ?: true ,
robots_block ?: true ,
}
}
curl -X POST https://api.brightdata.com/webarchive/search \
-H "Authorization: Bearer $API_KEY " \
-H 'Content-Type: application/json' \
--data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'
Here is a brief explanation of each of the parameters you are able to use in your requests:
Parameter Description max_age Limits results to records collected within a specified time range. min_date Returns records collected on or after the specified date. max_date Returns records collected on or before the specified date. domain_whitelist Includes results only from listed domains. domain_blacklist Excludes results from listed domains. category_whitelist Includes results only from specified categories. category_blacklist Excludes results from specified categories. path_regex_whitelist Includes results only matching the specified path regex. path_regex_blacklist Excludes results matching the specified path regex. language_whitelist Includes results only for specific language codes (ISO 639-3). language_blacklist Excludes results for specific language codes. ip_country_whitelist Includes results collected through IPs or peers only from specified countries. ip_country_blacklist Excludes results collected through IPs or peers from specified countries. captcha Return only results with captcha triggered robots_block Return only results with robots block
You can run up to 100 searches per day without triggering a dump.
Once you trigger a dump, that search no longer count against your limit.
Get Search Status
To check the status of a specific query that was made.
Endpoint : GET api.brightdata.com/webarchive/search/<search_id>
When successful it will retrieve:
The number of entries for your query
The estimated size and cost of the full Data Snapshot
Request Response Code Example GET api . brightdata . com / webarchive / search /< search_id >
GET api . brightdata . com / webarchive / search /< search_id >
The status code of all three following responses is 200 OK
{
search_id : "ID" ,
status : "in_progress"
}
curl https://api.brightdata.com/webarchive/search/ $SEARCH_ID \
-H "Authorization: Bearer $API_KEY "
Get All Search Statuses
Check the status of all current searches.
Endpoint : GET api.brightdata.com/webarchive/searches
Request Response Code Example GET api . brightdat . com / webarchive / searches
GET api . brightdat . com / webarchive / searches
[
{
search_id: "ID" ,
status: "in_progress"
},
{
search_id: "ID" ,
status: "done"
},
// ... + rest of the searches and status
}
curl https://api.brightdata.com/webarchive/searches \
-H "Authorization: Bearer $API_KEY "
How data range affects delivery time
If your query is matching data within last 72h - your snapshot will start processing/delivering immediately.
If some of your matched data is older than 72h - it needs to be retrieved from a colder archive before delivery and it may take up to 72h .
We recommend using max_age
= 1d
for initial testing.
Deliver Snapshot to Amazon S3 Storage
To use S3 storage delivery, you will first need to do the following:
Create an AWS role which gives Bright Data access to your system.
During this setup, you will be asked by Amazon for an “external ID” that is used with the role.
Your external ID for S3 is your Bright Data Account ID that can be found within Account Settings
Once a role is created, you will need to allow our system delivery role to AssumeRole
that role.
Our system delivery role is: arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery
To deliver a specific Snapshot from a specific search_id
to S3 storage, use the following /dump
endpoint.
Endpoint : POST api.brightdata.com/webarchive/dump
Request Response Code Example POST api . brightdata . com / webarchive / dump
{
search_id : < search_id > ,
max_entries?: 1000000, // (optional) limit how many files you purchase
delivery: {
strategy : 's3' ,
settings : {
bucket: < your_bucket_name > ,
assume_role: {
role_arn : < role_you_created_above > ,
},
},
},
}
POST api . brightdata . com / webarchive / dump
{
search_id : < search_id > ,
max_entries?: 1000000, // (optional) limit how many files you purchase
delivery: {
strategy : 's3' ,
settings : {
bucket: < your_bucket_name > ,
assume_role: {
role_arn : < role_you_created_above > ,
},
},
},
}
curl -X POST https://api.brightdata.com/webarchive/dump \
-H "Authorization: Bearer $API_KEY " \
-H 'Content-Type: application/json' \
--data @- << EOF
{
"search_id": " $SEARCH_ID ",
"max_entries": 1000000,
"delivery": {
"strategy": "s3",
"settings": {
"bucket": " $YOUR_BUCKET_NAME ",
"assume_role": {
"role_arn": " $YOUR_DELIVERY_ROLE "
}
}
}
}
EOF
Collect Snapshot via Webhook
Collect a Data Snapshot via webhook from a specific search_id
Endpoint : POST api.brightdata.com/webarchive/dump
Request Response Code Example {
search_id : < search_id > ,
max_entries?: 1000000,
delivery: {
strategy : 'webhook' ,
settings : {
url: string (),
auth? : string (), // will be added as an Authorization header
},
}
}
{
search_id : < search_id > ,
max_entries?: 1000000,
delivery: {
strategy : 'webhook' ,
settings : {
url: string (),
auth? : string (), // will be added as an Authorization header
},
}
}
curl -X POST https://api.brightdata.com/webarchive/dump \
-H "Authorization: Bearer $API_KEY " \
-H 'Content-Type: application/json' \
--data @- << EOF
{
"search_id": " $SEARCH_ID ",
"max_entries": 1000000,
"delivery": {
"strategy": "webhook",
"settings": {
"url": " $YOUR_WEBHOOK_URL "
}
}
}
EOF
If you’re running a linux/macos machine, you can simulate one of our delivery webhooks with the code on this page .
Get Status of Data Snapshot
Check the status of a specific Data Snapshot (dump) using the dump_id.
Endpoint : GET api.brightdata.com/webarchive/dump/<dump_id>
Request Response Code Example GET api . brightdata . com / webarchive / dump /< dump_id >
GET api . brightdata . com / webarchive / dump /< dump_id >
The status code of all three following responses is 200 OK
In progress
Finished
Failed
{
dump_id : < id > ,
status: 'in_progress',
batches_total: 130,
batches_uploaded: 29,
files_total: 1241241251,
estimate_finish: ISODate
}
curl https://api.brightdata.com/webarchive/dump/ $DUMP_ID \
-H "Authorization: Bearer $API_KEY "
Get the Status of all Data Snapshots
Endpoint : GET api.brightdata.com/webarchive/dumps
[
{
dump_id: 'ID' ,
status: 'in_progress' ,
batches_total: 130 ,
batches_uploaded: 29 ,
files_total: 1241241251 ,
estimate_finish: Date
},
{
dump_id: 'ID' ,
status: 'done' ,
batches_total: 130 ,
files_total: 1241241251 ,
files_uploaded: 2412515 ,
completed_at: Date
}
// ... rest of the dumps
]
[
{
dump_id: 'ID' ,
status: 'in_progress' ,
batches_total: 130 ,
batches_uploaded: 29 ,
files_total: 1241241251 ,
estimate_finish: Date
},
{
dump_id: 'ID' ,
status: 'done' ,
batches_total: 130 ,
files_total: 1241241251 ,
files_uploaded: 2412515 ,
completed_at: Date
}
// ... rest of the dumps
]
curl https://api.brightdata.com/webarchive/dumps \
-H "Authorization: Bearer $API_KEY "
High-level process flow diagram