Common Crawl URL index for August 2019 with Last-Modified timestamps

This dataset is a complete set of augmented index files for CC-MAIN-2019-35 [1]. About 18% of the entries carry one additional field, lastmod, which gives the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp. This enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields are unchanged in the augmented index, so they can still be used to retrieve records from the original WARC files.

[1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
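As an illustration of how the lastmod, filename, offset and length fields might be used together, here is a minimal Python sketch. It assumes that each index line follows the usual Common Crawl CDX layout (URL key, capture timestamp, then a JSON object) and that lastmod is stored as an integer or a numeric string; neither assumption is confirmed by this page. The sample line, its file path and all function names are invented for illustration.

```python
import gzip
import json
import urllib.request
from datetime import datetime, timezone

# Public Common Crawl download host; filename values are relative to it.
DATA_PREFIX = "https://data.commoncrawl.org/"

def parse_index_line(line):
    """Split one index line into (urlkey, capture timestamp, field dict).

    Assumes the layout 'urlkey timestamp {json}'.
    """
    urlkey, capture, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, capture, json.loads(payload)

def lastmod_datetime(fields):
    """Return lastmod as a timezone-aware UTC datetime, or None if absent."""
    if "lastmod" not in fields:
        return None  # only about 18% of entries carry the field
    return datetime.fromtimestamp(int(fields["lastmod"]), tz=timezone.utc)

def byte_range(fields):
    """Build the HTTP Range header value covering exactly one WARC record."""
    offset, length = int(fields["offset"]), int(fields["length"])
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(fields):
    """Download and decompress one WARC record (needs network access)."""
    request = urllib.request.Request(
        DATA_PREFIX + fields["filename"],
        headers={"Range": byte_range(fields)},
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())

# Hypothetical index line: the path and all values are made up.
SAMPLE = (
    'org,example)/ 20190817000000 {"url": "https://example.org/", '
    '"filename": "crawl-data/CC-MAIN-2019-35/segments/EXAMPLE/warc/EXAMPLE.warc.gz", '
    '"offset": "1000", "length": "250", "lastmod": "1565000000"}'
)

if __name__ == "__main__":
    _, capture, fields = parse_index_line(SAMPLE)
    print(capture, lastmod_datetime(fields), byte_range(fields))
```

Comparing the capture timestamp with the lastmod value gives, for each entry that has one, an estimate of how long the resource had been unchanged when it was crawled. fetch_record is not called here because the sample path does not point at a real file.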

Additional Info

Field Value
Contact point ht@inf.ed.ac.uk
Dataset privacy Public
Landing Page https://doi.org/10.48550/arXiv.2404.09770
Provenance Combines material from Common Crawl dataset CC-MAIN-2019-35 (see 'related dataset' metadata below): a) the columnar index and b) Last-Modified header values from those Response records in the WARC component that have one
A related dataset from which this dataset is derived https://commoncrawl.org/blog/august-2019-crawl-archive-now-available