Common Crawl URL index for August 2019 with Last-Modified timestamps

This dataset is a complete set of augmented index files for CC-MAIN-2019-35 [1]. The augmented index adds one field, lastmod, to about 18% of the entries. This field gives the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp, which enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields are unchanged from the original index, so they can still be used to retrieve records from the original WARC files.

[1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
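
As an illustration of how the lastmod, filename, offset and length fields might be used, here is a minimal Python sketch. It rests on several assumptions not stated in this record: that each index line has the form "urlkey timestamp {JSON}", that lastmod sits inside the JSON object as a string of seconds since the epoch, and that offset and length are decimal strings. The sample line and its WARC path are made up, and the helper names are hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_index_line(line):
    """Split one index line into (urlkey, capture timestamp, JSON fields).

    Assumes the layout 'urlkey timestamp {json}'; this is an assumption
    about the file format, not something stated in the dataset record.
    """
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def lastmod_as_datetime(fields):
    """Return lastmod as a UTC datetime, or None when the entry has none.

    Only about 18% of entries carry the field, so absence is the common case.
    """
    raw = fields.get("lastmod")
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def byte_range_header(fields):
    """Build an HTTP Range header selecting exactly one WARC record."""
    offset = int(fields["offset"])
    length = int(fields["length"])
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Made-up entry for illustration only; the filename is not a real WARC path.
SAMPLE_LINE = (
    'org,example)/ 20190817000000 '
    '{"url": "https://example.org/", "status": "200", '
    '"length": "1000", "offset": "5000", '
    '"filename": "crawl-data/EXAMPLE/example.warc.gz", '
    '"lastmod": "1546300800"}'
)

if __name__ == "__main__":
    _, _, fields = parse_index_line(SAMPLE_LINE)
    print(lastmod_as_datetime(fields))
    print(fields["filename"], byte_range_header(fields))
```

The Range header would be sent with a GET request for the file named in filename, relative to wherever the original WARC files are hosted. The response body is then a single gzip member containing one WARC record.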

Additional Info

Contact point: ht@inf.ed.ac.uk
Dataset privacy: Public
Landing page: https://doi.org/10.48550/arXiv.2404.09770
Provenance: Combines material from Common Crawl dataset CC-MAIN-2019-35 (see 'related dataset' metadata below): a) the columnar index and b) Last-Modified header values from those Response records having one in the WARC component
Related dataset from which this dataset is derived: https://commoncrawl.org/blog/august-2019-crawl-archive-now-available