Common Crawl URL index for August 2019 with Last-Modified timestamps

This dataset consists of a complete set of augmented index files for CC-MAIN-2019-35 [1]. About 18% of the entries carry one additional field, lastmod, which gives the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp. This enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields are unchanged from the original index, so they can still be used for retrieval from the original WARC files.
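
As a rough illustration of how an entry might be used, the Python sketch below reads the lastmod value as a UTC datetime and builds an HTTP Range request from the filename, offset and length fields. It assumes the usual Common Crawl index line layout (URL key, capture timestamp, then a JSON object); the base URL, the helper names and the sample line are illustrative assumptions, not taken from this dataset.

```python
import json
from datetime import datetime, timezone

# Assumption: public base URL for Common Crawl data; substitute your own mirror.
BASE_URL = "https://data.commoncrawl.org/"


def parse_index_line(line):
    """Split one index line into (urlkey, timestamp, fields).

    Assumes the layout "<urlkey> <timestamp> <JSON object>".
    """
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def last_modified(fields):
    """Return lastmod as a timezone-aware UTC datetime, or None if absent."""
    value = fields.get("lastmod")
    if value is None:
        return None  # most entries have no lastmod field
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def range_request(fields):
    """Build the URL and Range header covering one WARC record."""
    offset = int(fields["offset"])
    length = int(fields["length"])
    url = BASE_URL + fields["filename"]
    # HTTP byte ranges are inclusive at both ends.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Made-up sample line for illustration only; not a real index entry.
SAMPLE = (
    'org,example)/ 20190817000000 {"url": "http://example.org/", '
    '"status": "200", "length": "1234", "offset": "5678", '
    '"filename": "crawl-data/CC-MAIN-2019-35/example.warc.gz", '
    '"lastmod": "1546300800"}'
)

if __name__ == "__main__":
    _, _, fields = parse_index_line(SAMPLE)
    print(last_modified(fields))
    print(range_request(fields))
```

The returned URL and headers can be passed to any HTTP client; the response body would be a single gzip-compressed WARC record.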


Data and Resources

Additional Info

Field Value
Contact point
Dataset privacy: Public
Dataset access requirements
Landing Page
Geographical coverage
Start of time period covered by this dataset
End of time period covered by this dataset
Theme / Category
Access rights
Conforms To
Publishing frequency
Other identifiers
Provenance: Combines material from Common Crawl dataset CC-MAIN-2019-35 (see 'related dataset' metadata below): (a) the columnar index and (b) the Last-Modified header values from those Response records in the WARC files that have one.
Qualified Attribution
Qualified Relation
Related resources
Release or publication Date
Sample distribution of the dataset
A related dataset from which this dataset is derived
Minimum spatial separation resolvable in the dataset (measured in metres)
Minimum time period
Dataset type
The most recent date on which the dataset was changed or modified
A description of the differences between this version and a previous version of this dataset
Activity that generated the dataset