Monitoring webpages with Last-Modified and ETag headers


Nitin Venkatesh
published July 14, 2014, midnight


Most sites send HTTP headers that can help us tell whether a web page has changed since the last time we visited it. This of course depends on whether the server sends these headers and has been configured to update them when the content changes.

When we query a web page, the server responds with a set of headers. For example:

$ curl -I www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
HTTP/1.1 200 OK
Date: Sat, 12 Jul 2014 17:19:27 GMT
Server: Apache/2
Last-Modified: Wed, 01 Sep 2004 13:24:52 GMT
ETag: "1edec-3e3073913b100"
Accept-Ranges: bytes
Content-Length: 126444
Cache-Control: max-age=21600
Expires: Sat, 12 Jul 2014 23:19:27 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Content-Type: text/html; charset=iso-8859-1

(The -I switch makes curl send a HEAD request, so it displays only the headers returned instead of the resource itself.)

If you're wondering which page I'm querying, it's the W3C page dealing with HTTP Header Definitions. Why this page? It's the first page I came across that responds with both of the headers we're going to talk about in this post.

The two HTTP headers we're interested in are the Last-Modified and the ETag headers.

Last-Modified: Wed, 01 Sep 2004 13:24:52 GMT
ETag: "1edec-3e3073913b100"
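As a rough sketch of how these two values could be pulled out programmatically, the Python snippet below parses a saved header block like the one curl printed above. The function name extract_validators is hypothetical, and RAW_HEADERS is just a trimmed copy of the response shown earlier.

```python
from email.parser import Parser

# Trimmed copy of the header block from the curl output above.
# The status line is left out because it is not a "Name: value" header.
RAW_HEADERS = """\
Date: Sat, 12 Jul 2014 17:19:27 GMT
Server: Apache/2
Last-Modified: Wed, 01 Sep 2004 13:24:52 GMT
ETag: "1edec-3e3073913b100"
Content-Type: text/html; charset=iso-8859-1
"""


def extract_validators(raw_headers):
    """Return (last_modified, etag) from a block of HTTP header lines.

    Either value is None when that header is missing.
    Header names are matched case-insensitively.
    """
    message = Parser().parsestr(raw_headers, headersonly=True)
    return message.get("Last-Modified"), message.get("ETag")


last_modified, etag = extract_validators(RAW_HEADERS)
print(last_modified)
print(etag)
```

Note that the ETag comes back with its double quotes intact, which matters when we send it back to the server later.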

1. Monitoring for changes with Last-Modified

We make a request to the page with an If-Modified-Since header holding a date. The date must be in the same format the server uses in its response, so we typically copy the Last-Modified value given by the server and use it as our If-Modified-Since value.

$ curl -I --header 'If-Modified-Since: Wed, 01 Sep 2004 13:24:52 GMT'  www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

HTTP/1.1 304 Not Modified
Date: Sat, 12 Jul 2014 17:29:35 GMT
Server: Apache/2
ETag: "1edec-3e3073913b100"
Expires: Sat, 12 Jul 2014 23:29:35 GMT
Cache-Control: max-age=21600

Notice the first line of the response from the server: it says 304 Not Modified, so the page hasn't changed since that date.
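The same check can be sketched in Python with only the standard library. This is a minimal illustration, not a definitive implementation: the helper names http_date and modified_since are made up for this post, and the 10 second timeout is an arbitrary choice. One thing to be aware of is that urllib reports a 304 response by raising HTTPError, so we catch it and look at the status code.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(moment):
    """Format a timezone-aware datetime in the HTTP date format,
    e.g. 'Wed, 01 Sep 2004 13:24:52 GMT'."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def modified_since(url, last_modified):
    """Return True if the server says the page changed after last_modified.

    last_modified is the Last-Modified string saved from an earlier
    response (or a value produced by http_date).
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"If-Modified-Since": last_modified}
    )
    try:
        with urllib.request.urlopen(request, timeout=10):
            return True  # 200 OK: the server considers the page modified
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return False  # 304 Not Modified
        raise  # any other status is a real error


# Building the header value ourselves instead of copying it from curl:
print(http_date(datetime(2004, 9, 1, 13, 24, 52, tzinfo=timezone.utc)))
```

Calling modified_since with the W3C URL and the Last-Modified value from above should return False for as long as the server keeps reporting that same date.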

2. Monitoring for changes with ETag

We make a request to the page with an If-None-Match header holding the value of the ETag header from the initial server response. Note that if the ETag value includes quotes (" or '), your header's value should include them as well.

$ curl -I --header 'If-None-Match: "1edec-3e3073913b100"' www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

HTTP/1.1 304 Not Modified
Date: Sat, 12 Jul 2014 17:34:23 GMT
Server: Apache/2
ETag: "1edec-3e3073913b100"
Expires: Sat, 12 Jul 2014 23:34:23 GMT
Cache-Control: max-age=21600

The 304 Not Modified response shows that the page hasn't been modified. If the page were to change, the ETag would update as well.

Note: If we leave out the quotes that wrap the ETag in the server response, we get an unexpected result. For example:

$ curl -I --header 'If-None-Match: 1edec-3e3073913b100' www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

HTTP/1.1 200 OK
Date: Sat, 12 Jul 2014 17:37:26 GMT
Server: Apache/2
Last-Modified: Wed, 01 Sep 2004 13:24:52 GMT
ETag: "1edec-3e3073913b100"
Accept-Ranges: bytes
Content-Length: 126444
Cache-Control: max-age=21600
Expires: Sat, 12 Jul 2014 23:37:26 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Content-Type: text/html; charset=iso-8859-1

I forgot to wrap the header value in " as in the server response, so the value didn't match the page's ETag. The server therefore answered 200 OK with the full set of headers, just as it would if the page had been modified.
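To avoid this mistake in a script, one option is to normalise the value before sending it. The sketch below is only an illustration under my own assumptions: quoted_etag and etag_changed are hypothetical helper names, and adding the quotes automatically is a convenience I chose here, not something the server requires us to do this way. As with any urllib request, a 304 arrives as an HTTPError.

```python
import urllib.error
import urllib.request


def quoted_etag(value):
    """Wrap value in double quotes unless it is already quoted.

    Values starting with W/" (weak ETags) are left untouched so the
    prefix is not swallowed into the quotes.
    """
    value = value.strip()
    if value.startswith('"') or value.startswith('W/"'):
        return value
    return '"' + value + '"'


def etag_changed(url, etag):
    """Return True if the page no longer matches the saved ETag."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"If-None-Match": quoted_etag(etag)}
    )
    try:
        with urllib.request.urlopen(request, timeout=10):
            return True  # 200 OK: the ETag we sent did not match
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return False  # 304 Not Modified: the ETag still matches
        raise


# The unquoted value from the failed curl call gets its quotes back:
print(quoted_etag("1edec-3e3073913b100"))
```

With this helper, passing either the quoted or the unquoted form of the ETag sends the same If-None-Match header, so the unexpected 200 OK from the example above should not occur.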