Saturday, November 7, 2015

A Word About Using External Web Resources

Today one can make a surprisingly attractive and responsive website with a couple of hundred lines of code. This is really cool; but one should not forget that most of the hard work is being done by the plethora of libraries (JS) and style definitions (CSS) that we (sometimes unconsciously) incorporate into our code.

Content delivery networks (CDNs) publicly serve static web resources (e.g. the jQuery and Bootstrap libraries) for use in other websites. This has multiple advantages:

  • Users do not have to download the resources from scratch whenever they visit a new website, as long as the resources remain cached in their browsers. (CDN-hosted resources are often deliberately served with long expiry times to facilitate exactly this.)
  • Resources are standardized across web sites, leading to an "unconscious agreement" among different developers regarding usage of external resources.

Unfortunately, many developers have grown accustomed to serving their own copies of static web resources, hosted within their own site domains, rather than including them from CDN sources. This is mainly done for convenience during development: developers can cut down the bandwidth usage and latency of over-the-wire CDN resource fetches, and stick to the same set of resource versions throughout their development work.

An even worse approach is seen on many other sites, even some highly recognized ones like certain Google Apps domains, where the URL of a resource is dynamically changed to force the download of a fresh copy, even when one is already present in the local browser cache. For example, the resource http://www.example.com/static/style.css may be requested as http://www.example.com/static/style.css?_=xxxxxx, where xxxxxx is something like a timestamp, a version number or a randomly generated string.

Visiting several such sites results in the browser having to retain multiple copies of the same resource in memory (or disk), despite their content being identical, since it has to honor the possibility that distinct URIs may correspond to distinct content at any time.

Another trick used by websites is the mutation of the URL itself, rather than the query parameters, using hashes and other means. This is typically seen among resource URLs used in sites like Facebook, Google Apps (Groups, Apps Script etc.), LinkedIn and eBay. For example, the URL we demonstrated above may be transformed into http://www.example.com/static/style-xxxxxx.css where xxxxxx may correspond to some alphanumeric string (which may, at times, contain dashes and other symbols as well).

As all the above-mentioned malpractices lead to increased latency and bandwidth usage on the user's side, some better design approaches would be to:

  • use public CDN resource URLs whenever possible
  • While this may be disadvantageous from the developer's point of view (page loads during development get slower the moment you switch from local to remote resources), the impact can be minimized by devising a scheme that uses local resources during development and switches everything to the remote versions on deployment.

  • avoid using dynamic or parametric URIs (e.g. with timestamps) for supposedly static resources that get updated only infrequently
  • HTTP provides techniques like expiration headers (Expires (response), If-Modified-Since (request) etc.) and ETags for propagating updates to the client side; a rough sketch of their use follows this list. Although these can be tricky to master, they can turn out to be quite efficient in terms of cache and bandwidth usage when it comes to static resources.

  • set up the expiration headers of your web server properly
  • This is essential for avoiding frequent cache-entry expirations, which result in repeated downloads of the same resource and strain the client (browser) as well as the server.
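To make this a little more concrete, here is a rough sketch of how these headers behave on the wire, using curl against the hypothetical example.com stylesheet URL from earlier (the exact headers and status codes you get back will, of course, depend entirely on how the server is configured):

# ask for headers only (-I sends a HEAD request, -s hides the progress output);
# a well-configured server would include caching headers such as
# Cache-Control/Expires, Last-Modified and ETag in its response
curl -sI http://www.example.com/static/style.css

# revalidate an already-cached copy: if the resource has not changed since the
# given date (a placeholder value here), the server can answer with
# "304 Not Modified" and an empty body instead of resending the whole file
# (-o discards the body, -w prints just the HTTP status code)
curl -s -o /dev/null -w '%{http_code}\n' \
    -H 'If-Modified-Since: Sat, 07 Nov 2015 00:00:00 GMT' \
    http://www.example.com/static/style.css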

Know Your Limits: Check File Size Before You Download!

At times it may be necessary for you to check the size of a downloadable file before actually downloading it. Although many well-designed websites display file sizes alongside their URLs, there are ample cases where there is no such indication. A careless mouse click may cost you a significant amount of data before you realize that the download cannot proceed under your current data package quota, or that you have simply picked the wrong download link.

While browsers like Firefox can be configured to display a confirmation dialog showing the file size before the download starts in earnest, this is somewhat of an illusion: modern browsers generally start downloading the file in the background while the confirmation dialog is still in the foreground.

Fortunately, in most cases you can use an HTTP tool (like wget or curl) to check the size of the associated file in advance, without actually initiating the content download. This works not only for files, but for most other kinds of resources as well.

Here's how you can use wget to check the size of a download without initiating it. It uses the HTTP HEAD method to restrict the server response to headers, avoiding the actual payload (download content).

wget -S --method=HEAD -O - your_url_goes_here

-S asks wget to print the server's response (the headers, in this case), while -O writes the output to a file, with - indicating that the output file is standard output (in our case, the terminal).

On a Linux machine with wget version 1.15, this would provide an output similar to what follows:

$ wget -S --method=HEAD -O - http://jflex.de/release/jflex-1.6.1.tar.gz
Resolving jflex.de (jflex.de)... 65.19.178.144
Connecting to jflex.de (jflex.de)|65.19.178.144|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Sat, 07 Nov 2015 16:05:43 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Sat, 11 Apr 2015 02:44:19 GMT
  ETag: "2e334f-51369db8566c0"
  Accept-Ranges: bytes
  Content-Length: 3027791
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: application/x-gzip
Length: 3027791 (2.9M) [application/x-gzip]
Remote file exists.

As seen above, the download size is indicated by the Content-Length header. Unless transmission errors force parts of the content to be retransmitted, this is roughly the amount of data that will be consumed by the download (plus a small margin for headers and other lower-level synchronization and acknowledgement traffic).
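If you have curl at hand instead of wget, a minimal equivalent (shown here against the same JFlex URL) is to issue a HEAD request with -I and pick out the relevant header:

# -s hides the progress output, -I sends a HEAD request (headers only)
curl -sI http://jflex.de/release/jflex-1.6.1.tar.gz | grep -i '^content-length'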

Unfortunately, some servers may not provide the Content-Length header, in which case the length would either appear as unspecified or not appear at all. In such cases an attempt via a browser would produce the same result. As of now I haven't been able to find a workaround for this issue.

Sunday, November 1, 2015

A Digital Odyssey: Sailing the Firefox HTTP Cache

The cache is one great feature of Firefox that brings life to the concept of offline browsing, allowing you to browse pages you have already visited without reconnecting to the Internet. It does this by saving (caching) content when it is fetched for the first time, indexed under the corresponding resource URI, so that it can later be served directly from memory (RAM or disk) without downloading a fresh copy. (Well, other browsers do this as well; but in some, like Google Chrome, it's pretty hard for a mere mortal to use offline browsing, unlike in Firefox, IE or Opera.)

The cache gives you multiple advantages, the most prominent being that you do not have to hook up to the Internet whenever you want to visit a page, as long as the page (resource) has been visited previously within a reasonable time window. Even under normal (online) browsing, a resource would not be re-fetched if a valid (unexpired) copy already exists in the cache.

Traversing the Firefox cache is not a very common use case, but it has some interesting applications. One is full-text search. Say, for instance, you want to find that great article you looked up some day last week but forgot to bookmark or save; you don't remember much, only that its body text (not the title or URL, unfortunately) contained the word "Android". Instead of running a web search and skimming through dozens of irrelevant results, you can simply do a text search against the browser's cache, which would return all cached pages containing the word (assuming the entry hasn't been cleared away by expiration or a crash). Sophisticated tools like CacheViewer are quite good at this.

However, if you wish to implement your own cache traversal piece of code (maybe for an add-on), you might easily get frustrated over your first few attempts. Unfortunately, Mozilla's detailed Firefox documentation does not seem to cover the matter adequately, especially after the cache management mechanisms were updated last year.

Cache traversal logic for Firefox is mostly asynchronous, implemented using the visitor design pattern. It requires that you obtain an instance of nsICacheStorageService via Components.classes["..."].getService() and invoke asyncVisitStorage on a diskCacheStorage or memoryCacheStorage retrieved through it, passing an instance of nsICacheStorageVisitor as a parameter. The instance should define an onCacheEntryInfo() event callback which is invoked whenever a cached resource entry is visited, and an onCacheEntryVisitCompleted() event callback that gets invoked when all visits have been completed.

onCacheEntryInfo() receives a parameter containing attributes of the resource visited (such as URI, size and last visited date) and a stream reference that can be used to read the resource content. Logic can be included to operate on these attributes on a per-resource basis (since the callbacks would be invoked independently). Normal JS tricks like closures can be used to accumulate such results, perhaps to be combined at the traversal completion event.

For example, the following snippet of code (adapted from my own "aggressive URL remapper for Firefox" here) will traverse the cache, looking for resources corresponding to partial URLs listed in f, and accumulate all matches into a separate array r to be processed when the traversal is completed:

var cacheService = Components.classes["@mozilla.org/netwerk/cache-storage-service;1"]
	.getService(Components.interfaces.nsICacheStorageService);	//note the "netwerk"!
var {LoadContextInfo} = Components.utils.import("resource://gre/modules/LoadContextInfo.jsm",{});
var cache = cacheService.diskCacheStorage(LoadContextInfo.default, true);
cache.asyncVisitStorage({
	f: [ /*list of URL segments to be searched*/ ],
	r: [ /*list of URL segments found*/ ],

	// this will run for each cache entry
	onCacheEntryInfo: function(entryInfo) {
		var url = entryInfo.key;
		for(var i in this.f) {
			if(url.indexOf(this.f[i]) > 0) {
				/* do your stuff */
				this.r.push(url);
			}
		}
	},

	// this will run when traversal is over
	onCacheEntryVisitCompleted: function() {
		var diff = this.f.length - this.r.length;
		if(diff > 0) {
			alert('Warning: ' + diff + ' URLs missing');
		}
		/* process this.r here */
	}
}, true);

Please note, however, that traversing a large cache may hang your browser for quite some time, as a huge number of operations may be carried out in the process. If your callback methods tend to use up too much memory (e.g. by accumulating data in-memory), it's even possible that the browser may crash altogether. So beware!

Some of the interface definitions used for composing the above code were obtained from the mozilla/newtab-dev GitHub repository.