Sunday, November 1, 2015

A Digital Odyssey: Sailing the Firefox HTTP Cache

The cache is one of Firefox's great resources. It brings life to the concept of offline browsing, allowing you to browse pages you have already visited without reconnecting to the Internet. It does this by saving (caching) content when it is fetched for the first time, indexed under the corresponding resource URI, so that it can later be served directly from local storage (RAM or disk) without downloading a fresh copy. (Well, other browsers do this as well; but in some, like Google Chrome, it's pretty hard for a mere mortal to use offline browsing, unlike in Firefox, IE or Opera.)

The cache gives you multiple advantages, the most prominent being that you do not have to hook up to the Internet whenever you want to visit a page, as long as the page (resource) has been visited previously within a reasonable time window. Even under normal (online) browsing, a resource would not be re-fetched if a valid (unexpired) copy already exists in the cache.

Traversing the Firefox cache is not a very common use case, but it has some interesting applications. One is full text search. Say, for instance, you want to find that great article you looked up some day last week, but forgot to bookmark or save; you don't remember much, only that its body text (not the title or URL, unfortunately) contained the word "Android". Instead of running a web search and skimming through dozens of irrelevant results one by one, you can simply do a text search against the browser's cache, which would return all cached articles containing the word (assuming the entry hasn't been cleared away by either an expiration or a crash). Sophisticated tools like CacheViewer are quite good at this.

However, if you wish to implement your own cache traversal code (maybe for an add-on), you might easily get frustrated over your first few attempts. Unfortunately, Mozilla's detailed Firefox documentation does not seem to cover the matter adequately, especially after the cache management mechanisms were updated last year.

Cache traversal logic for Firefox is mostly asynchronous, implemented using the visitor design pattern. It requires that you obtain an instance of nsICacheStorageService via Components.classes["..."].getService() and invoke asyncVisitStorage on a diskCacheStorage or memoryCacheStorage retrieved through it, passing an instance of nsICacheStorageVisitor as a parameter. The instance should define an onCacheEntryInfo() event callback which is invoked whenever a cached resource entry is visited, and an onCacheEntryVisitCompleted() event callback that gets invoked when all visits have been completed.
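The shape of this visitor contract can be tried outside Firefox with a small stand-in. The sketch below is only an illustration: makeFakeStorage is a hypothetical helper, not a Firefox API, and unlike the real storage it calls back synchronously to keep the example short. The entry objects carrying a key property follow the snippet later in this post. Only the method and callback names are taken from the description above.

```javascript
// Hypothetical stand-in for a cache storage; NOT a Firefox API.
function makeFakeStorage(keys) {
  return {
    // Same method name as the real storage. The real call returns before the
    // callbacks fire; this stand-in calls back synchronously to stay short.
    asyncVisitStorage: function (visitor, visitEntries) {
      if (visitEntries) {
        keys.forEach(function (key) {
          visitor.onCacheEntryInfo({ key: key });
        });
      }
      visitor.onCacheEntryVisitCompleted();
    }
  };
}

// A visitor is just an object with the two callbacks.
var log = [];
var visitor = {
  onCacheEntryInfo: function (entryInfo) {
    log.push("entry " + entryInfo.key);
  },
  onCacheEntryVisitCompleted: function () {
    log.push("done");
  }
};

makeFakeStorage(["http://a.example/", "http://b.example/"])
  .asyncVisitStorage(visitor, true);
console.log(log.join("\n"));
```

In the real browser the completion callback is the only reliable place to act on the full result, since the visiting call itself returns before any entry has been reported.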

onCacheEntryInfo() receives a parameter containing attributes of the resource visited (such as URI, size and last visited date) and a stream reference that can be used to read the resource content. Logic can be included to operate on these attributes on a per-resource basis (since the callbacks would be invoked independently). Normal JS tricks like closures can be used to accumulate such results, perhaps to be combined at the traversal completion event.
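As a rough sketch of the closure trick, the visitor below keeps its matches in a variable that only its two callbacks can see, and hands them over when the traversal completes. The names makeCollector and onDone are hypothetical, and the entry objects are assumed to expose a key property as in the snippet in this post. The callbacks are driven by hand here so the example runs outside Firefox.

```javascript
// Build a visitor whose callbacks share state through a closure.
// makeCollector and onDone are hypothetical names for this sketch.
function makeCollector(needle, onDone) {
  var matches = []; // visible only to the two callbacks below

  return {
    // called once per cache entry
    onCacheEntryInfo: function (entryInfo) {
      if (entryInfo.key.indexOf(needle) !== -1) {
        matches.push(entryInfo.key);
      }
    },
    // called once, after the last entry
    onCacheEntryVisitCompleted: function () {
      onDone(matches);
    }
  };
}

// Drive the callbacks by hand, the way a traversal would.
var result = null;
var collector = makeCollector("example.org", function (found) {
  result = found;
});
collector.onCacheEntryInfo({ key: "http://example.org/a" });
collector.onCacheEntryInfo({ key: "http://other.test/b" });
collector.onCacheEntryInfo({ key: "http://example.org/c" });
collector.onCacheEntryVisitCompleted();
console.log(result);
```

Because each call to makeCollector gets its own matches array, two traversals running at the same time would not mix their results.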

For example, the following snippet of code (adapted from my own "aggressive URL remapper for Firefox") will traverse the cache, looking for resources whose URLs contain any of the partial URLs listed in f, and accumulate all matching URLs into a separate array r to be processed when the traversal is completed:

var cacheService = Components.classes["@mozilla.org/netwerk/cache-storage-service;1"]
	.getService(Components.interfaces.nsICacheStorageService);	//note the "netwerk"!
var {LoadContextInfo} = Components.utils.import("resource://gre/modules/LoadContextInfo.jsm",{});
var cache = cacheService.diskCacheStorage(LoadContextInfo.default, true);
cache.asyncVisitStorage({
	f: [ /*list of URL segments to be searched*/ ],
	r: [ /*list of matching URLs found*/ ],

	// this will run for each cache entry
	onCacheEntryInfo: function(entryInfo) {
		var url = entryInfo.key;
		for(var i = 0; i < this.f.length; i++) {
			// indexOf() returns -1 on no match; 0 means a match at the very start
			if(url.indexOf(this.f[i]) >= 0) {
				/* do your stuff */
				this.r.push(url);
			}
		}
	},

	// this will run when traversal is over
	onCacheEntryVisitCompleted: function() {
		// rough check: assumes each segment matches at most one cached URL
		var diff = this.f.length - this.r.length;
		if(diff > 0) {
			alert('Warning: ' + diff + ' URLs missing');
		}
		/* process this.r here */
	}
}, true);
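Note that comparing array lengths is only a rough check: if one segment matches several cached URLs, the count can look fine even though another segment matched nothing. A stricter check could look at each segment individually. The helper below is a sketch of that idea; findMissingSegments is a hypothetical name, not part of the original remapper, and it is plain JavaScript that could be called from the completion callback with this.f and this.r.

```javascript
// Hypothetical helper: which of the searched segments matched no visited URL?
function findMissingSegments(segments, urls) {
  return segments.filter(function (segment) {
    // keep the segment only if no URL contains it
    return !urls.some(function (url) {
      return url.indexOf(segment) !== -1;
    });
  });
}

// Three segments, three matching URLs, yet one segment is still missing.
var missing = findMissingSegments(
  ["/news/", "/blog/", "/shop/"],
  [
    "http://example.org/news/1",
    "http://example.org/news/2",
    "http://example.org/blog/x"
  ]
);
console.log(missing); // [ '/shop/' ]
```

Here a plain length comparison would give 3 - 3 = 0 and raise no warning, while the per-segment check reports "/shop/".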

Please note, however, that traversing a large cache may hang your browser for quite some time, as a huge number of operations may be carried out in the process. If your callback methods tend to use up too much memory (e.g. by accumulating data in-memory), it's even possible that the browser may crash altogether. So beware!
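One way to keep memory use in check is to cap how much the visitor accumulates. The sketch below stores at most maxMatches URLs and merely counts the rest. The names makeBoundedCollector, maxMatches and onDone are hypothetical, the entry objects are assumed to expose a key property as in the snippet above, and the callbacks are driven by hand so the example runs outside Firefox. It limits memory only; it does not make the traversal itself any shorter.

```javascript
// Visitor that keeps at most maxMatches results and counts the overflow.
// makeBoundedCollector, maxMatches and onDone are hypothetical names.
function makeBoundedCollector(needle, maxMatches, onDone) {
  var matches = [];
  var dropped = 0; // matches seen but not stored

  return {
    onCacheEntryInfo: function (entryInfo) {
      if (entryInfo.key.indexOf(needle) === -1) {
        return; // not a match, nothing to store
      }
      if (matches.length < maxMatches) {
        matches.push(entryInfo.key);
      } else {
        dropped++;
      }
    },
    onCacheEntryVisitCompleted: function () {
      onDone(matches, dropped);
    }
  };
}

// Drive the callbacks by hand with five matching entries and a cap of two.
var kept = null;
var overflow = null;
var bounded = makeBoundedCollector("example.org", 2, function (m, d) {
  kept = m;
  overflow = d;
});
for (var n = 1; n <= 5; n++) {
  bounded.onCacheEntryInfo({ key: "http://example.org/page" + n });
}
bounded.onCacheEntryVisitCompleted();
console.log(kept.length + " kept, " + overflow + " dropped");
```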

Some of the interface definitions used for composing the above code were obtained from the mozilla/newtab-dev GitHub repository.
