Tuesday, October 15, 2019

JAR File Handles: Clean Up After Your Mess!

In Ultra ESB we use a special hot-swap classloader that allows us to reload Java classes on demand. This allows us to literally hot-swap our deployment units - load, unload, reload with updated classes, and phase them out gracefully - without restarting the JVM.
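(For the unfamiliar, the general pattern looks something like the minimal sketch below - not Ultra ESB's actual loader, and the JAR path and class name are made up. Discarding a loader instance and defining the classes afresh through a new one effectively "reloads" them.)

import java.net.URL;
import java.net.URLClassLoader;

// minimal sketch of the hot-swap idea: drop the old loader,
// define the classes again through a fresh one
public class HotSwapDemo {

    public static void main(String[] args) throws Exception {
        URL[] unit = { new URL("file:/path/to/deployment-unit.jar") }; // hypothetical JAR

        // version 1 of the deployment unit
        try (URLClassLoader v1 = new URLClassLoader(unit, HotSwapDemo.class.getClassLoader())) {
            Class<?> worker = v1.loadClass("com.example.Worker"); // hypothetical class
            // ... route traffic through v1's classes ...
        } // closing v1 makes its class definitions collectable

        // after the JAR is updated: a fresh loader picks up the new bytecode
        try (URLClassLoader v2 = new URLClassLoader(unit, HotSwapDemo.class.getClassLoader())) {
            Class<?> worker = v2.loadClass("com.example.Worker"); // fresh definition
        }
    }
}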

Hot swap, baby!

Windows: supporting the forbidden land

In Ultra ESB Legacy the loader was working fine on Windows, but on the newer X-version it seemed to be having some hiccups. We are not supporting Windows as a target platform, so it didn't matter much - until recently, when we decided to support non-production distros on Windows. (Our enterprise integration IDE UltraStudio runs fine on Windows, so Windows devs, you are all covered.)

AdroitLogic UltraStudio - Your Enterprise Integration IDE

TDD FTW

Fixing the classloader was a breeze, and all tests were passing; but I wanted to back my fixes up with some extra tests, so I wrote a few new ones. Most of these involved creating a new JAR file in a subdirectory under the system temp directory, and using the hot-swap classloader to load different artifacts that were placed inside the JAR. For extra credit on best practices, I also made sure to add some cleanup logic to delete the temp subdirectory via FileUtils.deleteDirectory().
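In condensed form, the new tests looked something like this (a hypothetical reconstruction; only the class and test method names also appear in the actual stacktrace further down):

import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.commons.io.FileUtils;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

// condensed, hypothetical shape of the new tests: build a JAR under a temp
// subdirectory, load artifacts from it, delete the subdirectory at the end
public class HotSwapClassLoaderTest {

    private Path tempDir;
    private Path jarPath;

    @BeforeClass
    public void setUp() throws Exception {
        tempDir = Files.createTempDirectory("hot-swap-test");
        jarPath = tempDir.resolve("test-artifacts.jar");
        // ... write the test classes/resources into jarPath ...
    }

    @Test
    public void testServiceLoader() throws Exception {
        // ... load artifacts from jarPath via the hot-swap classloader ...
    }

    @AfterClass
    public void tearDown() throws Exception {
        // the cleanup that started failing on Windows
        FileUtils.deleteDirectory(tempDir.toFile());
    }
}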

And then, things went nuts.

And the tear-down was no more.

All tests were passing on both Linux and Windows; but the final tear-down logic was failing on Windows, right at the point where I deleted the temp subdirectory.

Being on Windows, I didn't have the luxury of lsof; fortunately, Sysinternals already had just the thing I needed: handle64.

Finding the culprit was pretty easy: hit a breakpoint in tearDown() just before the directory tree deletion call, and run handle64 {my-jar-name}.jar.

Bummer.

My test Java process was holding a handle to the test JAR file.

Hunting for the leak

No. Seriously. I didn't.

Naturally, my first suspect was the classloader itself. I spent almost half an hour going over the classloader codebase again and again. No luck. Everything seemed rock solid.

The "leak dumper"; a.k.a my Grim Reaper for file handles

My best shot was to see what piece of code had opened the handle to the JAR file. So I wrote a quick-n-dirty patch to Java's FileInputStream and FilterInputStream that would dump an acquire-time stacktrace snapshot whenever a thread held a stream open for too long.

This "leak dumper" was partly inspired by our JDBC connection pool that detects unreleased connections (subject to a grace period) and then dumps the stacktrace of the thread that borrowed it - back at the time it was borrowed. (Kudos to Sachini, my former colleague-intern at AdroitLogic.)

The leak, exposed!

Sure enough, the stacktrace revealed the culprit:

id: 174 created: 1570560438355
--filter--

  java.io.FilterInputStream.<init>(FilterInputStream.java:13)
  java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:81)
  java.util.zip.ZipFile$ZipFileInflaterInputStream.<init>(ZipFile.java:408)
  java.util.zip.ZipFile.getInputStream(ZipFile.java:389)
  java.util.jar.JarFile.getInputStream(JarFile.java:447)
  sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:162)
  java.net.URL.openStream(URL.java:1045)
  org.adroitlogic.x.base.util.HotSwapClassLoader.loadSwappableClass(HotSwapClassLoader.java:175)
  org.adroitlogic.x.base.util.HotSwapClassLoader.loadClass(HotSwapClassLoader.java:110)
  org.adroitlogic.x.base.util.HotSwapClassLoaderTest.testServiceLoader(HotSwapClassLoaderTest.java:128)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:498)
  org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:86)
  org.testng.internal.Invoker.invokeMethod(Invoker.java:643)
  org.testng.internal.Invoker.invokeTestMethod(Invoker.java:820)
  org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1128)
  org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
  org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
  org.testng.TestRunner.privateRun(TestRunner.java:782)
  org.testng.TestRunner.run(TestRunner.java:632)
  org.testng.SuiteRunner.runTest(SuiteRunner.java:366)
  org.testng.SuiteRunner.runSequentially(SuiteRunner.java:361)
  org.testng.SuiteRunner.privateRun(SuiteRunner.java:319)
  org.testng.SuiteRunner.run(SuiteRunner.java:268)
  org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
  org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
  org.testng.TestNG.runSuitesSequentially(TestNG.java:1244)
  org.testng.TestNG.runSuitesLocally(TestNG.java:1169)
  org.testng.TestNG.run(TestNG.java:1064)
  org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
  org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:123)

Gotcha!

  java.io.FilterInputStream.<init>(FilterInputStream.java:13)
  ...
  sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:162)
  java.net.URL.openStream(URL.java:1045)
  org.adroitlogic.x.base.util.HotSwapClassLoader.loadSwappableClass(HotSwapClassLoader.java:175)

But still, that didn't tell the whole story. If URL.openStream() opens the JAR, why does it not get closed when we return from the try-with-resources block?

        try (InputStream is = jarURI.toURL().openStream()) {
            byte[] bytes = IOUtils.toByteArray(is);
            Class<?> clazz = defineClass(className, bytes, 0, bytes.length);
            ...
            logger.trace(15, "Loaded class {} as a swappable class", className);
            return clazz;

        } catch (IOException e) {
            logger.warn(16, "Class {} located as a swappable class, but couldn't be loaded due to : {}, " +
                    "trying to load the class as a usual class", className, e.getMessage());
            ...
        }

Into the wild: JarURLConnection, URLConnection, and beyond

Thanks to Sun Microsystems, who made it OSS, I could browse through the JDK source, right up to this shocking comment - all the way down, in java.net.URLConnection:

    private static boolean defaultUseCaches = true;

    /**
     * If <code>true</code>, the protocol is allowed to use caching
     * whenever it can. If <code>false</code>, the protocol must always
     * try to get a fresh copy of the object.
     * <p>
     * This field is set by the <code>setUseCaches</code> method. Its
     * value is returned by the <code>getUseCaches</code> method.
     * <p>
     * Its default value is the value given in the last invocation of the
     * <code>setDefaultUseCaches</code> method.
     *
     * @see     java.net.URLConnection#setUseCaches(boolean)
     * @see     java.net.URLConnection#getUseCaches()
     * @see     java.net.URLConnection#setDefaultUseCaches(boolean)
     */
    protected boolean useCaches = defaultUseCaches;

Yep, Java does cache JAR streams!

From sun.net.www.protocol.jar.JarURLConnection:

    class JarURLInputStream extends FilterInputStream {
        JarURLInputStream(InputStream var2) {
            super(var2);
        }

        public void close() throws IOException {
            try {
                super.close();
            } finally {
                if (!JarURLConnection.this.getUseCaches()) {
                    JarURLConnection.this.jarFile.close();
                }
            }

        }
    }

Gotcha!

If (well, because) useCaches is true by default, we're in for a big surprise!
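You can actually see the cache in action with a few lines (a sketch, assuming some JAR exists at /tmp/demo.jar): with caching on, two connections to the same JAR URL share a single underlying JarFile.

import java.net.JarURLConnection;
import java.net.URL;

// with useCaches=true (the default), both connections return the same
// cached JarFile - so closing one stream leaves the file handle open
public class JarCacheDemo {

    public static void main(String[] args) throws Exception {
        URL url = new URL("jar:file:/tmp/demo.jar!/");
        JarURLConnection c1 = (JarURLConnection) url.openConnection();
        JarURLConnection c2 = (JarURLConnection) url.openConnection();
        System.out.println(c1.getJarFile() == c2.getJarFile()); // prints true
    }
}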

Let Java cache its JARs, but don't break my test!

JAR caching would probably improve performance; but does that mean I should stop cleaning up after myself - and leave behind stray files after each test?

(Of course I could say file.deleteOnExit(); but since I was dealing with a directory hierarchy, there was no guarantee that things would get deleted in order, and undeleted directories would be left behind.)

So I wanted a way to clean up the JAR cache - or at least purge just my JAR entry; after I am done, but before the JVM shuts down.

Disabling JAR caching altogether - probably not a good idea!

URLConnection does offer an option to avoid caching, per connection - setUseCaches():

    /**
     * Sets the value of the <code>useCaches</code> field of this
     * <code>URLConnection</code> to the specified value.
     *
     * @param   usecaches   a <code>boolean</code> indicating whether
     * or not to allow caching
     * @throws  IllegalStateException if already connected
     * @see     #getUseCaches()
     */
    public void setUseCaches(boolean usecaches) {
        if (connected)
            throw new IllegalStateException("Already connected");
        useCaches = usecaches;
    }

It would have been perfect if we could use this to disable caching for just our JAR: our classloader caches all entries as soon as it opens a JAR, so it never needs to open/read that file again. However, useCaches can only be set before the connection is made; once our classloader has opened the JAR (with caching enabled), there's no getting rid of the cached file handle - until the JVM itself shuts down!
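(For the record, this is roughly what per-connection disabling would look like if you did control the very first open - a sketch, assuming the JAR hasn't already been cached:)

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class NoCacheRead {

    static byte[] readWithoutCaching(URL jarEntryUrl) throws Exception {
        URLConnection conn = jarEntryUrl.openConnection();
        conn.setUseCaches(false); // must be called before connecting
        try (InputStream is = conn.getInputStream()) {
            // with useCaches=false, JarURLInputStream.close() (seen earlier)
            // also closes the underlying JarFile
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) > 0; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}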

URLConnection also allows you to disable caching by default for all subsequent connections:

    /**
     * Sets the default value of the <code>useCaches</code> field to the
     * specified value.
     *
     * @param   defaultusecaches   the new value.
     * @see     #getDefaultUseCaches()
     */
    public void setDefaultUseCaches(boolean defaultusecaches) {
        defaultUseCaches = defaultusecaches;
    }
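
One quirk, at least in JDK 8: setDefaultUseCaches is an instance method that flips a static, JVM-wide flag - so you need some connection instance at hand (the JAR path below is made up):

        // flips a JVM-wide flag: every subsequent URLConnection skips the cache
        URLConnection conn = new URL("jar:file:/tmp/demo.jar!/").openConnection();
        conn.setDefaultUseCaches(false);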

However, if you disable it once, the whole JVM could be affected from that moment onwards - since it probably applies to all URLConnection-based implementations. As I said before, that could hinder performance - not to mention deviating my test from cache-enabled, real-world behavior.

Down the rabbit hole (again!): purging manually from the JarFileFactory

The least-invasive option is to remove my own JAR from the cache, when I know I'm done.

And good news, the cache - sun.net.www.protocol.jar.JarFileFactory - already has a close(JarFile) method that does the job.

But sadly, the cache class is package-private, meaning there's no way to manipulate it directly from my test code.

Reflection to the rescue!

Reflection it is.

Thanks to reflection, all I needed was one little "bridge" that would access and invoke jarFactory.close(jarFile) on my behalf:

class JarBridge {

    static void closeJar(URL url) throws Exception {

        // JarFileFactory jarFactory = JarFileFactory.getInstance();
        Class<?> jarFactoryClazz = Class.forName("sun.net.www.protocol.jar.JarFileFactory");
        Method getInstance = jarFactoryClazz.getMethod("getInstance");
        getInstance.setAccessible(true);
        Object jarFactory = getInstance.invoke(jarFactoryClazz);

        // JarFile jarFile = jarFactory.get(url);
        Method get = jarFactoryClazz.getMethod("get", URL.class);
        get.setAccessible(true);
        Object jarFile = get.invoke(jarFactory, url);

        // jarFactory.close(jarFile);
        Method close = jarFactoryClazz.getMethod("close", JarFile.class);
        close.setAccessible(true);
        //noinspection JavaReflectionInvocation
        close.invoke(jarFactory, jarFile);

        // jarFile.close();
        ((JarFile) jarFile).close();
    }
}

And in my test, I just have to say:

        JarBridge.closeJar(jarPath.toUri().toURL());

Right before deleting the temp directory.

Yay!

So, what's the take-away?

Nothing much for you, if you are not directly dealing with JAR files; but if you are, you might run into obscure "file in use" errors of this kind. (That would hold true for other URLConnection-based streams as well.)

If you happen to be as (un)lucky as I was, just recall that some notorious blogger had written some hacky "leak dumper" patch JAR that would show you exactly where your JAR (or non-JAR) leak is.

Adieu!

Friday, October 4, 2019

Is your JVM leaking file descriptors - like mine?

Okay, there's definitely a leak somewhere.

Foreword: The two issues described here were discovered and fixed more than a year ago. This article only serves as historical proof, and as a beginners' guide on tackling file descriptor leaks in Java.

In Ultra ESB we use an in-memory RAM disk file cache for fast and garbage-free payload handling. Some time back, we faced an issue on our shared SaaS AS2 Gateway where this cache was leaking file descriptors over time, eventually leading to "too many open files" errors when the system ulimit was hit.

The Legion of the Bouncy Castle: leftovers from your stream-backed MIME parts?

One culprit, we found, was Bouncy Castle - the famous security provider that we have loved dearly since the Ultra ESB Legacy days.

Bouncy Castle FTW!

With some simple tooling we found that BC had the habit of calling getContent() on MIME parts in order to determine their type (say, for instanceof checks). True, this wasn't a crime in itself; but most of our MIME parts were file-backed, with a file-cache file on the other end - meaning that every getContent() call opened a new stream to the file. Soon there were stray streams (and hence file descriptors) pointing to our file cache.

Enough of these, and we would exhaust the file descriptor quota allocated to the Ultra ESB (Java) process.

Solution? Make 'em lazy!

We didn't want to mess with the BC codebase. So we found a simple solution: create all file-backed MIME parts with "lazy" streams. Our (former) colleague Rajind wrote a LazyFileInputStream - inspired by LazyInputStream from jboss-vfs - that opens the actual file only when a read is attempted.

Yeah, lazy.
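Rajind's actual implementation had more to it, but the core idea can be sketched like this (hypothetical code; only the essential read() variants are shown):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// opens the actual file - and grabs the file descriptor - only when
// somebody attempts a read
public class LazyFileInputStream extends InputStream {

    private final File file;
    private InputStream delegate;

    public LazyFileInputStream(File file) {
        this.file = file;
    }

    private InputStream delegate() throws IOException {
        if (delegate == null) {
            delegate = new FileInputStream(file);
        }
        return delegate;
    }

    @Override
    public int read() throws IOException {
        return delegate().read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return delegate().read(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) { // never opened: nothing to close
            delegate.close();
        }
    }
}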

BC was happy, and so was the file cache; but we were the happiest.

Hibernate JPA: cleaning up after supper, a.k.a. closing consumed streams

Another bug we spotted was that some database operations were leaving behind unclosed file handles. Apparently this happened only when we were feeding stream-backed blobs to Hibernate, where the streams were often coming from file cache entries.

Hibernate: hassle-free ORM, but with leaks?

After some digging, we came up with a theory that Hibernate was not closing the underlying streams of these blob entries. (It made sense because the java.sql.Blob interface does not expose any methods that Hibernate could use to manipulate the underlying data sources.) This was a problem, though, because the discarded streams (and the associated file handles) would not get released until the next GC.

This would have been fine for a short-lived app, but a long-running one like ours could easily run out of file descriptors; such as in the case of a sudden and persistent spike in load.

Solution? Make 'em self-closing!

We didn't want to lose the benefits of streaming, but we didn't have control over our streams either. You might say we should have placed our streams in auto-closeable constructs (say, try-with-resources). Nice try; but sadly, Hibernate was reading them outside of our execution scope (especially in @Transactional flows). As soon as we started closing the streams within our code scope, our database operations started to fail miserably - screaming "stream already closed!".
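A hypothetical sketch of the dilemma (PayloadEntity and its field are made up; LobHelper.createBlob() is the real Hibernate API):

import java.io.InputStream;
import java.sql.Blob;

import org.hibernate.Session;

class PayloadEntity {
    private Blob payload;
    void setPayload(Blob payload) { this.payload = payload; }
}

class PayloadSaver {

    // Hibernate consumes the stream at flush/commit time,
    // after this method - and any try-with-resources in it - has exited
    void savePayload(Session session, InputStream cacheStream, long length, PayloadEntity entity) {
        Blob blob = session.getLobHelper().createBlob(cacheStream, length);
        entity.setPayload(blob);
        session.persist(entity);
        // closing cacheStream here -> "stream already closed" at flush;
        // not closing it -> the file descriptor lingers until the next GC
    }
}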

When in Rome, do as the Romans do, they say.

So, instead of messing with Hibernate, we decided we would take care of the streams ourselves.

Rajind (yeah, him again) hacked together a SelfClosingInputStream wrapper. This would keep track of the amount of data read from the underlying stream, and close it up as soon as the last byte was read.

Self-closing. Takes care of itself!
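In simplified form, the idea looks something like this (hypothetical code; this variant simply closes at end-of-stream rather than counting bytes against a known length):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// closes the underlying stream as soon as it is fully consumed
public class SelfClosingInputStream extends FilterInputStream {

    private boolean closed;

    public SelfClosingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b == -1) {
            closeOnce();
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n == -1) {
            closeOnce();
        }
        return n;
    }

    private void closeOnce() throws IOException {
        if (!closed) {
            closed = true;
            super.close(); // release the file descriptor right away
        }
    }
}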

(We did consider using existing options like AutoCloseInputStream from Apache commons-io; but it turned out that we needed some customizations here and there - like detailed trace logging.)

Bottom line

When it comes to resource management in Java, it is quite easy to over-focus on memory and CPU (processing), and forget about the rest. But virtual resources - like ephemeral ports and per-process file descriptors - can be just as important, if not more so.

Especially on long-running processes like our AS2 Gateway SaaS application, they can literally become silent killers.

You can detect this type of "leak" in two main ways:

  • "single-cycle" resource analysis: run a single, complete processing cycle, comparing resource usage before and after
  • long-term monitoring: continuously recording and analyzing resource metrics to identify trends and anomalies
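
For instance, a crude single-cycle check could look like this (hypothetical code; UnixOperatingSystemMXBean is a HotSpot-specific, Unix-only API):

import java.lang.management.ManagementFactory;

import com.sun.management.UnixOperatingSystemMXBean;

// compare open file descriptor counts before and after one full cycle
public class FdLeakCheck {

    private static long openFds() {
        return ((UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean())
                .getOpenFileDescriptorCount();
    }

    public static void main(String[] args) throws Exception {
        long before = openFds();
        runOneCycle();   // the complete processing cycle under test
        System.gc();     // let abandoned streams get finalized first
        Thread.sleep(1_000);
        long after = openFds();
        System.out.printf("open FDs before: %d, after: %d%n", before, after);
    }

    private static void runOneCycle() {
        // placeholder: exercise the system end-to-end here
    }
}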

In any case, fixing the leak is not too difficult once you have a clear picture of what you are dealing with.

Good luck with hunting down your resource-hog d(a)emons!