CDK Shorts #1 – Consistent asset hashing (NodeJS)

The first CDK Short is dedicated to how the AWS CDK handles asset hashing.

Many others and I have seen intermittent “issues” where assets such as Lambda code or S3 bucket content behave non-deterministically: they are uploaded on every deploy, even when the source has not changed. I finally took the time to do a thorough investigation. This is essentially a report of my findings and of how to get back to deterministic deployments.

Originally, I thought that upgrading the CDK would solve my problems, seeing that I was on an OLD version, so I went from v1.32.0 to v1.94.1 🙃. Problem solved – just kidding, I wish.

You are welcome to skip to the TL;DR at the end; proceed if you want to go down the rabbit hole.

Scenario

  1. From my local machine, when I run `cdk deploy` and then immediately run `cdk deploy` again, no assets are deployed the second time, aka deterministic.
  2. From AWS CodeBuild, when a build executes `cdk deploy` and I then immediately retry that build (meaning the source did not change), the second `cdk deploy` deploys the assets again, aka non-deterministic.

I did some digging on GitHub and looked at the CDK assets code. It looked pretty solid. I could see previous issues with the ordering of files/folders and with timestamps, but those have all been sorted out and have tests to back them up. So the problem is not with how the CDK determines the asset hash. The problem is what I/you point the CDK at.

The problem

Using scenario 2 to debug further, I downloaded both of my Lambda zip asset files from the AWS Console after the CodeBuild deployments. I abandoned inspecting the contents by hand after 5 minutes and wrote a small script using the folder-hash package.
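A comparison like this can be sketched with nothing but the Node.js standard library. The snippet below is not my original script (that one used folder-hash); it is a minimal sketch that assumes both zip files have already been extracted into two local directories, and the function names `hashFiles` and `diffDirs` are made up for illustration.

```typescript
import * as crypto from "crypto";
import * as fs from "fs";
import * as path from "path";

// Recursively map each file's path (relative to root) to the SHA-256 of its contents.
function hashFiles(
  root: string,
  dir: string = root,
  out: Record<string, string> = {}
): Record<string, string> {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      hashFiles(root, full, out);
    } else if (entry.isFile()) {
      out[path.relative(root, full)] = crypto
        .createHash("sha256")
        .update(fs.readFileSync(full))
        .digest("hex");
    }
  }
  return out;
}

// Return the relative paths that are missing on one side or differ in content.
function diffDirs(a: string, b: string): string[] {
  const left = hashFiles(a);
  const right = hashFiles(b);
  const seen: Record<string, true> = {};
  for (const file of Object.keys(left).concat(Object.keys(right))) {
    seen[file] = true;
  }
  return Object.keys(seen)
    .filter((file) => left[file] !== right[file])
    .sort();
}
```

Calling `diffDirs` with the two extracted directories returns the list of files worth opening in a diff tool.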

The culprit turned out to be inside the node_modules directory, and not just one module, but EVERY single module that contained a package.json file. Below is a diff of one of these packages 😮

Taking a deeper look into the contents of one of the node_modules packages:

Comparison of a node_module package.json file with a different _where property.

So npm, after installing each package, writes metadata (fields starting with an underscore, specifically _where) into that package's package.json, recording where it is located. AWS CodeBuild uses a clean, random directory for every build. To recap the events that transpire:

  1. AWS CodeBuild runs build in new directory
  2. We do our build process (npm install …. )
  3. CDK detects the hash is different because each package adds useless metadata to its package.json. Then proceeds to upload the new assets.
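The effect of those steps can be illustrated with a tiny sketch. The package name, version and build directories below are hypothetical, and the hash here is a plain SHA-256 rather than the CDK's own fingerprinting; the point is only that one directory-dependent field is enough to change the result.

```typescript
import * as crypto from "crypto";

function sha256(text: string): string {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Hypothetical package.json content as it might look after installation
// in a given build directory. Only the _where field varies.
function installedPackageJson(buildDir: string): string {
  return JSON.stringify(
    { name: "example-package", version: "1.0.0", _where: buildDir },
    null,
    2
  );
}

// Two builds of the same source in two different (made-up) directories.
const firstBuild = sha256(installedPackageJson("/tmp/build-aaa/lambda"));
const secondBuild = sha256(installedPackageJson("/tmp/build-bbb/lambda"));

console.log(firstBuild === secondBuild); // false: same package, different hash
```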

I then resorted to using another package, removeNPMAbsolutePaths, in my build step to remove all of those useless metadata properties from every package.json file within node_modules. Redoing scenario 2 still yielded similar results, although far fewer assets were detected as “changed”. So this partially solved my problem, and for the majority of people it will be all that is needed.
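To make the idea concrete, here is a rough sketch of such a cleanup step using only the Node.js standard library. It is not the implementation of removeNPMAbsolutePaths; the function names are invented for illustration, and it assumes that dropping every top-level field starting with an underscore is acceptable for your build.

```typescript
import * as fs from "fs";
import * as path from "path";

// Remove every top-level field whose name starts with "_" from one package.json.
// Returns true when the file was rewritten.
function stripUnderscoreFields(pkgPath: string): boolean {
  let pkg: Record<string, unknown>;
  try {
    pkg = JSON.parse(fs.readFileSync(pkgPath, "utf8"));
  } catch {
    return false; // skip files that are not valid JSON (some test fixtures are not)
  }
  if (pkg === null || typeof pkg !== "object" || Array.isArray(pkg)) return false;

  let changed = false;
  for (const key of Object.keys(pkg)) {
    if (key.charAt(0) === "_") {
      delete pkg[key];
      changed = true;
    }
  }
  if (changed) fs.writeFileSync(pkgPath, JSON.stringify(pkg, null, 2) + "\n");
  return changed;
}

// Walk a directory tree and clean every package.json found in it.
// Returns the number of files that were rewritten.
function cleanTree(dir: string): number {
  let count = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) count += cleanTree(full);
    else if (entry.name === "package.json" && stripUnderscoreFields(full)) count += 1;
  }
  return count;
}
```

Running `cleanTree` on a node_modules directory after `npm install` would be the equivalent build step in this sketch.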

This behavior of npm adding non-deterministic metadata to the package.json is not an issue for some. But it is for those who generate a hash after doing an npm install on build systems like CodeBuild and then hand the assets over to the CDK, as is evident from this issue, https://github.com/npm/npm/issues/10393, which has been open for some time now. The cherry on top is that those metadata fields are reserved for future use, so they aren’t even being used.

The investigation continued for the few assets that were still generating different hashes, even after removing the useless metadata from the package.json files. One of my Lambdas was using html-minifier, which has a dependency on the he package. After installation, this package adds an extra property to its package.json file: the path to the `man` file, which again is a directory-dependent value.

Other non-deterministic properties written to the package.json after installation.

At this point I came to the conclusion that the only solution is to remove every package.json file within the node_modules directory after doing a build. Or similarly, supply my own hash to the CDK that excludes the contents of the node_modules for every asset that it touches. I went with the latter as it just felt like the cleaner of the two options.

There are of course other solutions, like just doing the hash on the source before doing the build process. That wouldn’t have worked for me as I was copying files into the Lambda directories during build time, specifically transpiling MJML email templates to HTML.

Solution

In my build process I use the folder-hash package again. This time I iterate over a list of the CDK asset directories and generate a hash of each one's contents, excluding the node_modules folder. I then store a JSON file that contains the path of each directory and its associated hash. In my CDK code, I look the hash up by the asset path and assign it manually.

Part of the build process to generate the hash json file.
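As a stand-in for that build step, the following is a minimal sketch using only the Node.js standard library instead of folder-hash. The function names and the shape of the JSON file (asset directory mapped to hash) are assumptions made for illustration, not the exact code I run.

```typescript
import * as crypto from "crypto";
import * as fs from "fs";
import * as path from "path";

// Hash a directory's relative file paths and contents, skipping the named folders.
function hashDir(root: string, exclude: string[] = ["node_modules"]): string {
  const hash = crypto.createHash("sha256");
  const walk = (dir: string): void => {
    // Sort entries so the result does not depend on file system ordering.
    const entries = fs
      .readdirSync(dir, { withFileTypes: true })
      .sort((x, y) => (x.name < y.name ? -1 : x.name > y.name ? 1 : 0));
    for (const entry of entries) {
      if (entry.isDirectory() && exclude.indexOf(entry.name) !== -1) continue;
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        hash.update(path.relative(root, full)); // relative, so the build dir does not leak in
        hash.update(fs.readFileSync(full));
      }
    }
  };
  walk(root);
  return hash.digest("hex");
}

// Write a JSON file mapping each asset directory to its hash.
function writeAssetHashes(assetDirs: string[], outFile: string): Record<string, string> {
  const hashes: Record<string, string> = {};
  for (const dir of assetDirs) hashes[dir] = hashDir(dir);
  fs.writeFileSync(outFile, JSON.stringify(hashes, null, 2));
  return hashes;
}
```

One consequence of excluding node_modules is that a dependency change is only picked up if a file outside node_modules changes too, for example a package.json or lock file that lives in the asset directory.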
Part of the CDK code, having read the hash json file (generated above), reading a hash indexed by directory.
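The lookup side can be sketched as below. The file layout (a JSON object keyed by asset directory) and both function names are assumptions for illustration. The comment at the end shows where the value would go, using the `assetHash` asset option accepted by `lambda.Code.fromAsset`.

```typescript
import * as fs from "fs";

// Load the JSON file produced at build time: { "<asset dir>": "<hash>" }.
function loadAssetHashes(file: string): Record<string, string> {
  return JSON.parse(fs.readFileSync(file, "utf8"));
}

// Fail loudly when a hash is missing, rather than silently letting the CDK
// fall back to hashing the directory (node_modules included).
function lookupAssetHash(hashes: Record<string, string>, assetDir: string): string {
  const hash = hashes[assetDir];
  if (!hash) {
    throw new Error(`No precomputed hash for asset directory: ${assetDir}`);
  }
  return hash;
}

// In a stack, the looked-up value would then be supplied as the custom hash, e.g.:
//   lambda.Code.fromAsset(assetDir, { assetHash: lookupAssetHash(hashes, assetDir) })
```

Throwing on a missing entry is a deliberate choice in this sketch: a forgotten directory then breaks the build instead of quietly bringing back non-deterministic deployments.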

I believe this is one of many solutions; another might be to use local bundling. I have not tested this, but I think the CDK might generate the hash from the contents before the local bundle function runs. Alternatively, Docker bundling should probably also work, as the build process (npm install…) happens in the same directory every time. I haven’t explored either of these as I am happy with my method of manually specifying the hash.

This piece of knowledge isn’t only important for backend assets. It might also influence your frontend build process (depending on the framework), forcing new content to S3 and a CloudFront invalidation that might not be needed at all. It might also apply to other languages and frameworks.

TL;DR

It is the wild, wild west within the node_modules directory: it mutates after installation and is the cause of non-deterministic hashing.

npm, after installation, adds useless metadata fields (starting with an underscore, specifically _where) to each module's package.json, some of which are directory dependent. On top of that, every module can add its own properties to the package.json. It is these directory-dependent properties that are the root cause of non-deterministic hashes and deployments.