Rethinking the Sitecore Serialization Format: Rainbow Preview, part 1

If you’ve worked with Sitecore in a team setting for any length of time, you’ve probably had to deal with item serialization. Item serialization is nearly a requirement for working effectively when you need to share templates, renderings, placeholder settings, custom experience buttons, and all the other Sitecore items stored in the databases that are effectively development artifacts, as opposed to content. Because they are development artifacts, we want to keep them under source control so we can version them, develop feature branches with them, and deploy them using continuous integration to our shared Sitecore installations.

If you’ve dealt with Sitecore’s serialization format in a team environment for any length of time, you’ve probably also started to realize some of its shortcomings.

Let’s pick on multilist fields, for example. This is what a multilist looks like in the Sitecore serialization format (SSF):

----field----
field: {E391B526-D0C5-439D-803E-17512EAE6222}
name: Allowed Controls
key: allowed controls
content-length: 194

{E11BDB3B-1436-4059-90F6-DE2EE52A4EB4}|{D9C54253-37FF-4D64-8894-5373D8799361}|{F118E540-CC75-4AA9-A62B-D6ED9E6F77E4}|{A813194F-32F4-4501-A430-6602ABF73535}|{2F4ADF0B-9633-4EE9-B339-8CA32E2C3293}

Now let’s imagine using this in a team environment. Alice creates a new rendering and needs to add it to placeholder settings so it can be used in Experience Editor. Meanwhile, on another branch, Bob adds a different rendering he’s made to the same placeholder settings. What happens? A merge conflict. And not a simple, easy-to-resolve conflict either: one where a very long single line has to be merged by hand, because merging is line-oriented. On top of that, you must not forget to recalculate the content-length. Oh joy.

Then let’s take a look at the data that’s stored. Do we need key to load the field? Do we even need name? Nope - having a name around makes the file easier for humans to understand, but we certainly don’t need two of them. And then there’s the fact that the format is endline-sensitive: don’t leave home in Git without a special .gitattributes rule to leave those line endings alone, or the files won’t be readable by the Sitecore APIs.
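Something along these lines in .gitattributes does the trick, assuming serialized items use the default .item extension:

# keep Git from normalizing line endings in serialized Sitecore items
*.item -text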

Here’s another gremlin about serialization: it stores fields that aren’t important for version-controlled development artifacts. Yes, I’m talking about you, __Revision, __Updated by, and __Updated. Certain Sitecore tools - say, the template editor - cause a ton of item saves, and they aren’t picky about whether any actual data fields have changed. This means that if I add a Foo field to my Bar template, the last-updated timestamp on Bar, Foo, and every one of Bar‘s fields gets changed. Even when there is an actual data change, and that change is auto-mergeable, you can still get conflicts on the statistics fields. Welcome to annoying merge conflict city, folks!

Part I: The JSON era

I’ve been at this a while. Version 1 is largely on the junk pile. Why? JSON.

JSON seemed like an obvious candidate format: it’s quick to parse, mature, and easy as pie to implement with JSON.NET and a few POCOs. In fact, its performance is quite similar to ye olde content-length above. It was certainly a step up, and I was not the only person to think of this idea. I learned a lot and got many ideas from Robin’s prototype, not the least of which was that we should reformat field values to make merging easier.
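For a sense of how little code that approach takes, here’s a minimal sketch using JSON.NET. The POCO shapes here are illustrative, not the actual version 1 object model:

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Illustrative POCOs - not the actual version 1 object model
public class SerializedField
{
    public Guid Id { get; set; }
    public string Value { get; set; }
}

public class SerializedItem
{
    public Guid Id { get; set; }
    public string Path { get; set; }
    public List<SerializedField> SharedFields { get; set; }
}

public static class JsonItemStore
{
    // one call each way; JSON.NET does the heavy lifting
    public static string Write(SerializedItem item)
    {
        return JsonConvert.SerializeObject(item, Formatting.Indented);
    }

    public static SerializedItem Read(string json)
    {
        return JsonConvert.DeserializeObject<SerializedItem>(json);
    }
}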

Oddly enough, it was the idea of field reformatting that made JSON an untenable format. The problem is that merging is line-oriented - so we would want to reformat that multilist from the first example with one GUID per line, such that merge tools could make sense of it. But JSON does not allow literal newlines in data values; instead it requires the escape sequences \r and/or \n. Unhelpful for merging - though it does make parsing really fast.
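In other words, the best a JSON version of that multilist can do is keep the newlines as escape sequences inside a single physical line - no help at all to a line-oriented merge tool:

{ "Value": "{E11BDB3B-1436-4059-90F6-DE2EE52A4EB4}\r\n{D9C54253-37FF-4D64-8894-5373D8799361}" }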

Part II: YAML Ain’t Markup Language

With JSON in the bin, I started poking around for existing formats that would support our needs. YAML (the acronym is the title of this section) fit the bill nicely. YAML is designed to be a more human-readable superset of JSON for things like configuration files and object persistence. Literal multi-line values, human readability, lists and nesting - nice.

The downside is that, because of the flexibility of the format, YAML parsers are on the order of 10-100x slower than JSON parsers. It was slooooow. But the good news was that the YAML-based serialization format I had designed was much, much simpler than the full YAML specification. So I wrote my own reader and writer that supported only the subset of YAML that was necessary. It was fast. The format was easy to read and understand. It had the ability to add field formatters to make values mergeable (at present, for multilists and layout fields). I wanted to start using it pretty badly in real projects :)
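To give a flavor of how simple the subset is, here’s a heavily simplified writer sketch - not the actual implementation, just the two rules the format needs: key/value pairs at an indent, and pipe-prefixed multiline values:

using System;
using System.IO;

// Simplified sketch of a YAML-subset writer (not the real Rainbow code)
public class YamlSubsetWriter
{
    private readonly TextWriter _writer;

    public YamlSubsetWriter(TextWriter writer)
    {
        _writer = writer;
    }

    public void WritePair(int indent, string key, string value)
    {
        var pad = new string(' ', indent);
        if (value.Contains("\n"))
        {
            // multiline value: emit "Key: |", then each line indented two further spaces
            _writer.WriteLine("{0}{1}: |", pad, key);
            foreach (var line in value.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None))
                _writer.WriteLine("{0}  {1}", pad, line);
        }
        else
        {
            _writer.WriteLine("{0}{1}: {2}", pad, key, value);
        }
    }

    public void WriteComment(int indent, string comment)
    {
        // e.g. the human-readable field names in the example below
        _writer.WriteLine("{0}# {1}", new string(' ', indent), comment);
    }
}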

So without further ado, here is the same item that we took the multilist from above, but in YAML:

---
ID: 38ddd69e-fb0a-4970-926e-dfb0e5b9a5e1
Parent: 68e4c671-797d-4a89-8fa6-775926f1381d
Template: 5c547d4e-7111-4995-95b0-6b561751bf2e
Path: /sitecore/layout/Placeholder Settings/reductio/ad/absurdum
SharedFields:
- ID: 7256bdab-1fd2-49dd-b205-cb4873d2917c
  # Placeholder Key
  Value: heading
- ID: e391b526-d0c5-439d-803e-17512eae6222
  # Allowed Controls
  Type: TreelistEx
  Value: |
    {E11BDB3B-1436-4059-90F6-DE2EE52A4EB4}
    {D9C54253-37FF-4D64-8894-5373D8799361}
    {F118E540-CC75-4AA9-A62B-D6ED9E6F77E4}
    {A813194F-32F4-4501-A430-6602ABF73535}
    {2F4ADF0B-9633-4EE9-B339-8CA32E2C3293}
Languages:
- Language: en
  Versions:
  - Version: 1
    Fields:
    - ID: 25bed78c-4957-4165-998a-ca1b52f67497
      # __Created
      Value: 20100310T143300
    - ID: 52807595-0f8f-4b20-8d2a-cb71d28c6103
      # __Owner
      Value: sitecore\admin
    - ID: 5dd74568-4d4b-44c1-b513-0af5f4cda34f
      # __Created by
      Value: sitecore\admin
    - ID: 87871ff5-1965-46d6-884f-01d6a0b9c4c1
      # Description
      Value: <p>The heading of a page, above any content renderings.</p>

Notice how the fields’ names - nonessential data - are in YAML comments: present for humans to read, not necessary for the machine. The Allowed Controls TreelistEx field is also an example of the YAML multiline format: the value starts with a pipe, and the data follows on new lines, indented further. YAML uses significant whitespace to define structure, which keeps it easy to read and avoids hacks like content-length to parse efficiently.

You may notice that only the Allowed Controls field has a field type value. That’s because its value was reformatted with a FieldFormatter, so when deserializing, the stored type determines which formatter should “unformat” the value back into what Sitecore expects.
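Conceptually, a field formatter is a reversible transform selected by field type. Here’s a hypothetical sketch of that idea - the real Rainbow API may well differ:

using System;

// Hypothetical formatter shape - the real Rainbow API may differ
public interface IFieldFormatter
{
    bool CanFormat(string fieldType);
    string Format(string sitecoreValue);  // storage: make it mergeable
    string Unformat(string storedValue);  // load: back to Sitecore's format
}

public class MultilistFormatter : IFieldFormatter
{
    public bool CanFormat(string fieldType)
    {
        return fieldType == "Multilist" || fieldType == "TreelistEx";
    }

    public string Format(string sitecoreValue)
    {
        // pipe-delimited GUIDs -> one GUID per line for merge tools
        return sitecoreValue.Replace("|", Environment.NewLine);
    }

    public string Unformat(string storedValue)
    {
        // newline-delimited GUIDs -> pipe-delimited for the Sitecore API
        return string.Join("|", storedValue.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries));
    }
}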

The way languages are structured is also slightly different here: each language is a top-level list item, and all of its versions are grouped hierarchically beneath it. In this format, language is a construct in its own right, rather than the flat “da-DK #1” pairing the Sitecore database uses.

We’ve also used field-level ignores to avoid storing the constantly changing statistics fields (e.g. __Updated) at all. This is optional, but it does lead to wonderfully compact files on disk.
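Mechanically, an ignore can be as simple as a predicate over field names. A sketch, using the statistics fields named earlier (the API shape here is invented for illustration):

using System;
using System.Collections.Generic;

// Invented illustration of a field-level ignore - not the real configuration API
public static class FieldFilter
{
    private static readonly HashSet<string> Ignored =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "__Revision", "__Updated", "__Updated by"
        };

    public static bool ShouldSerialize(string fieldName)
    {
        return !Ignored.Contains(fieldName);
    }
}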

The YAML format is also completely endline-agnostic - it doesn’t care whether it gets \n or \r\n.

Yes, please?

Well, sorry - it’s not quite done yet, though the YAML part is pretty stable. The YAML format is a component of my upcoming Rainbow library. Rainbow is essentially a modernized serialization API that aggressively improves both the serialization format and the storage hierarchy, and provides deep item comparison capabilities.

Rainbow aims to be only an API: it has no default configuration or frontend. It will be freely available to use.

I have several projects that I intend to use Rainbow for - maybe you can think of some uses too?

  • Unicorn 3.0, being developed alongside it, will use Rainbow for storage, comparison, and formatting needs.
  • Rainbow for SPE will enable serializing and deserializing items in YAML using Sitecore PowerShell Extensions

At present, Rainbow is alpha-quality. I wouldn’t use it unless you want to get your hands dirty, and I might change APIs as needed.

In the next post of this series, I’ll go over the thinking behind improvements to the standard storage hierarchy that enable infinite depth while remaining human readable. But there are still some bugs to fix before I write that ;)

Transparent media optimization with Dianoga 2

I’ve pushed a significant update to Dianoga, my framework to transparently optimize the size of media library images, including rescaled sizes.

My original post on Dianoga 1.0 explains in detail why you’d want to use it, but as an overview: how does saving 5-40% overall on image bandwidth sound? All you have to do is install a NuGet package that adds one DLL and one config file to your solution. That’s it. You can also find documentation on GitHub.
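From the NuGet Package Manager Console, that install is a one-liner (the package id is Dianoga):

PM> Install-Package Dianoga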

What’s new in 2.0

Dianoga 2 is faster and more extensible. The key change is that media is now optimized asynchronously, resulting in near-zero impact on site performance, even for the first hit. Dianoga 1.0 ran its optimization in the getMediaStream pipeline, which meant the optimization had to complete while the browser was waiting for the image. For a large header banner, that could take a couple of seconds on slower hardware (for the first hit only). Dianoga 2 instead replaces the MediaCache and performs the optimization asynchronously, after the image has been cached. This does mean the very first hit receives the unoptimized image, but once the cached copy has been optimized, subsequent requests get the smaller version.
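In sketch form, the pattern looks something like this - not Dianoga’s actual source, and the names are invented, just the shape of the idea: serve and cache the original immediately, then queue a background optimization that replaces the cached copy.

using System;
using System.Threading.Tasks;

// Illustration of the async-after-cache idea only - not Dianoga's real code
public class AsyncOptimizingCache
{
    public void Add(string cacheKey, byte[] media)
    {
        // cache the unoptimized media so the first hit is served immediately
        Store(cacheKey, media);

        // optimize in the background; later requests get the smaller version
        Task.Run(() =>
        {
            var optimized = Optimize(media);
            if (optimized.Length < media.Length)
                Store(cacheKey, optimized);
        });
    }

    private void Store(string cacheKey, byte[] media)
    {
        /* write the bytes into the media cache */
    }

    private byte[] Optimize(byte[] media)
    {
        /* run the configured optimizer tools */
        return media;
    }
}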

Extensibility has also improved in Dianoga 2. Previously, a lot of things were hard-coded, such as the path to the optimization tools and the set of optimizers themselves. Now you can add and remove optimizers in the config (including adding your own - say, a PNG quantizer), and move the tools around if desired.
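As an illustration of the kind of patch that enables - the pipeline and element names here are hypothetical, so check the shipped Dianoga.config for the real syntax before copying anything:

<!-- Hypothetical patch shape only; see Dianoga.config for the actual names -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <dianogaOptimize>
        <processor type="MySite.Media.MyPngQuantizer, MySite" />
      </dianogaOptimize>
    </pipelines>
  </sitecore>
</configuration>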

There was a bug in Dianoga 1.0 that caused the tools DLL to become locked, which could cause problems for deployment tools. This has been fixed in 2.0 by dynamically loading and unloading the DLL as needed. Thanks to Markus Ullmark for the issue report.

Upgrade

Upgrading from Dianoga 1.0 to 2.0 is fairly simple: upgrade the NuGet package. You can either overwrite your Dianoga.config file with the new version (recommended), or merge your customizations with the latest config from GitHub.

Have fun!

Sitecore Azure Role Sizing Guide

It seems like everyone is hot to get Sitecore on Azure these days, but there’s not a whole lot of guidance out there on how Sitecore scales effectively in an Azure environment. Since I get asked the “what size instances do we need?” or “how much will Azure cost us?” question a lot, I figured some testing was in order.

The environment I tested is a real customer website on Sitecore 7.2 Update-4 (not yet launched, but in late beta). The site has been reasonably optimized with normal practices - namely, appropriate application of output caching and in-memory object caching. The site scores a B on WebPageTest - due to things out of my control - and is in decent shape. We are deploying to Azure Cloud Services (PaaS) using custom PowerShell scripting - not the Sitecore Azure module (we have had suboptimal experiences with the module). We’re using SQL Azure as the database backend, and the Azure Cache session provider. Note that the Azure Cache session provider is not supported for xDB (7.5/8) scenarios.

The Tests

I wanted to determine the effects of changing various cloud scaling options relative to real world performance, but I also didn’t want to spend days benchmarking. So I settled on two test scenarios:

  • Frontend performance. This is a JMeter test that hits the site with 10,000 requests to the home page, using 1,000 threads at once. The idea is to beat the server into submission and see how it does. (An example command follows this list.)
  • Backend performance. I chose Lucene search index rebuilding, as that is a taxing process on both the database and server CPUs, and it conveniently reports how long it took to run. Note that you should probably be using Solr for any xDB scenarios or any search-heavy sites.
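For the curious, the frontend test is just JMeter in non-GUI mode pointed at a test plan - something like the following, where sitecore-load.jmx is a stand-in for whatever your plan file is called:

jmeter -n -t sitecore-load.jmx -l results.jtl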

These are not supposed to be ‘end of story’ benchmarks. There are no doubt many bones to pick with them, and they’re certainly not comprehensive. But they do paint a decent picture of overall scaling potential.

Testing Environment

The base testing environment consists of an Azure Cloud Service running two A3 (4-core, 7GB RAM, HDD storage) web roles. The databases (core, master, web, analytics) are SQL Azure S3 instances running on a SQL Azure v12 (latest) server.

Several role variants were tested:

  • 2x A3 roles: 4 cores, 7GB RAM, HDD local storage (~$268/mo each)
  • 2x D2 roles: 2 cores (D-series cores are ‘60% faster’ per Microsoft), 7GB RAM, SSD local storage (~$254/mo each)
  • 2x D3 roles: 4 cores, 14GB RAM, SSD local storage (~$509/mo each)
  • 2x D4 roles: 8 cores, 28GB RAM, SSD local storage (~$1018/mo each)

In addition, I tested the effect of changing the SQL Azure sizes on the backend with the D4 roles. I did not test the frontend in these cases, because that seemed to scale well on the S3s already.

  • S3 databases, which are 100DTU “standard” (~$150/mo each)
  • P1 databases, which are 125DTU “premium” (~$465/mo each)
  • P2 databases, which are 250DTU “premium” (~$930/mo each)

Results

Our first set of tests compares the sizes of the web roles in Azure. It is quite likely that these results would also be loosely applicable to IaaS virtual machine deployments.

Throughput

The throughput graph is about as expected: add more cores, get more scaling. The one surprise is that the D2 instance, with 4 total cores at a higher clock and SSD disk, is able to match the A3 with 8 total cores (D2 and A3 cost a similar amount).

Keep in mind that all of these were using the same SQL Azure S3 databases as the backend - for general frontend service, especially with output caching, Sitecore is extremely database-efficient and does not bottleneck on lower grade databases even with 16 cores serving.

Note that I strongly suspect the bandwidth between myself and Azure was the bottleneck on the D4 results, as I saw throughput burst up to more like 500 requests/sec during the main portion of the test.

Latency

Latency continues the interesting battle between the A3 and the D2. The A3 pulls out a minor victory on average - within the margin of error - but the D2’s SSD gives it a convincing win on performance consistency, with its 95th percentile line over one second faster. Given that the A3 and D2 cost a similar amount, the D2 seems like a good choice for budget-conscious hosting - but if you can afford Sitecore, you should probably stretch your licenses to more cores per instance.

The D3 is probably the ideal general purpose choice for your web roles.

The latency numbers on the monstrous 8-core D4 instances tell the tale that my 1,000 threads and puny 100Mb/s bandwidth were just making the servers laugh.

These graphs illustrate the latency consistency you gain from a D-series role with SSD:

These two graphs plot latency over time during the JMeter test. You can see that the top one (A3) is much less consistent thanks to its HDDs, whereas the lower one (D3) is smoother, indicating fewer latency spikes. Latency decreases in the final stage of the test because some JMeter threads finished their 10-request runs earlier than others, leaving less load on the server.

Bandwidth

Bandwidth is another measure of how much data is being pushed out. Once again, the smaller-than-expected gain from the D4 was due to my downstream connection becoming saturated as the monstrous servers laughed.

Backend: Rebuild all indexes

For this test, I rebuilt the default sitecore_core_index, sitecore_master_index, and sitecore_web_index indices.
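For reference, a rebuild can be kicked off through the ContentSearch API - roughly like this, which should be equivalent to the control panel’s rebuild button:

using Sitecore.ContentSearch;

public static class IndexRebuilder
{
    // rebuild the three default indexes used in the test;
    // requires a Sitecore context (e.g. an admin page)
    public static void RebuildAll()
    {
        var names = new[] { "sitecore_core_index", "sitecore_master_index", "sitecore_web_index" };
        foreach (var name in names)
        {
            ContentSearchManager.GetIndex(name).Rebuild();
        }
    }
}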

Here we see the limitations of the S3 databases creep in during the D4 testing. Up through the D3, rebuild time appears limited by CPU speed, but the D4 is actually slower until we both raise the maximum index rebuild parallelism and move to P2 databases to feed the monstrous 8-core beast.
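That parallelism knob is a Sitecore setting - ContentSearch.ParallelIndexing.MaxThreadLimit in Sitecore.ContentSearch.config, if memory serves; verify the name against your version - and it can be raised with an ordinary config patch:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <setting name="ContentSearch.ParallelIndexing.MaxThreadLimit">
        <patch:attribute name="value">16</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>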

Note that with SQL Azure the charge is per database - so if you wanted to kit out core, master, and web on P2s, you’d be paying nearly $3,000/month just for the DBs. At that point it might be worth considering IaaS SQL, or looking at Elastic Database Pools to spread around the bursty load of many databases.

Final Words

Overall, Sitecore is a very CPU-intensive application, and you can generally get away with saving a few bucks on the databases in favor of more compute, especially for content delivery. As in most applications, SSDs make a significant performance difference - to the point where I would suggest Azure D3 instances for nearly all normal Sitecore-on-Azure deployments, unless you need Real Ultimate Power from a D4, or, for the truly ridiculous, a 16-core D14 ;)

For general frontend scaling, SQL Azure S3 seems to be the best price/performance until you go past quad cores or past two web roles (one would assume 4x D3 would perform relatively similarly to 2x D4). Once you have 16 cores serving the frontend, you’ll be bottlenecked below P2 databases, at least for index rebuilding. Publishing probably has similar characteristics.

As always when designing for scale, the first step should be to optimize your code and caching. For example, before I did any caching optimizations, I ran the same JMeter test on the site: the A3s put out only 40 requests/sec. After output caching and tweaks, they did 144 requests/sec. That’s not to say the site was slow pre-optimization - TTFB was about 220ms without load - but once you add load, even small optimizations make a huge difference.