Unicorn 3.0.2 Released

Unicorn 3.0.2 (and Rainbow 1.0.1) have been released to NuGet.

This release includes some critical fixes to the serialization system, and upgrading is highly recommended for all users of Unicorn 3.

Improvements

  • Unicorn auto-publish now reports its activities to the sync console, and operates synchronously to guarantee published state for automated builds.
  • Guidance on some Transparent Sync pitfalls has been added to the wiki.
  • The new Rainbow.SFS.ExtraInvalidFilenameCharacters setting enables TFS users to successfully serialize branch templates which begin in $, a reserved character by TFS. The Rainbow.TFS.config.example file is now shipped with Rainbow to illustrate its usage for TFS.
  • An example configuration file to demonstrate how to move your serialized items elsewhere has been added.

Fixes

  • CRITICAL Saving item versions that were not in the default Sitecore language was not updating the serialized item. This has been fixed. #76
  • CRITICAL Serialization of significant empty field values, such as unchecked checkboxes when the standard value is checked, has been fixed. #74
  • Items that are restored from the Sitecore recycle bin or archive to Unicorn-controlled locations now result in the item being serialized on restore. #72
  • Resetting a field to standard values now correctly results in the field value being removed from the serialized item. #74
  • Query string order is always respected when syncing multiple configurations at once using the control panel.
  • The serialization conflict dialog (if disk doesn’t match base item data on save) now formats correctly using HTML for Sitecore 8+
  • Using the Dump Item and Dump Tree commands in Sitecore to reserialize partial item trees now correctly serializes an item to all configurations that it may belong to, instead of just the first one.

Upgrading

Upgrading from Unicorn 3.x is as simple as updating your NuGet packages. If it prompts to overwrite any changed config files, I’d suggest doing so and using a diff with your source control to resolve differences.

If you suspect that you have serialized items that may be missing significant empty values, you should reserialize those after upgrading to get their values normalized.

The Credits

Unicorn isn’t just my project. Thanks a lot to all the community members who reported bugs, sent pull requests, and diagnosed issues to help make Unicorn better.

Unicorn 3.0.2 is brought to you by…

Robin Hermanussen
Thomas Eldblom
Nathanael Mann
@GeorgeKarbyn
Kevin Williams
@amitthakur2014
@WizX20
@kamsar
…and the rest of #unicorn on Sitecore Slack

As usual, if you run into trouble you can submit an issue on GitHub, hit me up on Sitecore Slack, or tweet at me.

Unicorn 3.0.1 Released

The first maintenance release for Unicorn 3.x has been released to NuGet. Unicorn 3.0.1 brings with it a small set of bug fixes and improvements:

  • Data provider config patch is compatible with Sitecore Commerce and other modules that add data providers (thanks @jonnekats!)
  • Fixed a bug in Transparent Sync where syncing an item based on a template that only existed in transparent sync (and not the database) would act like the template did not exist. Ironically this was fixed in 3.0.0 but some n00b forgot to merge the config to activate it! (thanks Kevin Williams!)
  • Fixed a race condition that could cause SFS trees to incorrectly reinitialize during a reserialize operation and cause an error (thanks @akshaysura13!)
  • Content editor warnings have been improved:
    • The name of the Unicorn configuration that includes the item is now shown so it’s easy to trace back why something is included.
    • For transparent sync items, there is an indication of whether the item is on disk only or both on disk and in the database as well. Note that items in both places may well have different field values, there is no equality check.

Due to the fix to Transparent Sync, this update is recommended for anyone using that feature.

Have fun!

Unicorn 3.0 Released

patrick you're a bloody genius

It’s been over a year since the last major release of Unicorn, but I haven’t been bored. Today I’m happy to announce the release of Unicorn 3.0 and Rainbow 1.0.

Unicorn 101

Sitecore development artifacts are both code and database items, such as rendering code and rendering items. As developers, we use serialization to write our database artifacts into source control along with our code so that we have a record of all our development artifacts. Unicorn is a tool to make serializing items and sharing them across teams and deployed environments easy and fun.

Unicorn takes advantage of keeping serialized items always up to date on the disk, which allows all merging to take place using your source control tool. Modern source control tools such as Git or SVN are capable of automatically merging the vast majority of item changes, which saves you time and reduces the chance of human error when merging.

Unicorn 3 is fresh off the compiler and brings with it a huge raft of improvements. This version brings its own serialization format and filesystem hierarchy that are far more friendly to source control and merge conflicts than the default Sitecore format. It’s also ridiculously fast - about 50% faster overall than Unicorn 2 or Sitecore serialization APIs.

What else is new?

In fact there are lots more new features added to Unicorn 3 since the what’s new blog post was written. Not just minor details either! Read on…

Transparent Sync

This is a feature that demanded its own blog post! Transparent Sync enables Unicorn to sync the serialized items in real time, completely automatically. It does this by using its data provider to directly read the serialized items and inject them into the Sitecore content tree. The items on disk are the items in Sitecore: it bypasses the database entirely for transparently synced items.

Seriously, it’s magic. Transparent Sync might be the best feature of Unicorn 3. You should give it a try!

New Configuration Architecture

Previously Unicorn was distributed with its default example configuration directly in Unicorn.config. This was suboptimal because it encouraged modifying the stock config file and making future upgrades more of a merging challenge than they needed to be. In the new order, Unicorn ships without any configurations defined. There are two example configuration-adding patch files distributed with the NuGet package which you can duplicate and edit to your desires, and the README distributed with the NuGet package details what you need to do to get started.

Control Panel UX Improvements

The control panel has been redesigned to have an improved UX when there are many configurations defined. In the old UI, once more than a couple configurations were defined the page would need to scroll. The new UI reduces the vertical space for each configuration significantly:

The new control panel also has an experience for selecting only the configurations you wish to sync in a single batch:

For automated builds, you can also copy the URL on the sync button when you’ve got multiple configurations selected in the control panel to run the same batch at build time. This is great if you’ve got some configurations that you may not wish to sync during a CI build.

New Items Only Evaluator

This is an evaluator that changes the behavior of syncing to only write new items from serialization into Sitecore. Existing items with the same ID or items that are not serialized are left alone. This can be useful for example if you’re wanting to push some developer-initiated content items up, like metadata or lookup source items, that you want to always exist but once they have been created they become the content editor’s to do with as they will. A sample configuration that enables the NIO evaluator ships with the NuGet package. Thanks to Nathanael Mann for the feature request.

Exclude all children with the predicate

To exclude all children of an item simply add a trailing slash to the <exclude> like so:

<include database="master" path="/sitecore/content">
    <exclude path="/sitecore/content/" />
</include>

Visual Studio Control Panel Support

Recently a plugin for Visual Studio was released that enables you to run Unicorn syncs directly within Visual Studio. Unicorn 3 supports this tool out of the box, by simply enabling the Unicorn.Remote.config.disabled patch file.

The Unicorn Control Panel for Visual Studio requires VS2013 or VS2015. It is developed by Andrii Snihyr - thank you for the community support!

TFS Support

Fellow Sitecore MVP and Connective DX’er Dave Peterson has written up a plugin for Rainbow/Unicorn that integrates it with TFS. TFS 2010 (or server workspaces in 2012 and later) have the unfortunate limitation that all files are locked by default. This interferes with Unicorn’s operation as it writes files as items are changed in Sitecore, resulting in errors.

The TFS integration hooks the TFS API to Rainbow’s SFS data store, causing it to actually check out the file Unicorn is about to write or delete before acting. You must use a 32-bit IIS app pool when using the plugin, as the TFS API is 32-bit only.

The TFS plugin is available on NuGet; for installation instructions and the latest info see the README on GitHub

Documentation

The documentation in the README, comments in the stock config files, and verbiage in the actual control panel and logs have all been reviewed and improved. If you get confused by anything, let me know and I’ll make the docs better or the error message clearer in the next release.

Thank you

Unicorn is a project by and for the Sitecore community. I’d like to thank everyone who’s contributed to the project in a major or minor way: without you, we wouldn’t have this tool today. Thank you!

Introducing Transparent Sync in Unicorn 3

A couple years ago Alex Shyba told me about an idea he had: what if we used a data provider to map serialized items directly into the content tree, bypassing the database and eliminating the need to sync? I was like

I went nuts and wrote a prototype that did exactly that.

The prototype, nicknamed Rhino because it has a prominent horn like certain other mythical beasts, actually worked fairly well. Unfortunately there are two hard problems in computer science: naming, cache invalidation, and off by one errors. Cache invalidation, specifically using the FileSystemWatcher to observe file changes by source control, was unreliable. Because of how core serialization is to Sitecore development practice, unreliability is not acceptable. Reluctantly, I shelved Rhino and worked on Unicorn 2 instead.

unicorn logo

The idea of Rhino stuck around. The improvements brought around in Rainbow, such as partial item reading and tighter control around storage, enabled working around the limitations that had precluded Rhino from being useful in production. Thus the idea returns as Transparent Sync, which might just be the best part of Unicorn 3.

Transparent Sync enables Unicorn to sync serialized items in real time, completely automatically. It does this by using its data provider to directly read the serialized items and inject them into the Sitecore content tree. The items on disk are the items in Sitecore: it bypasses the database entirely for transparently synced items. Changes made on disk update nearly instantly in the Sitecore editing interfaces.

Imagine a scenario where a development team is working with a feature-branch driven workflow, like GitHub Flow. In order to perform a code review when using Transparent Sync, you merely checkout the feature branch under review and your Sitecore is immediately configured with the items that the feature includes.

t always do code reviews

When should we use Transparent Sync?

Transparent Sync is turned off by default because you should understand how it works before enabling it.

Transparent Sync is excellent for development artifacts like templates and rendering items, but it’s inappropriate if you’ve got hundreds of thousands of content items. At startup Unicorn must build an index of metadata, which involves reading the headers of each serialized file. If you have a SSD this penalty is pretty minimal, but a traditional hard drive not so much. In testing I enabled transparent sync for the whole default core and master database (19,228 items) using a SSD and noted about 100ms increase in startup times on average.

Because Transparent Sync bypasses the normal sync process, transparent sync also bypasses anything that is hooked to sync. This would include things like custom evaluators (like NewItemsOnlyEvaluator) and the sync event pipelines. If you are relying on these customizations, turn transparent sync off for the configurations that use them.

How do we use Transparent Sync?

Note: you must perform an initial serialization of a configuration with Transparent Sync off before you enable it. Otherwise the items in the configuration will seem to disappear as transparent sync shows all zero items that are on disk!

Turning transparent sync on is really easy: take the <configuration> you want to add transparent sync to and put this line in it:

<dataProviderConfiguration enableTransparentSync="true" type="Unicorn.Data.DataProvider.DefaultUnicornDataProviderConfiguration, Unicorn" />

You can also change the setting in the global <defaults> if you want to change the default setting of transparent sync for all configurations.

Once Transparent Sync is enabled all you have to do is change items on disk and the updates will immediately appear within Sitecore.

Going to production

In production we usually remove the Unicorn data provider as it’s not normally required. However without the Unicorn data provider, transparently synced items disappear: they aren’t really in the database at all, they’re on disk. There are two approaches to solve this problem:

  • Transparent Synced configurations can sync in the traditional way. This will persist the disk-based items permanently in the database.
  • Keep the data provider enabled in production. This is appealing because it means you can just deploy files to production and be done with it - and rollback in the same way. Be aware that if the IIS app pool identity cannot write to the serialized items you will be unable to make any ‘emergency’ changes to the items in Sitecore.

What happens if I reserialize a Transparent Sync configuration?

The database is used for all reserialize operations. In a Transparent Sync configuration, it is common for items to NOT reside in the database at all. If you reserialize this configuration it will be reset to what is in the database, thus deleting any Transparent Sync items that are not already in the database. Similarly if you use ‘Dump Item’ to do a partial reserialize on a Transparent Sync item it will be reverted to what is in the database, which may well be ‘nothing’.

There are warnings in the control panel if you reserialize a Transparent Sync configuration.

Anything else?

I hope you have as much fun with Transparent Sync as I had making it!

Unicorn 3: What's new?

Now that I’ve finished being confusing with my last three posts about Rainbow, let’s talk about Unicorn 3.

Unicorn 3 is the product of a year of thought and implementation. The design goal was nothing less than fixing every annoyance or issue that we ran across in daily usage of Unicorn 2. So what’s new?

New serialization format

High on the list of annoyances with daily Unicorn usage was the difficulty of resolving merge conflicts in the Sitecore serialization format.

The only fix for this was to make a better format: human readable, easily mergeable, and fixing the daily annoyances of the Sitecore format. Because no tool is an island, the new serialization tools Unicorn 3 uses have been split off into their own library: Rainbow. Rainbow enables others to use the new serialization tools developed for Unicorn 3, without depending on Unicorn.

The details of the new format and why it exists can be found in part 1 of my Rainbow blog series.

New file organization

Alongside designing the new serialization format, another problem was to resolve the longstanding limitations and bugs in the way Sitecore’s built in serialization stores its file hierarchy. This was worth a whole blog post in and of itself, but in summary I think those limitations have been eliminated.

The new hierarchy also alters the way the storage tree operates. Whereas Sitecore’s model represents an entire database, Unicorn 3 represents a set of subtrees of the database. Depending on the depth of the root paths, this can result in much shorter filesystem paths and fewer over-length path loops.

Improved user experience

Improved console messaging

Unicorn 3 has greatly improved messaging, both reducing extra output that isn’t necessary as well as vastly better error formatting. This hilarity was a constant annoyance if something went wrong in Unicorn 2 (in this case, invalid serialization format):

Unicorn 3’s version of the same error:

Editor warnings for Unicorn-controlled items

Another common issue is having someone edit an item that Unicorn controls on a deployed server, and having their changes overwritten by the next automated deployment. People asked for a way they could know if an item was “Unicorned” or not. Well, now there is one:

The message shown changes depending on the environment as well; this is how it looks when “production mode” is on:

And also, if production mode is enabled if you attempt to save the item:

Unicorn-enabled serialize commands

Ever used these handy tools on the by-default-hidden Developer tab of the Sitecore ribbon?

Guess what happens if you use these commands on an item Unicorn 3 controls? You guessed it: you can do partial syncing and partial reserialization using these commands. In Unicorn terms, there is no difference between “update” and “revert”: both just mean “sync.”

Note: Using the commands on non-Unicorn items results in their default behaviour. Also “update database” and “revert database” do not interface with Unicorn at all, and perform their default actions.

Performance: 50% more of it

Unicorn 3 has had significant performance profiling applied to it, and is about 10-20% faster than Unicorn 2 or stock Sitecore serialization. On top of that, multithreading is utilized to further increase sync and reserialize speed. Between threading and optimizations, it’s generally about 50% faster than Unicorn 2. For maximum performance, using SSD storage is highly recommended as it has a major impact on sync speed.

Auto-publish synced Items

Unicorn 3 enables efficient auto-publishing of items that a sync has modified, using manual publish queue injection. This option is enabled by default by the Unicorn.AutoPublish.config file. You can remove this file to disable the feature if you don’t like it.

Deleted template field handling

In Unicorn 2, deleting template fields could cause havoc. If any other serialized item contained a value for the deleted field, and you failed to reserialize that item after deleting the field, your next sync would be greeted with a big ugly error and you’d have to go manually remove the offending field from the file.

Unicorn 3 fixes this issue, by making this error a warning instead. (We can’t just ignore it, because ignoring it would cause syncs that created template fields as well as values for that field to not load the values the first time)

Versioned to Shared field conversion

Converting a field between shared, versioned, and unversioned by syncing a template was not supported in Unicorn 2. The field itself would change, but the existing values for the field would not move to the correct database table in Sitecore. Unicorn 3 includes the necessary adaptation to migrate stored values when field sharing is changed by a sync.

Improved configuration scheme

Unicorn 2 used a single configuration file: Serialization.config. This was a bit confusing as parts of it were ideally removed when deployed to a CE or CD environment and that was not always obvious.

Version 3 addresses this situation by creating its own App_Config\Unicorn folder and splitting the configuration into files that can simply be deleted instead of modified. This folder contains:

  • Unicorn.config: Contains core configuration, e.g. predicates and dependency setup, can be left intact anywhere.
  • Unicorn.DataProvider.config: Attaches the Unicorn Data Provider, which handles automatic serialization of changed items to disk. Can be removed for any environment not collecting changes. (usually, any non-dev environment)
  • Unicorn.UI.config: Adds the Unicorn Control Panel, Partial Sync, editor warnings, and other end-user UI elements. Remove for CD environments.
  • Unicorn.AutoPublish.config: Causes Unicorn to automatically publish items that it changes during a sync. Remove if you don’t want this feature, or on CD environments.
  • Unicorn.Deployed.config.disabled: If this config is enabled, typically on a deployed CE instance, saving an item added to Unicorn results in a warning confirmation.

Each of these config files also has comments explaining the above as well as what the settings within do.

Technical Improvements

Unicorn 3 also includes numerous miscellaneous technical improvements.

  • No third party dependencies. (that means you, Ninject)
  • Improved default predicate configuration settings account for Sitecore 8 additions and common modules.
  • Added SerializationHelper, a handy class that lets you more easily invoke Unicorn operations such as Sync without a lot of setup, if you want to programmatically use Unicorn.
  • Added pipelines to hook to sync events. unicornSyncBegin occurs when a configuration sync begins. unicornSyncComplete occurs when a configuration sync ends. unicornSyncEnd occurs at the end of all syncs. (e.g. after all configurations have synced if batching > 1)
  • Thanks to Rainbow, the object model is unified (ALL items are IItemData both from Sitecore and serialized). Previously each had their own parallel models.

System Requirements

Unicorn 3 has been developed on Sitecore 7.2 Update-5, as well as Sitecore 8 Update-5. It should be compatible with all version of Sitecore 7 and 8.

Unicorn 3’s code should be easy to adapt to Sitecore 6.2-6.6, however there is no official support for it so you’d want to compile from source.

Ok, ok. Is it released yet?

No. But it is in an actively updated beta on NuGet that is fairly stable :)

Rainbow Part 3: What is Rainbow?

It may seem a bit weird that in part 3 of a series, we’re talking about what something is.

After all, shouldn’t that be part 1? Nope, this is iterative development :)

So, what is Rainbow? Rainbow is designed to be a complete replacement for the Sitecore serialization format and filesystem organization, as well as enabling cross-source item comparison. It is a pure code library that comes with no UI of any kind, that is designed to be used with other libraries that use its serialization services with their own UIs. Libraries that consume Rainbow - such as Unicorn - gain the ability to abstract themselves from serialization details. Libraries that extend Rainbow can add new serialization formats, new places to store serialized items, and new ways to organize them.

taste it

Rainbow Features

Universal Item Data and Data Stores

Rainbow implements a set of interfaces that wrap the structure of a Sitecore item - item, version, and field. These interfaces provide a universal language that all Rainbow data stores can implement against. You could get an IItemData from Sitecore and write it out to disk as a YAML formatted item. You could get an IItemData from a web service, and deserialize it into a Sitecore database. You could construct an IItemData programmatically, and serialize it to a Sitecore database. Implementations of IDataStore provide places to store item data. It’s completely universal, and everything Rainbow does revolves around these abstractions.

YAML-based serialization formatter improves the storage format for serialized items

  • This post goes into more detail about the hows and whys of the YAML serializer
  • The format is valid YAML. YAML is a language that is designed to store object graphs in a human readable fashion. It uses significant whitespace and indentation to denote data boundaries. Note: only a subset of the YAML spec is allowed for performance reasons.
  • Any type of endline support. Yes, even \r because one of the default Sitecore database items uses that in its text! No more .gitattributes needed.
  • No more Content-Length on fields that requires manual recalculation after merge conflicts
  • Multilists are stored multi-line so fewer conflicts can occur
  • XML fields (layout, rules) are stored pretty-printed so that fewer conflicts can occur
  • Customize how fields are stored when serialized yourself with Field Formatters

Serialization File System (SFS) storage hierarchy

  • This post goes into more detail about SFS
  • Human readable file hierarchy
  • Extremely long item name support
  • Unlimited path length support
  • Supports non-unique paths (two items under the same parent with the same name), while keeping it human-readable
  • Stores each included subtree in its own hierarchy, reducing file name length
  • You can plug in the serialization formatter you desire - such as the YAML provider - to format items how you want in the tree

Deserialize abstract items into Sitecore with the Sitecore storage provider

  • Turn Sitecore items into IItemData with the ItemData class
  • Deserialize an IItemData instance into a Sitecore item with DefaultDeserializer (used via SitecoreDataStore)
  • Query the Sitecore tree in abstract with SitecoreDataStore, as if it were any other serialization store

Item comparison APIs

  • Compare any two IItemData instances regardless of source
  • Customize comparison for field types or specific fields with Field Comparers
  • Get a complete readout of changes as an object model

Improvements

Improvements are in comparison to Sitecore serialization and the functionality in Unicorn 2.

  • Deleting template fields will no longer cause errors on deserialization for items that are serialized with a value for the deleted field (e.g. standard values)
  • Deserialization knows how to properly change field sharing or field versioning and port existing data to the new sharing setting

Rainbow Organization

Rainbow consists of several projects:

  • The core Rainbow project contains core interfaces and components that most all serialization components might need, for example SFS, item comparison tools, field formatters, and item filtering
  • Rainbow.Storage.Yaml implements serializing and deserializing items using a YAML-based format. This format is ridiculously easier to read and merge than standard Sitecore serialization format, lacking any content length attributes, supporting any type of newline characters, and supporting pretty-printing field values (e.g. multilists, layout) for simpler merging when conflicts occur.
  • Rainbow.Storage.Sc implements a data store using the Sitecore database. This can be used to read and write items to Sitecore using the IDataStore interface.

Extending Rainbow

Rainbow is designed to be loosely coupled. It’s recommended that you employ a Dependency Injection framework (e.g. SimpleInjector) to make your life constructing Rainbow objects easier. Of course you can also construct objects without DI.

Rainbow has extensive unit and integration tests and hopefully easy to understand code, which serve as its living documentation. As of now, Rainbow has 90% test coverage and 220 tests. Unicorn, which uses Rainbow as a service, can also be a useful source of examples.

If you can’t find what you’re looking for you can find me on Twitter (@kamsar) or on Sitecore Community Slack.

Next time, we’ll start talking about what’s new in Unicorn 3 - other than all the Rainbow things that come along because Unicorn uses Rainbow.

Reinventing the Serialization File System: Rainbow Preview, Part 2

This is the second post in a series about Rainbow, an advanced serialization library for Sitecore. Part 1, dealing with improving the serialization file format, can be found here.

Introducing Serialization File System

Rainbow supports the idea of a data store, which is an abstraction of the necessary components to store and retrieve Sitecore items. A data store need not be serialized: Sitecore’s database is itself implemented as a data store as far as Rainbow is concerned.

This time we’ll be talking about the Serialization File System data store. Serialization File System, or SFS for short, is a pattern for organizing files on disk to represent a Sitecore item tree. It’s only a pattern: it depends on a serialization formatter, such as the YAML formatter, to do the actual serialization and deserialization. This means that you can use whatever format you want, without having to reimplement the whole organizational structure. Or if you can’t stand SFS you can make your own data store and keep the YAML format :)

Why do we need SFS?

To understand why we need SFS, let’s get down to data structures for a minute here. Windows’ file system is essentially a B-tree. Sitecore’s content tree is also essentially a B-tree. So we can map the two together pretty easily, right? WRONG.

Path Length

Sitecore’s content tree is effectively infinite (by definitions of “infinite” that mean “up to 20 levels by default”) in depth. The Win32 filesystem APIs on the other hand, have a maximum path length of 240 characters. So now we run into a problem where you can have a path in Sitecore that is unrepresentable on the filesystem, because the content path is too long. Whoops. Sitecore’s serialization APIs handle this situation, albeit a bit clunkily. They take the item path and apply a trivial hash algorithm to it, then put the item at the root of the serialization tree in a folder named that hash. For example (slightly simplified):

  • given /sitecore/foo/bar/baz/quux/quince (imagine this is longer)
  • baz might be written to c:\serialization\master\sitecore\foo\bar\baz.item
  • but quux might go to c:\serialization\A8CF73\quux.item
  • and quince might go to c:\serialization\C732F1\quince.item - not even a child of quux

This solution works, but it also means that the hierarchy becomes nearly unintelligible once short-paths start being used. The hash is not a standard algorithm; there is no obvious way to see what it means without reading the .item file within to see what its path is. It also makes merging very unintuitive if the short-pathed items are changed because the file path tells you nearly nothing.

So now we’ve seen path length problems, but what about a special case of that? What happens if you create an item name that is so long that it alone becomes too long to fit in the Windows filesystem path limits? Imagine an item with a 300-character name. Well if this occurs - as can pretty easily with Web Forms for Marketers - the Sitecore serialization APIs all choke. Unicorn 2 fails because it uses the Sitecore APIs. Pretty sure TDS does too, but I could be wrong.

Filenames are not unique

Sitecore’s content tree, and the APIs to retrieve items “by path,” are quite misleading. Why? On the Windows filesystem the file name is a unique key. However under the covers of Sitecore the item’s ID is the unique key. Duplicate names are totally allowed - which means that getting something “by path” is horribly ambiguous (.5% of the time, but still). But this is the basis upon which the Sitecore standard serialization hierarchy is built: ambiguity.

But they made the right choice. Do you want to merge conflicts in items named by ID on disk? How about deal with that you’d hit the path length limit in about 3 levels with 35+ character file names? Thought not. But this means we have a problem to solve: how do we map non-unique nodes in the database (the item name) onto unique nodes (file names) on the file system.

Sitecore’s serialization APIs solve this in a somewhat decent fashion. When an item is serialized the parent paths are evaluated for items of the same name, and if one exists then the item path has the parent ID appended to it. For example:

  • given two items with path /sitecore/foo
  • when you write them to disk they would get a path like c:\serialization\master\sitecore\foo_d0ec0aa931eb46ecb241d1ca18b4c5b2.item where each item has its ID postfixed to the filename

Seems legit, right? Unfortunately it’s very broken and can actually corrupt your serialization tree. Don’t believe me? Try this:

  • given an item with path /sitecore/foo, serialize it
  • now we have c:\serialization\master\sitecore\foo.item
  • next we create another foo item with a different ID
  • now we’ll have foo.item, and foo_d0ec0aa931eb46ecb241d1ca18b4c5b2.item in the same folder
  • so far, so good. but now serialize the original foo item again.
  • now we have foo.item with old data, foo_d0ec0aa931eb46ecb241d1ca18b4c5b2.item, AND foo_2195e766591d4baf8ac63a1efa43526d.item
  • yep. two items on disk for the original foo.item, and a corrupted tree. Bad.

Pathing bugs in the API

The Sitecore serialization pathing APIs are unfortunately pretty buggy. There are several methods that assume you’d never serialize an item anywhere outside the default serialization folder - and in fact throw an error if you try it. These methods are all static, and thus the only way to change their behavior is by decompiling them and using your own fixed copy of them. That’s not in the least suboptimal, and it’s quite intentional that Rainbow has zero dependency on the Sitecore serialization APIs.

Tired of hearing me rant about bugs? Me too, how about we talk about solving these problems instead!

How does SFS work

The SFS data store is capable of storing practically infinite item path depths, as long of an item name as you please, handling duplicate filenames in all cases, and writing to any path you please. SFS is based on the idea of a ‘solid’ tree, where every node with children must also contain the serialized parent, for example:

  • given /sitecore as the root
  • you could serialize /sitecore/foo
  • but you could not serialize /sitecore/foo/bar without also serializing foo

Sitecore serialization does allow for ‘sparse’ trees where you can have unserialized parents. The astute may be reading this and asking “wtf, do we have to serialize the whole database then?”

s silly. Too bad this isn

SFS supports relative trees instead of a single monolithic tree that represents an entire database like Sitecore uses. For example:

  • given /sitecore/templates/User Defined as the root of a tree
  • User Defined might be serialized as c:\rainbows\User Defined\User Defined.yml

See how Rainbow ignores the relative Sitecore path? This means for deep tree roots your filesystem path length can be much shorter because it doesn’t need empty parents, resulting in fewer over-length file paths. It also means that you can browse to the items with fewer clicks in Windows Explorer. Unicorn 3 also allows you to name your trees (which in Unicorn terms map to an <include> entry on your predicate), so your serialization folder might resemble:

 c:\rainbow
     Templates
         User Defined.yml
     Funny Giphys
         Animations.yml
     Old School Content\
         Home.yml
         Home
             About us.yml

Note how the tree root folder name need not match the root item name (though it does by default, but it’s pretty easy to have duplicate names doing that). In the above case the c:\rainbow\templates item might be rooted in Sitecore at /sitecore/templates/User Defined. Note: the use of solid trees does preclude some kinds of exclusions, namely anything not path-based because other exclusions could result in a sparse tree.

Long Path Handling

SFS handles long content paths by using loopback paths. These are similar in concept to Sitecore’s hashed paths, but unlike hash-paths they actually transplant any children of that path under the loopback path. The loopback paths are also named by the item ID of the parent of the items in the loopback. Let’s look at an example (with a contrived very short max path length):

  • Given c:\rainbow\root\some\rather\long\path\length\thing\parent.yml (ID: 2195e766-591d-4baf-8ac6-3a1efa43526d)
  • Suppose that adding 3 characters to the end of “parent” makes the child path over-max-length. So if we add loopchild under parent,
  • c:\rainbow\root\2195e766591d4baf8ac63a1efa43526d\loopchild.yml becomes the child’s path, based on the parent item’s ID
  • If we add quux under loopchild, it goes to c:\rainbow\root\2195e766591d4baf8ac63a1efa43526d\loopchild\quux.yml - under the same loopback, adding a level of human readability hash-paths lack

Loopback paths may loop multiple times for extremely long path lengths. Loopbacks also handle the case where some children are short enough to live under the parent and others’ name puts it over the limit into a loopback.

Duplicate File Name Handling

If you have items of the same name under the same parent in Sitecore, SFS uses a similar approach to Sitecore’s APIs with item_id.yml as the filename format. However SFS does a lot of correctness verification that Sitecore does not, because SFS generates the paths based on the filesystem and not Sitecore. For example, suppose you had two items of the same name and then added different children to each. SFS will actually resolve the path by evaluating down the filesystem paths until it finds the matching path regardless of parentage - and it returns all matches instead of whichever one it feels like. The case of writing items at different times is also handled; SFS checks existing same named items for any with the same ID and reuses the name. So you never get renames for no reason or corrupted trees.

This may sound like a lot of file reading. Yes, it can cause more file reads than Sitecore’s approach. However with a smart path to ID cache, and the ability to read item metadata (which in the case of YAML items means only reading the first 4 lines of the file, which is 3x or more faster than reading the whole thing), both dumping and syncing items from SFS is up to 50% faster than Unicorn 2 on the same items.

Long Item Name Handling

Obscenely long item names are handled by simply truncating them before putting them on the filesystem. A setting controls the maximum name length. Parent items are properly disambiguated using ID suffixes if two differently named items truncate to the same short name. I serialized a whole Sitecore database with max filename length = 5 to test this. So many duplicate names then!

Reliability

SFS (and the YAML formatting pieces we talked about last time) has 95% code coverage, and all bugs that have been found so far are covered with additional tests.

Hey, what about an index file to speed up querying?

In fact Rainbow was originally designed to be a general purpose data store, where you could directly query an item by ID, path, template ID, or parent ID. It had in memory indexes that it would maintain, backed variously by a single global index file, and reading the headers of each serialized file. The indexes worked pretty well, in fact. But they had several major problems that caused me to scrap them:

  • There’s no good way to maintain an index cache, because you may not be the only writer to the index file (e.g. you git pull someone else’s changes). FileSystemWatcher is not 100% reliable, and that would be a necessity to avoid data corruption, which is a big deal when you’re talking about serialization.
  • A centralized index file precludes the possibility of easily copying items between trees because the index entries would have to go with it
  • A filesystem logically organized for a computer and index, where items are stored by ID-based filenames, is nearly unintelligible to a human and merging becomes hairy, and commit errors due to not being able to see the item path easily would be possible

Can I have it yet?

Why yes, yes you can. But…

…so if you could just go ahead and download the beta, that’d be great.

Rainbow and Unicorn 3 betas are currently available from NuGet (you’ll need to enable prerelease packages). For a fresh install it should be as simple as installing the Unicorn package - unless you want to hack around the Rainbow APIs, in which case Unicorn is not required. For upgrading from Unicorn 2, there’s a doc for that.

It’s vaguely stable, but hasn’t had extended testing in real life Sitecore development like a final release will have. It no doubt has some bugs. The bugs may be more than minor. Feel free to test it if that doesn’t scare you; send me a ticket on GitHub if you find issues. Now’s a great time for feature requests too!

Rethinking the Sitecore Serialization Format: Rainbow Preview, part 1

If you’ve worked with Sitecore in a team setting for any length of time, you’ve probably had to deal with item serialization. Item serialization is nearly a requirement to be effective when you need to share templates, renderings, placeholder settings, custom experience buttons, and all the other Sitecore items stored in the databases that are effectively development artifacts, as opposed to content. Being development artifacts, we want to keep them under source control so we can version them, develop feature branches with them, and deploy them using continuous integration to our shared Sitecore installations.

If you’ve dealt with Sitecore’s serialization format in a team environment for any length of time, you’ve probably started to realize some of its shortcomings. Because,

Let’s pick on multilist fields, for example. This is what a multilist looks like in SSF:

----field----
field: {E391B526-D0C5-439D-803E-17512EAE6222}
name: Allowed Controls
key: allowed controls
content-length: 194

{E11BDB3B-1436-4059-90F6-DE2EE52A4EB4}|{D9C54253-37FF-4D64-8894-5373D8799361}|{F118E540-CC75-4AA9-A62B-D6ED9E6F77E4}|{A813194F-32F4-4501-A430-6602ABF73535}|{2F4ADF0B-9633-4EE9-B339-8CA32E2C3293}

Now let’s imagine using this in a team environment. Alice creates a new rendering, and needs to add it to placeholder settings so it can be used in Experience Editor. Meanwhile on another branch, Bob adds a different rendering he’s made to the same placeholder settings. What happens? Merge conflict. And not a simple, easy to solve conflict: one where a very long single line has to be merged by hand - because merging is line oriented. On top of that, you must not forget to recalculate the content-length. Oh joy.

Then let’s take a look at the data that’s stored. Do we need key to load the field? Do we need name even? Nope, though having a name around makes it easier to understand - we just don’t need two. Then how about that the format is endline specific - don’t leave home in Git without the special .gitattributes to leave those alone, or the files won’t be read by the Sitecore APIs.

Here’s another gremlin about serialization: it saves fields that aren’t important to development artifacts that are version controlled. Yes, I’m talking about you __Revision, __Updated by and __Updated. Certain Sitecore tools - like say the template editor - cause a ton of item saves and they are not picky about if any actual data fields have changed. This means that if I add a Foo field to my Bar template, the last updated on Bar, Foo, and every one of Bar‘s fields gets changed. Even if there is an actual data change, and the data change is auto-mergeable, you can still get conflicts on the statistical fields. Welcome to annoying merge conflict city, folks!

Part I: The JSON era

I’ve been at this a while. Version 1 is largely on the junk pile. Why? JSON.

JSON seemed like an obvious candidate format: it’s quick to parse, mature, and easy as pie to implement with JSON.NET and a few POCOs. In fact, its performance is quite similar to ye olde content-length above. It was certainly a step up, and I was not the only person to think of this idea. I learned a lot and got many ideas from Robin’s prototype, not the least of which was that we should reformat field values to make merging easier.

Oddly enough it was the idea of field reformatting that made JSON an untenable format. The problem is that merging is line oriented - so we would want to reformat that multilist from the first example with one GUID per line such that merge tools could make sense of it. Well JSON does not allow literal newlines in data values, instead it uses the string literal \r and/or \n. Unhelpful - but it makes parsing really fast.

Part II: YAML Ain’t Markup Language

With JSON in the bin, I started poking around for formats that had already been created which would support our needs. YAML (the acronym is the title of this section) fit the bill nicely. The design goals of YAML are to be a more human readable superset of JSON for things like configuration files and object persistence. Allows multiple lines, human readable, allows lists and nesting - nice.

Well the downside is that because of the flexibility of the format, YAML parsers are on the order of 10-100x slower than JSON parsers. It was slooooow. But the good news is that the YAML-based serialization format I had designed was much, much simpler than the entire YAML specification. So I wrote my own reader and writer that supported only the subset of YAML that was necessary. It was fast. The format was easy to read and understand. It had the ability to add field formatters to make values mergeable (at present, for multilists and layout fields). I wanted to start using it pretty badly in real projects :)

So without further ado, here is the same item that we took the multilist from above, but in YAML:

---
ID: 38ddd69e-fb0a-4970-926e-dfb0e5b9a5e1
Parent: 68e4c671-797d-4a89-8fa6-775926f1381d
Template: 5c547d4e-7111-4995-95b0-6b561751bf2e
Path: /sitecore/layout/Placeholder Settings/reductio/ad/absurdum
SharedFields:
- ID: 7256bdab-1fd2-49dd-b205-cb4873d2917c
  # Placeholder Key
  Value: heading
- ID: e391b526-d0c5-439d-803e-17512eae6222
  # Allowed Controls
  Type: TreelistEx
  Value: |
    {E11BDB3B-1436-4059-90F6-DE2EE52A4EB4}
    {D9C54253-37FF-4D64-8894-5373D8799361}
    {F118E540-CC75-4AA9-A62B-D6ED9E6F77E4}
    {A813194F-32F4-4501-A430-6602ABF73535}
    {2F4ADF0B-9633-4EE9-B339-8CA32E2C3293}
Languages:
- Language: en
  Versions:
  - Version: 1
    Fields:
    - ID: 25bed78c-4957-4165-998a-ca1b52f67497
      # __Created
      Value: 20100310T143300
    - ID: 52807595-0f8f-4b20-8d2a-cb71d28c6103
      # __Owner
      Value: sitecore\admin
    - ID: 5dd74568-4d4b-44c1-b513-0af5f4cda34f
      # __Created by
      Value: sitecore\admin
    - ID: 87871ff5-1965-46d6-884f-01d6a0b9c4c1
      # Description
      Value: <p>The heading of a page, above any content renderings.</p>

Notice how the fields’ names - nonessential data - are in YAML comments. Present for humans to read, not necessary for the machine. The Allowed Controls TreelistEx field is also an example of the YAML multiline format, starting with the pipe and the data on a newline, indented further. YAML uses significant whitespace to define structure, making it easy to read and also not requiring hacks like content-length to efficiently parse.

You may notice that only the Allowed Controls field has a field type value. This is because the value was formatted with a FieldFormatter, so when deserializing the value it uses the type to figure out which formatter to “unformat” the value back into Sitecore with.

The way that languages are structured is also slightly different here: each language has a top level list item, but versions are all grouped hierarchically under their language. So in this format, language is a construct aside from “da-DK #1” like the Sitecore database uses.

We’ve also used field level ignores to not even store the constantly changing statistics fields (e.g. __Updated). This is optional, if you choose to do so, but it does lead to wonderfully compact files on disk.

The YAML format is also completely endline ignorant - it doesn’t care if it gets \n or \r\n.

Yes, please?

Well sorry - it’s not quite done yet, though the YAML part is pretty stable. The YAML format is a component of my upcoming Rainbow library. Rainbow is essentially a modernized serialization API that aggressively improves both the serialization format and the storage hierarchy, plus providing deep item comparison capabilities.

Rainbow aims to only be an API and has no default configuration or frontend. It will be freely available to use.

I have several projects that I intend to use Rainbow for - maybe you can think of some uses too?

  • Unicorn 3.0, being developed alongside it, will use Rainbow for storage, comparison, and formatting needs.
  • Rainbow for SPE will enable serializing and deserializing items in YAML using Sitecore Powershell Extensions

At present, Rainbow is alpha-quality. I wouldn’t use it unless you want to get your hands dirty, and I might change APIs as needed.

In the next post of this series, I’ll go over thinking behind improvements to the standard storage hierarchy to enable infinite depth while still being human readable. But there are still some bugs to fix before I write that ;)

Transparent media optimization with Dianoga 2

I’ve pushed a significant update to Dianoga, my framework to transparently optimize the size of media library images, including rescaled sizes.

My original post on Dianoga 1.0 explains why you’d want to use it in detail, but as an overview how does saving 5-40% overall on image bandwidth sound? All you have to do is install a NuGet package that adds one dll and one config file to your solution. That’s it. You can also find documentation on GitHub.

What’s new in 2.0

Dianoga 2 is faster and more extensible. The key change is that media is now optimized asynchronously, resulting in a near-zero impact on site performance even for the first hit. Dianoga 1.0 ran its optimization in the getMediaStream pipeline, which meant it had to complete the optimization while the browser was waiting for the image. For a large header banner, this could be a couple seconds on slower hardware (for the first hit only). Dianoga 2 replaces the MediaCache and performs asynchronous optimization after the image has been cached. This does mean that the first hit receives unoptimized images, but subsequent requests after the cache has been optimized get the smaller version.

Extensibility has also been improved in Dianoga 2. Previously, there were a lot of hard-coded things such as the path to the optimization tools and the optimizers themselves. Now you can add and remove optimizers in the config (including adding your own, if you wanted say a PNG quantizer), and move the tools around if desired.

There was a bug in Dianoga 1.0 that resulted in the tools DLL becoming locked, which could cause problems with deployment tools. This has been fixed in 2.0 by dynamically loading and unloading at need. Thanks to Markus Ullmark for the issue report.

Upgrade

Upgrading from Dianoga 1.0 to 2.0 is fairly simple; upgrade the NuGet package. You can choose to overwrite your Dianoga.config file (recommended), or merge it with the latest in GitHub if you don’t want to.

Have fun!

Sitecore Azure Role Sizing Guide

It seems like everyone is hot to get Sitecore on Azure these days, but there’s not a whole lot of guidance out there on how Sitecore scales effectively in an Azure environment. Since I get asked the “what size instances do we need?” or “how much will Azure cost us?” question a lot, I figured some testing was in order.

The environment I tested in is a real actual customer website on Sitecore 7.2 Update-4 (not yet launched but in late beta). The site has been relatively optimized with normal practices - namely, appropriate application of output caching and in-memory object caching. The site scores a B on WebPageTest - due to things out of my control - and is in decent shape. We are deploying to Azure Cloud Services (PaaS), using custom PowerShell scripting - not the Sitecore Azure module (we have had suboptimal experiences with the module). We’re using SQL Azure as the database backend, and the Azure Cache session provider. Note that the Azure Cache session is not supported for xDB (7.5/8) scenarios.

The Tests

I wanted to determine the effects of changing various cloud scaling options relative to real world performance, but I also didn’t want to spend days benchmarking. So I settled on two test scenarios:

  • Frontend performance. This is a JMeter test that hits the site with 10,000 requests to the home page, with 1,000 threads at once. The idea is to beat the server into submission and see how it does.
  • Backend performance. I chose Lucene search index rebuilding, as that is a taxing process on both the database and server CPUs, and it conveniently reports how long it took to run. Note that you should probably be using Solr for any xDB scenarios or any search-heavy sites.

These are not supposed to be ‘end of story’ benchmarks. There are no doubt many bones to pick with them, and they’re certainly not comprehensive. But they do paint a decent picture of overall scaling potential.

Testing Environment

The base testing environment consists of an Azure Cloud Service running two A3 (4-core, 7.5GB RAM, HDDs) web roles. The databases (core, master, web, analytics) are SQL Azure S3 instances running on a SQL Azure v12 (latest) server.

Several role variants were tested:

  • 2x A3 roles, which are 4-cores with 7GB RAM and HDD local storage (~$268/mo each)
  • 2x D2 roles, which are 2-cores (D-series cores are ‘60% faster’ per Microsoft) 7GB RAM and SSD local storage (~$254/mo each)
  • 2x D3 roles, 4-cores 14GB RAM and SSD local storage (~$509/mo each)
  • 2x D4 roles, 8-cores 28GB RAM and SSD local storage (~$1018/mo each)

In addition, I tested the effect of changing the SQL Azure sizes on the backend with the D4 roles. I did not test the frontend in these cases, because that seemed to scale well on the S3s already.

  • S3 databases, which are 100DTU “standard” (~$150/mo each)
  • P1 databases, which are 125DTU “premium” (~$465/mo each)
  • P2 databases, which are 250DTU “premium” (~$930/mo each)

Results

Our first set of tests are comparing the size of the web roles in Azure. It is quite likely that these results would also be loosely applicable to IaaS virtual machine deployments as well.

Throughput

The throughput graph is about as expected: add more cores, get more scaling. The one surprise is that the D2 instance, with 4 total cores at a higher clock and SSD disk, is able to match the A3 with 8 total cores (D2 and A3 cost a similar amount).

Keep in mind that all of these were using the same SQL Azure S3 databases as the backend - for general frontend service, especially with output caching, Sitecore is extremely database-efficient and does not bottleneck on lower grade databases even with 16 cores serving.

Note that I strongly suspect the bandwidth between myself and Azure was the bottleneck on the D4 results, as I saw it burst up to more like 500 during the main portion of the test.

Latency

Latency continues the interesting battle between A3 and D2. The A3 pulls out a minor - within margin of error - victory on average but the SSD of the D2 gives it a convincing performance consistency win with the 95% line over 1 second faster. Given that A3 and D2 cost a similar amount, it seems D2 would be a good choice for budget conscious hosting - but if you can afford Sitecore, you should probably stretch your licenses to more cores per instance.

The D3 is probably the ideal general purpose choice for your web roles.

The latency numbers on the monstrous 8-core D4 instances tell the tale that my 1,000 threads and puny 100Mb/s bandwidth were just making the servers laugh.

These graphs illustrate the latency consistency you gain from a D-series role with SSD:

These two graphs plot the latency over time during the JMeter test. You can see how the top one (A3) is much less consistent thanks to its HDDs, whereas the lower one (D3) is smoother, indicating fewer latency spikes. The latency decreases in the final stage of the test due to some JMeter threads having completed their 10-request run earlier than others, so less server load is in play.

Bandwidth

Bandwidth is another measure of how much data is being pushed out. Once again, the lesser gain than you might expect from D4 was due to my downstream connection becoming saturated as the monstrous servers laughed.

Backend: Rebuild all indexes

For this test, I rebuilt the default sitecore_core_index, sitecore_master_index, and sitecore_web_index indices.

Here we see the limitations of the S3 databases creep in during our D4 testing. Up to D3 seems to be limited by CPU speed, but the D4 is actually slower until we both up the max index rebuild parallelism and go to P2 databases to feed the monstrous 8-core beast.

Note that with SQL Azure, the charge is per-db - so if you wanted to kit out core, master, and web on P2s you’d be paying near $3,000/month just for the DBs. At that point it might be worth considering using IaaS SQL or looking at the Elastic Database Pools to spread around the bursty load of many databases.

Final Words

Overall Sitecore is a very CPU-intensive application where you can generally get away with saving a few bucks on the databases in favor of more compute, especially in content delivery. As in most applications, SSDs make a significant performance improvement to the point where I would suggest Azure D3 instances for nearly all normal Sitecore in Azure deployments, unless you need Real Ultimate Power from a D4 - or for the truly ridiculous a 16-core D14 ;)

For general frontend scaling the SQL Azure S3 seems to be the best price/performance until you go past quad cores, or past two web roles (one would assume 4x D3 would be relatively similar to 2x D4). Once you have 16 cores serving the frontend, you’ll be bottlenecked below P2 databases at least for index rebuilding. Publishing probably has similar characteristics.

As always when designing for scale, the first step should be to optimize your code and caching. For example, before I did any caching optimizations I did the same JMeter test on the site. The A3s only put out 40 requests/sec - after output caching and tweaks, they did 144 requests/sec. That’s not to say the site was slow pre-optimization: TTFB was about 220ms without load. But once you add load, even small optimizations make a huge difference.