Another big obstacle to XSLT adoption is that grouping is hard to do. Yet a lot of XML data does not come from a database (if it did, the right way to group it would be SQL anyway) but from some (un)known external source, and it needs to be grouped into a known structure, preferably during the transformation.

In my earlier post about a better visualizer for the XML output of the Visual Studio 2005 conversion wizard I mentioned that I used Muenchian grouping and Microsoft didn't, but I did not explain exactly what the difference was. Besides, grouping is another great new addition in XSLT 2.0, so I could not resist: I reimplemented the core of the stylesheet (the grouping part) in XSLT 2.0 to find out whether it really is easier and/or better.

Here's the problem - some app (in this case Visual Studio 2005) outputs XML that looks like this:

<UpgradeLog>
  <Properties>
    <Property Name="Solution" Value="My.Solution"/>
    <Property Name="Solution File" Value="C:\My.Solution.sln"/>
    <Property Name="User Options File" Value="C:\My.Solution.suo"/>
    <Property Name="Date" Value="24. decembar 2005"/>
    <Property Name="Time" Value="12:56"/>
  </Properties>
  <Event ErrorLevel="0" Project="" Source="My.Solution.sln" Description="File successfully backed up..."/>
  <Event ErrorLevel="0" Project="" Source="My.Solution.suo" Description="File successfully backed up..."/>
  <Event ErrorLevel="0" Project="My.AddIn" Source="My.AddIn.csproj" Description="Project file successfully backed up..."/>
  <Event ErrorLevel="0" Project="My.AddIn" Source="My.AddIn.csproj.user" Description="Project user file successfully backed up..."/>
  <Event ErrorLevel="0" Project="My.AddIn" Source="AmazonItemControl.cs" Description="File successfully backed up..."/>
  <!-- rest of elements snipped -->
</UpgradeLog>

The job is to reorganize this into another XML document with a more obvious structure: for each project (that is, each distinct @Project value among the Event elements), and then for each source (each distinct @Source within that project), list all the events. The output should look something like this:

<projects>
   <project name="" solution="My.Solution">
      <source name="My.Solution.sln">
         <event error-level="0" description="File successfully backed up..."/>
         <event error-level="0" description="Solution converted successfully"/>
         <event error-level="3" description="Converted"/>
      </source>
      <source name="My.Solution.suo">
         <event error-level="0" description="File successfully backed up..."/>
      </source>
   </project>
   <!-- more project elements here -->
</projects>

Nothing special - we turn a flat list into a tree so that it's more obvious what belongs where. (By the way, complete example files that demonstrate all this can be downloaded from here.) Let's first look at Microsoft's implementation. For the moment I will ignore the detail that the final tree is built in two steps, the first of which just does the sorting, but note that this step exists and precedes what I am about to show. First the stylesheet; the discussion follows:

  <xsl:key name="ProjectKey" match="Event" use="@Project"/>
  <xsl:template match="Events" mode="createProjects">
    <projects>
      <xsl:for-each select="Event">
        <xsl:if test="(1=position()) or (preceding-sibling::*[1]/@Project != @Project)">
          <xsl:variable name="ProjectName" select="@Project"/>
          <project>
            <xsl:attribute name="name">
              <xsl:value-of select="@Project"/>
            </xsl:attribute>
            <xsl:if test="@Project=''">
              <xsl:attribute name="solution">
                <xsl:value-of select="@Solution"/>
              </xsl:attribute>
            </xsl:if>
            <xsl:for-each select="key('ProjectKey', $ProjectName)">
              <xsl:if test="(1=position()) or (preceding-sibling::*[1]/@Source != @Source)">
                <source>
                  <xsl:attribute name="name">
                    <xsl:value-of select="@Source"/>
                  </xsl:attribute>
                  <xsl:variable name="Source">
                    <xsl:value-of select="@Source"/>
                  </xsl:variable>
                  <xsl:for-each select="key('ProjectKey', $ProjectName)[ @Source = $Source ]">
                    <event>
                      <xsl:attribute name="error-level">
                        <xsl:value-of select="@ErrorLevel"/>
                      </xsl:attribute>
                      <xsl:attribute name="description">
                        <xsl:value-of select="@Description"/>
                      </xsl:attribute>
                    </event>
                  </xsl:for-each>
                </source>
              </xsl:if>
            </xsl:for-each>
          </project>
        </xsl:if>
      </xsl:for-each>
    </projects>
  </xsl:template>

The outer loop visits each Event element and uses a funky XPath expression to skip every Event whose project is the same as that of the element just before it, thus simulating the first grouping level. (This works only because the events were sorted in the earlier step.) It then uses the key to select all events with the same project name as the one in the outer loop, and applies the same XPath expression again, this time skipping duplicate sources, thus simulating the second grouping level. The innermost loop simply selects all events whose project and source attributes match the values picked up in the outer loops.
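
The test is easier to see in isolation. The following is a minimal sketch of my own, not part of Microsoft's stylesheet, reduced to the first grouping level. It assumes that the Event elements are siblings under the document element and are already sorted by @Project; the match pattern and the bare project output are simplifications made for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: Event siblings are already sorted by @Project. -->
  <xsl:template match="/*">
    <projects>
      <xsl:for-each select="Event">
        <!-- Start a new group at the first Event, or whenever the nearest
             preceding sibling element carries a different @Project. -->
        <xsl:if test="position() = 1 or preceding-sibling::*[1]/@Project != @Project">
          <project name="{@Project}"/>
        </xsl:if>
      </xsl:for-each>
    </projects>
  </xsl:template>
</xsl:stylesheet>
```

Because the comparison only looks at the immediately preceding sibling, a project whose events appear in two separate runs of an unsorted log would be emitted twice.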

Not really elegant IMHO, but relatively readable assuming you are comfortable with XPath. It also looks a bit suboptimal, but more on performance later. Let's have a look at my version (slightly revised from the original post) using XSLT 1.0:

  <xsl:key name="events-by-project" match="Event" use="@Project"/>
  <xsl:key name="events-by-source" match="Event" use="@Source"/>
  <xsl:template match="UpgradeLog">
    <projects>
      <xsl:variable name="Projects"
                    select="Event[count(. | key('events-by-project', @Project)[1]) = 1]"/>
      <xsl:for-each select="$Projects">
        <xsl:sort select="@Project" order="ascending"/>
        <project>
          <xsl:variable name="prj" select="@Project"/>
          <xsl:attribute name="name">
            <xsl:value-of select="$prj"/>
          </xsl:attribute>
          <xsl:if test="$prj =''">
            <xsl:attribute name="solution">
              <xsl:value-of select="/UpgradeLog/Properties/Property[@Name = 'Solution']/@Value"/>
            </xsl:attribute>
          </xsl:if>
          <xsl:variable name="Sources"
                        select="../Event[@Project = $prj and count(. | key('events-by-source', @Source)[1]) = 1]"/>
          <xsl:for-each select="$Sources">
            <xsl:sort select="@Source" order="ascending"/>
            <source>
              <xsl:variable name="src" select="@Source"/>
              <xsl:attribute name="name">
                <xsl:value-of select="@Source"/>
              </xsl:attribute>
              <xsl:variable name="Descriptions" select="../Event[@Project = $prj and @Source = $src]"/>
              <xsl:for-each select="$Descriptions">
                <xsl:sort select="@ErrorLevel" order="ascending"/>
                <event>
                  <xsl:attribute name="error-level">
                    <xsl:value-of select="@ErrorLevel"/>
                  </xsl:attribute>
                  <xsl:attribute name="description">
                    <xsl:value-of select="@Description"/>
                  </xsl:attribute>
                </event>
              </xsl:for-each>
            </source>
          </xsl:for-each>
        </project>
      </xsl:for-each>
    </projects>
  </xsl:template>

It's not that much shorter, is it? Nevertheless, it should at least be more readable, assuming you know about Muenchian grouping. Judging by the use of keys, it should also be more efficient. Let's see - the first level is classic Muenchian: based on a key, I pick out one Event node per distinct project. Then I use the same technique to group by source, filtering by the project picked in the outer loop. Finally, the innermost loop is more or less the same as Microsoft's. Note that I sort along the way, while Microsoft did it in a step I did not show.
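
One caveat about the second level: a key on @Source alone has only one "first" Event per source name across the whole log, so if the same source name occurred under two projects, the second project would not list it. A common variation of Muenchian grouping avoids this with a composite key. The sketch below is one way to write it, not the stylesheet from the download; the key name events-by-project-and-source and the '|' separator are hypothetical choices, and it assumes '|' never appears in a project name. The solution attribute is left out for brevity.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:key name="events-by-project" match="Event" use="@Project"/>
  <!-- Hypothetical composite key: project and source joined with '|'
       (assumes '|' never occurs in a project name). -->
  <xsl:key name="events-by-project-and-source" match="Event"
           use="concat(@Project, '|', @Source)"/>

  <xsl:template match="UpgradeLog">
    <projects>
      <!-- First level: the first Event of each distinct @Project. -->
      <xsl:for-each select="Event[generate-id() =
                                  generate-id(key('events-by-project', @Project)[1])]">
        <xsl:sort select="@Project"/>
        <project name="{@Project}">
          <!-- Second level: among this project's events, the first Event
               of each distinct (project, source) pair. -->
          <xsl:for-each select="key('events-by-project', @Project)
                                [generate-id() =
                                 generate-id(key('events-by-project-and-source',
                                                 concat(@Project, '|', @Source))[1])]">
            <xsl:sort select="@Source"/>
            <source name="{@Source}">
              <!-- Third level: every event of this (project, source) pair. -->
              <xsl:for-each select="key('events-by-project-and-source',
                                        concat(@Project, '|', @Source))">
                <xsl:sort select="@ErrorLevel"/>
                <event error-level="{@ErrorLevel}" description="{@Description}"/>
              </xsl:for-each>
            </source>
          </xsl:for-each>
        </project>
      </xsl:for-each>
    </projects>
  </xsl:template>
</xsl:stylesheet>
```

The generate-id() comparison is equivalent to the count(. | ...) = 1 test used above; both check whether the current node is the first node returned by the key.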

Finally, let's see what the same thing looks like in XSLT 2.0 (tried with both AltovaXML and Saxon-B):

  <xsl:template match="UpgradeLog">
    <projects>
      <xsl:for-each-group select="Event" group-by="@Project">
        <xsl:sort select="@Project"/>
        <project>
          <xsl:attribute name="name" select="@Project"/>
          <xsl:if test="@Project = ''">
            <xsl:attribute name="solution"
                           select="/UpgradeLog/Properties/Property[@Name='Solution']/@Value"/>
          </xsl:if>
          <xsl:for-each-group select="current-group()" group-by="@Source">
            <xsl:sort select="@Source"/>
            <source>
              <xsl:attribute name="name" select="@Source"/>
              <xsl:for-each select="current-group()">
                <xsl:sort select="@ErrorLevel"/>
                <event>
                  <xsl:attribute name="error-level" select="@ErrorLevel"/>
                  <xsl:attribute name="description" select="@Description"/>
                </event>
              </xsl:for-each>
            </source>
          </xsl:for-each-group>
        </project>
      </xsl:for-each-group>
    </projects>
  </xsl:template>

Aha - that's more like it. Not only is it shorter overall, it's also a lot easier to read. With the new for-each-group construct, it's much more obvious that I am doing grouping here. Note that the code is also visibly shorter thanks to a small but important difference: the value of xsl:attribute can be specified directly in its select attribute, and we have quite a few of these.
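
The listing can probably be tightened further. The variant below is a sketch rather than the version I measured: it uses current-grouping-key() for the group names and sort keys, and attribute value templates on literal result elements instead of xsl:attribute. The solution attribute is again omitted for brevity.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="UpgradeLog">
    <projects>
      <!-- First level: one group per distinct @Project value. -->
      <xsl:for-each-group select="Event" group-by="@Project">
        <xsl:sort select="current-grouping-key()"/>
        <project name="{current-grouping-key()}">
          <!-- Second level: regroup this project's events by @Source. -->
          <xsl:for-each-group select="current-group()" group-by="@Source">
            <xsl:sort select="current-grouping-key()"/>
            <source name="{current-grouping-key()}">
              <xsl:for-each select="current-group()">
                <xsl:sort select="@ErrorLevel"/>
                <event error-level="{@ErrorLevel}" description="{@Description}"/>
              </xsl:for-each>
            </source>
          </xsl:for-each-group>
        </project>
      </xsl:for-each-group>
    </projects>
  </xsl:template>
</xsl:stylesheet>
```

Attribute value templates are also available in XSLT 1.0, so that part of the shortening is not specific to 2.0; for-each-group, current-group() and current-grouping-key() are.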

What about speed and memory consumption? It turns out they are more or less the same across all three stylesheets, except that my 1.0 version needs a bit more memory. What's important to note is that while the 2.0 version is by far the cleanest and easiest to read (which to me is the most important quality here), you do not have to sacrifice speed or memory for it - the 2.0 version was consistently the fastest and used the least memory (even if by a very small margin on both counts).

If some aspects of XSLT 1.0 frustrate you, try XSLT 2.0. You might be very pleasantly surprised. While Altova provides a much better IDE in the shape of XMLSpy Home Edition, their XSLT processor is not as complete as Saxonica's, so I definitely recommend the latter. If you'd like to try some of the schema-related features, though (hopefully I'll write about that soon), then Altova's package is the better deal - it's free, while Saxonica's solution is commercial.
