Friday, January 6, 2017

Azure Data Lake Analytics U-SQL Extracting From JSON At Multiple Levels


JSON documents are basically upside-down trees.  Working with structures like this can present challenges in any programming language.  Microsoft Azure has a language called U-SQL that is used
for processing set-based data in Data Lake Analytics.  Working with trees in set-based languages like T-SQL and U-SQL can be especially challenging, since concepts like recursion and looping aren't part of the set-based methodology.  Accessing JSON files from U-SQL presents similar problems.  Originally, the U-SQL samples on GitHub included a JsonExtractor that lets you start at one level of a JSON document and work downward into the tree from there.  You can define the level you want to start at, but if you need data higher up the tree you have to pull it out in separate queries, which, if your document has a lot of levels, can be tedious.  To reduce the number of queries, I wanted to be able to grab data at multiple levels at once, at levels different from the point I iterate from.  In U-SQL you work with the data as tables as you query and process it, so when flattening JSON you will normally be working through an array and turning each object in that array into a table row.

The JSON document below represents the items that are for sale in a region of food markets.  There is an array of locations, and each location has four sales categories: produce, baked goods, restaurant items, and non-perishables.  Each of those contains an array.  Now suppose I want to see all of the produce items, along with the MarketId and the MarketName, as a set of rows in a table.  I would need to start at the Produce level of the document, get everything within it, and then also get the MarketId and MarketName from the grandparent (the parent's parent) of Produce.

To support this, I added the MultiLevelJsonExtractor, which is now part of the U-SQL examples on GitHub.  This extractor inherits from the JsonExtractor that was originally provided, but it will also look up at the ancestors in the tree for pre-defined properties.

The main difference in the code is that, in addition to the iteration path you would give the normal extractor, you provide the names of the properties you are looking for.  To produce the query I describe above, I would write the following U-SQL:


REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@produce =
    EXTRACT Name string,
            DatePicked DateTime,
            ShelfLifeInDays int,
            Calories int,
            Price decimal,
            MarketId string,
            MarketName string
    FROM "/Samples/Data/MarketLocations.json"
    USING new CodeGenerationSample.Json.MultiLevelJsonExtractor("MarketLocations[*].SalesCategories.Produce[*]",
          false,
          "Name",
          "DatePicked",
          "ShelfLifeInDays",
          "Calories",
          "Price",
          "MarketId",
          "MarketName");


OUTPUT @produce
TO "/Samples/Output/produce.csv"
USING Outputters.Csv(outputHeader : true);
       
What you get here is that the extractor starts from the Produce array and works downward from there.  The MultiLevelJsonExtractor takes three arguments: first, the JSON path to iterate through; second, a boolean indicating whether to bypass an exception thrown while looking for a specified property; and third, a set of string params naming the properties to locate from the point in the document specified by the first argument.  For each of those properties it first looks within the objects of the iterated array.  If a property isn't located at that level, it recursively works up the tree to see if it can be found in the ancestors of the iteration point.
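
The lookup rule described above can be sketched in plain Python (this is only an illustration of the rule, not the actual C# extractor code): try the current object first, then walk up through its ancestors until the property is found.

```python
import json

def find_property(node, name, ancestors):
    """Look on the node itself, then on each enclosing object, nearest first."""
    if isinstance(node, dict) and name in node:
        return node[name]
    for ancestor in reversed(ancestors):  # nearest ancestor first
        if isinstance(ancestor, dict) and name in ancestor:
            return ancestor[name]
    return None  # property not found anywhere up the tree

# A trimmed-down version of the sample document below.
doc = json.loads("""{
  "MarketLocations": [
    { "MarketName": "Central", "MarketId": 1,
      "SalesCategories": { "Produce": [ { "Name": "Granny Smith", "Price": 0.29 } ] } }
  ]
}""")

rows = []
for market in doc["MarketLocations"]:
    categories = market["SalesCategories"]
    for item in categories["Produce"]:
        ancestors = [doc, market, categories]
        rows.append({p: find_property(item, p, ancestors)
                     for p in ("Name", "Price", "MarketId", "MarketName")})

print(rows)
# → [{'Name': 'Granny Smith', 'Price': 0.29, 'MarketId': 1, 'MarketName': 'Central'}]
```

Name and Price come from the Produce item itself; MarketId and MarketName are resolved from the grandparent, which is exactly what the extractor does for the property list you pass in.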

NOTE:
If the same property name exists both at the iteration point and in an ancestor, you won't be able to retrieve the ancestor's value.  I haven't found a good way to support this yet.


{
  "Region": "Main",
  "MarketLocations": [
    {
      "MarketName": "Central",
      "MarketId": 1,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Southern",
      "MarketId": 2,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          },
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Western",
      "MarketId": 3,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Plum Pie",
            "Price": 18.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Ham Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          },
          {
            "Name": "Chicken Soup",
            "Price": 2.25,
            "PreparationTimeInMinutes": 5
          }
        ],
        "NonPerishable": [
          {
            "Name": "Mayo",
            "Type": "Condiments"
          },
          {
            "Name": "Syrup",
            "Type": "Condiments"
          },
          {
            "Name": "Ginger",
            "Type": "Spices"
          }
        ]
      }
    }
  ]
}

Wednesday, November 9, 2016

Azure Data Lake Analytics (U-SQL) Code Generation

Most of us developers get frustrated when it comes to doing work that we feel could be done better or more efficiently.  Personally, I got into development because I realized how software could help me automate my own job, as well as the repetitive tasks my colleagues were working on.


Early in my career, one of my mentors talked about the power of code generation.  At the time, he was talking about it in the context of ASP.NET and PHP, but code generation has proved as useful today as it was back then.

When working on large code bases, simple changes can be daunting even if the code is well designed.  When dealing with data that has hundreds or thousands of properties, I have found code generation to be the most reliable way to manage and tolerate changes in the structure of the code.

New Language, Same Problems

Recently, I was involved in a project to build some analytics with Azure Data Lake Analytics, which uses U-SQL.  U-SQL is similar to T-SQL and is used for Big Data analytics, providing set-based data processing.  There are nice tools for processing data in C# and then using the U-SQL language to define what you want to output into your destination formats.  To keep the code manageable, I wanted a small script that would allow overriding default behaviors while leaving the details and all of the properties in a separate script.  Unfortunately, I ran into some roadblocks that forced me to look for alternative approaches.

Some of these roadblocks included:

  1. T4 templates aren't allowed in the U-SQL projects.  The only file type that is allowed is a U-SQL script.
  2. Like T-SQL, there is no way to call one script from another script, creating the potential for very large scripts.
  3. With U-SQL you can use SELECT * as long as the column set is defined somewhere, but it does have to be defined.
  4. The inbound format was JSON, so all the data was coming in as a single type of file.

I would like to go through each of these and discuss what tools and strategies I used to work through them.

Solution Project Dependencies

The build order of the project was important so that the code generation would refresh the code before the tests ran.  Here are the dependencies:

CodeGenerationSample.Scripts.Tests depends on
CodeGenerationSample.Scripts which depends on
CodeGenerationSample.Json which depends on
CodeGenerationSample.CodeGen

Code Generation

This was probably the easiest roadblock to get around.  I created a separate C# project in the solution to hold the T4 templates.  To avoid compile warnings and errors, I used a file extension of genusql and then used a post-build event to copy the files over to the scripts project.

I also run TextTransform.exe in the post-build event so that the templates are transformed automatically and there is as little manual work as possible.

Post-Build

if $(ConfigurationName) == Debug (
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll" "$(ProjectDir)TypeRemover.tt"
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll" "$(ProjectDir)TypeGenerator.tt"
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll"  "$(ProjectDir)FunctionGenerator.tt"
xcopy "$(ProjectDir)TypeRemover.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\TypeRemover.usql" /Y /D
xcopy "$(ProjectDir)TypeGenerator.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\TypeGenerator.usql" /Y /D
xcopy "$(ProjectDir)FunctionGenerator.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\FunctionGenerator.usql" /Y /D
)

Making it Manageable

Take the following tree structure.

{
  "MarketLocations": [
    {
      "MarketName": "Central",
      "MarketId": 1,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Southern",
      "MarketId": 2,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          },
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Western",
      "MarketId": 3,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Plum Pie",
            "Price": 18.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Ham Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          },
          {
            "Name": "Chicken Soup",
            "Price": 2.25,
            "PreparationTimeInMinutes": 5
          }
        ],
        "NonPerishable": [
          {
            "Name": "Mayo",
            "Type": "Condiments"
          },
          {
            "Name": "Syrup",
            "Type": "Condiments"
          },
          {
            "Name": "Ginger",
            "Type": "Spices"
          }
        ]
      }
    }
  ]
}
This is where it got interesting.  I experimented with a number of techniques that would allow me to pass around table-valued parameters or complex types once they were pulled out of the JSON files.  One of the tricky things about the JSON extractor is that it extracts one level of the JSON document at a time, so you get JSON strings representing the parts of the document below the level you extracted.  I built U-SQL table types to represent those.
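
To make that behavior concrete, here is a plain-Python sketch (standing in for the U-SQL JsonExtractor, not part of the actual project) of what "one level at a time" means: the subtree below the extracted level comes through as a JSON string column, which a later query has to parse again.

```python
import json

# Trimmed-down version of the sample document below.
doc = json.loads("""{
  "MarketLocations": [
    { "MarketName": "Central", "MarketId": 1,
      "SalesCategories": { "Produce": [ { "Name": "Granny Smith" } ] } }
  ]
}""")

rows = []
for market in doc["MarketLocations"]:
    rows.append({
        "MarketId": market["MarketId"],
        "MarketName": market["MarketName"],
        # the subtree arrives as a string column, not a parsed object
        "SalesCategories": json.dumps(market["SalesCategories"]),
    })

print(type(rows[0]["SalesCategories"]))  # a JSON string, not a dict
```

Because those string columns have no schema of their own, each one needs a table type describing the columns it will eventually be parsed into, which is what gets generated next.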

I also built a multi-level JSON extractor that allows me to get data from a parent path and a child path at the same time.  I will cover it in another blog entry, but it let me put some of the pieces together without concrete connections between the levels of data in the JSON document.

So for the tree above I would create types like the following:

//This is generated code, please do not change.  

CREATE TYPE IF NOT EXISTS CategoryType AS TABLE
(
    [Produce] string,
    [BakedGoods] string,
    [RestaurantItems] string,
    MarketId string,
    MarketName string
);

CREATE TYPE IF NOT EXISTS Produce AS TABLE
(
    [_GEN__Type] string,
    [Name] string,
    [DatePicked] string,
    [ShelfLifeInDays] string,
    [Calories] string,
    [Price] string,
    MarketId string,
    MarketName string
);

CREATE TYPE IF NOT EXISTS BakedGoods AS TABLE
(
    [DateMade] string,
    [Name] string,
    [Price] string,
    MarketId string,
    MarketName string
);

CREATE TYPE IF NOT EXISTS RestaurantItems AS TABLE
(
    [Name] string,
    [Price] string,
    [PreparationTimeInMinutes] string,
    MarketId string,
    MarketName string
);

I did find that you cannot have the DROP and CREATE statements for the types in the same script.  I'm not sure why, but I worked around it by creating a script that only drops the types, which I also generate.

//This is generated code, please do not change.  
DROP TYPE IF EXISTS Produce;
DROP TYPE IF EXISTS BakedGoods;
DROP TYPE IF EXISTS RestaurantItems;
DROP TYPE IF EXISTS CategoryType;
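
The shape of the generator behind these scripts is simple.  Here is a minimal Python sketch of the idea (the real project uses T4 templates driven by sample JSON; this is only illustrative, and details like the _GEN__ renaming of reserved words are omitted): given a sample item from a category array, emit the CREATE TYPE statement with every column as string, plus the shared ancestor columns.

```python
SHARED = ["MarketId", "MarketName"]  # ancestor columns appended to every type

def create_type(name, sample):
    """Emit a U-SQL CREATE TYPE statement from a sample JSON object."""
    cols = [f"    [{key}] string," for key in sample]
    cols += [f"    {c} string," for c in SHARED]
    cols[-1] = cols[-1].rstrip(",")  # no trailing comma on the last column
    return (f"CREATE TYPE IF NOT EXISTS {name} AS TABLE\n(\n"
            + "\n".join(cols) + "\n);")

sample_baked_good = {"DateMade": "2016-11-08", "Name": "Apple Pie", "Price": 14.25}
print(create_type("BakedGoods", sample_baked_good))
```

When the JSON shape changes, you regenerate rather than hand-edit dozens of column lists, which is the whole point of the approach.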

Then come the table-valued functions, also generated.  Notice that the @table parameter is a table type that I can select from; it is a table whose rows are mostly JSON strings.

//This is generated code, please do not change.

/* ------------------------------------ Function Produce ------------------------------------- */

DROP FUNCTION IF EXISTS ProduceFunction;

CREATE FUNCTION IF NOT EXISTS ProduceFunction(@table CategoryType)
RETURNS @formatted Produce
AS BEGIN

REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(Produce) AS Result,
           MarketId,
           MarketName
    FROM @table;

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
           MarketId,
           MarketName
    FROM @staging
    CROSS APPLY
    EXPLODE(Result.Values) AS dp(Exploded);

@formatted =
    SELECT Result["Type"] AS _GEN__Type,
           Result["Name"] AS Name,
           Result["DatePicked"] AS DatePicked,
           Result["ShelfLifeInDays"] AS ShelfLifeInDays,
           Result["Calories"] AS Calories,
           Result["Price"] AS Price,
           MarketId,
           MarketName
    FROM @staging;

RETURN;
END;

/* ------------------------------------ Function BakedGoods ------------------------------------- */

DROP FUNCTION IF EXISTS BakedGoodsFunction;

CREATE FUNCTION IF NOT EXISTS BakedGoodsFunction(@table CategoryType)
RETURNS @formatted BakedGoods
AS BEGIN

REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(BakedGoods) AS Result,
           MarketId,
           MarketName
    FROM @table;

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
           MarketId,
           MarketName
    FROM @staging
    CROSS APPLY
    EXPLODE(Result.Values) AS dp(Exploded);

@formatted =
    SELECT Result["DateMade"] AS DateMade,
           Result["Name"] AS Name,
           Result["Price"] AS Price,
           MarketId,
           MarketName
    FROM @staging;

RETURN;
END;

/* ------------------------------------ Function RestaurantItems ------------------------------------- */

DROP FUNCTION IF EXISTS RestaurantItemsFunction;

CREATE FUNCTION IF NOT EXISTS RestaurantItemsFunction(@table CategoryType)
RETURNS @formatted RestaurantItems
AS BEGIN

REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(RestaurantItems) AS Result,
           MarketId,
           MarketName
    FROM @table;

@staging =
    SELECT CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
           MarketId,
           MarketName
    FROM @staging
    CROSS APPLY
    EXPLODE(Result.Values) AS dp(Exploded);

@formatted =
    SELECT Result["Name"] AS Name,
           Result["Price"] AS Price,
           Result["PreparationTimeInMinutes"] AS PreparationTimeInMinutes,
           MarketId,
           MarketName
    FROM @staging;

RETURN;
END;

Each function consists of three queries:

  1. Pull the JSON into a SqlMap from the input table @table.
  2. Explode the array and parse each JSON item in the array.
  3. Create an instance of the resulting table type.
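
Those three steps can be walked through in plain Python (standing in for JsonTuple, EXPLODE, and the final projection in the generated function; this is only an illustration of the data flow, not the real U-SQL runtime):

```python
import json

# Rows of the CategoryType table: JSON strings plus the ancestor columns.
table = [
    {"Produce": '[{"Name": "Granny Smith", "Price": "0.29"}]',
     "MarketId": "1", "MarketName": "Central"},
]

# 1. Parse the Produce column of each row (the JsonTuple step).
staging = [{"Result": json.loads(r["Produce"]),
            "MarketId": r["MarketId"], "MarketName": r["MarketName"]}
           for r in table]

# 2. Explode the array: one output row per array element (the EXPLODE step).
exploded = [{"Result": item, "MarketId": r["MarketId"], "MarketName": r["MarketName"]}
            for r in staging for item in r["Result"]]

# 3. Project the fields into the shape of the Produce table type.
formatted = [{"Name": r["Result"]["Name"], "Price": r["Result"]["Price"],
              "MarketId": r["MarketId"], "MarketName": r["MarketName"]}
             for r in exploded]

print(formatted)
# → [{'Name': 'Granny Smith', 'Price': '0.29', 'MarketId': '1', 'MarketName': 'Central'}]
```
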

Now, to use this, you create a script such as the following to get the produce.

REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@categories =
    EXTRACT Produce string,
            BakedGoods string,
            RestaurantItems string,
            MarketId string,
            MarketName string
    FROM "/Samples/Data/MarketLocations.json"
    USING new CodeGenerationSample.Json.JsonExtractor("MarketLocations[*]", "SalesCategories.Produce", "SalesCategories.BakedGoods", "SalesCategories.RestaurantItems", "MarketId", "MarketName");


@produce =
    SELECT *
    FROM ProduceFunction
         (
             @categories
         ) AS V;

OUTPUT @produce
TO "/Samples/Output/produce.csv"
USING Outputters.Csv(outputHeader : true);
       
So we first create the categories rowset that is fed into ProduceFunction in the second query.  Then we can use SELECT * from the function to avoid specifying all the column names.  This is valuable because it avoids specifying the columns over and over again; this model doesn't have many properties, but in a huge data set with many more, code generation would save a lot of work.

Sample Code

I have included sample code that provides everything you need to observe the code generation.  The generation itself really isn't the hard part; the difficulty is putting it all together.  The example code is in the following GitHub repository: https://github.com/kotvisbj/UsqlCodeGenerationSample.


About Me

I am a senior software engineer in the Midwest. I have been working in .NET for about 14 years now and love what I do. I have benefited a lot from the development community and would like to give back to help others. I work at BlueMetal and specialize in cloud development.