Friday, January 6, 2017

Azure Data Lake Analytics U-SQL Extracting From JSON At Multiple Levels


JSON documents are basically upside down trees.  Working with structures like this can present challenges in any programming language.  Microsoft Azure has a language called U-SQL that is used
for processing set based data in Data Lake Analytics.  Working with trees in set based languages like T-SQL and U-SQL can be especially challenging as concepts like recursion and looping aren't part of the set based methodology.  Working with U-SQL when accessing JSON files presented similar problems.  Originally, there was a JsonExtractor in the U-SQL samples on GitHub which allows you to start at one level of a JSON document and work downward into the tree from there.  You can define the level you want to start at but if you need to higher up the tree you would have to pull those out in separate queries which, if your document has a lot of levels, can be a bit exhaustive.  In an effort to reduce the number of queries I wanted to be able to grab data at multiple levels, different levels from the point at which I want to iterate from.  When thinking about U-SQL you would work with the data in tables as you query and process it so when working with JSON and flattening it out you will normally be working through an array and turning the objects in that array into a table row.

As we look at the JSON document below you have the representation of the items that are for sale in a region of food markets.  There is an array of locations and four sales categories; produce, baked goods, restaurant items, and non perishable.  Those each have an array inside of them.  Now suppose I want to see all of the produce items with the MarketId and the MarketName in a set of rows in a table.  I would need to start at the produce level of the document and get everything within it and then also get the MarketId and MarketName from the grand parent (parent's parent) of Produce.

To support this, I added the MultiLevelJsonExtractor.  This is now part of the GitHub examples for U-SQL located here.  This extractor inherits from the normal JsonExtractor that was originally provided but also will look up at the ancestors in the tree for pre-defined properties.

The main difference in the code is that you need to provide the properties you are looking for and what level they are at as you would with the normal extractor.  In order to extract the query I describe above, I would need to write the following U-SQL:


REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];

@produce =
    EXTRACT Name string,
            DatePicked DateTime,
            ShelfLifeInDays int,
            Calories int,
            Price decimal,
            MarketId string,
            MarketName string
    FROM "/Samples/Data/MarketLocations.json"
    USING new CodeGenerationSample.Json.MultiLevelJsonExtractor("MarketLocations[*].SalesCategories.Produce[*]",
          false,
          "Name",
          "DatePicked",
          "ShelfLifeInDays",
          "Calories",
          "Price",
          "MarketId",
          "MarketName");


OUTPUT @produce
TO "/Samples/Output/produce.csv"
USING Outputters.Csv(outputHeader : true);
       
What you get here is it will start from the Produce array and work downward from there. The MultiLevelJsonExtractor has three arguments.  First is the JSON path to iterate through, second is a boolean if you want to bypass an exception thrown while looking for a specified property, third is a set of string params that will be the properties it will try to locate from the point in the document you specified in the first argument.  It will locate the properties within the objects in those arrays first for each of the properties in the third argument of the extractor.  If the property isn't located at that level it will recursively work up the tree to see if it can be located in the ancestors of the iteration point.

NOTE:
If you have the same property name in the iteration point as the parent you won't be able to locate it.  I haven't found a good way to achieve this yet.


{
  "Region": "Main",
  "MarketLocations": [
    {
      "MarketName": "Central",
      "MarketId": 1,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Southern",
      "MarketId": 2,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Apple",
            "Name": "Granny Smith",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 24,
            "Calories": 45,
            "Price": 0.29
          },
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Apple Pie",
            "Price": 14.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Turkey Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          }
        ],
        "NonPerishable": [
          {
            "Name": "Honey Mustard",
            "Type": "Condiments"
          }
        ]
      }
    },
    {
      "MarketName": "Western",
      "MarketId": 3,
      "SalesCategories": {
        "Produce": [
          {
            "Type": "Plums",
            "Name": "Red Plums",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 12,
            "Calories": 21,
            "Price": 0.33
          },
          {
            "Type": "Pears",
            "Name": "Golden Nature",
            "DatePicked": "2016-11-08T14:14:33.712Z",
            "ShelfLifeInDays": 20,
            "Calories": 40,
            "Price": 0.45
          }
        ],
        "BakedGoods": [
          {
            "DateMade": "2016-11-08T14:14:33.712Z",
            "Name": "Plum Pie",
            "Price": 18.25
          }
        ],
        "RestaurantItems": [
          {
            "Name": "Ham Sandwich",
            "Price": 4.85,
            "PreparationTimeInMinutes": 20
          },
          {
            "Name": "Chicken Soup",
            "Price": 2.25,
            "PreparationTimeInMinutes": 5
          }
        ],
        "NonPerishable": [
          {
            "Name": "Mayo",
            "Type": "Condiments"
          },
          {
            "Name": "Syrup",
            "Type": "Condiments"
          },
          {
            "Name": "Ginger",
            "Type": "Spices"
          }
        ]
      }
    }
  ]
}

No comments:

Post a Comment

About Me

My photo
I am a senior software engineer in the Midwest. I have been working in .NET for about 14 years now and love what I do. I have benefited a lot from the development community and would like to give back to help others. I work at BlueMetal and specialize in cloud development.