Early in my career, one of my mentors talked about the power of code generation. At the time he was talking about it in the context of ASP.NET and PHP, but code generation has proved just as useful today as it was back then.
When working on large code bases, simple changes can be daunting even if the code is well designed. When dealing with data that has hundreds or thousands of properties, I have found code generation to be the most reliable way to manage and tolerate changes in the structure of the code.
New Language, Same Problems
Recently, I was involved in a project to build some analytics with Azure Data Lake Analytics, which uses U-SQL. U-SQL is similar to T-SQL and is used for Big Data analytics, providing set-based data processing. There are nice tools for processing data in C# and then using the U-SQL language to define what you want to output to your destination formats. In an effort to make the code manageable, I wanted a small script that would allow overriding default behaviors but leave the details and all of the properties in a separate script. Unfortunately, I ran into some road blocks that forced me to look at new ways of doing things. Some of these road blocks included:
- T4 templates aren't allowed in the U-SQL projects. The only file type that is allowed is a U-SQL script.
- Like T-SQL, there is no way to call one script from another script, creating the potential for very large scripts.
- With U-SQL you can use SELECT * only if the column set has been defined somewhere, so it still has to be defined.
- The inbound format was JSON, so all the data was coming in as a single type of file.
I would like to go through each of these and discuss what tools and strategies I used to work through them.
Solution Project Dependencies
The build order of the projects was important so that the code generation would refresh the code before the tests ran. Here are the dependencies:
- CodeGenerationSample.Scripts, which depends on
- CodeGenerationSample.Json, which depends on
- CodeGenerationSample.CodeGen
Code Generation
This was probably the easiest road block to get around. I created a separate C# project in the solution to hold the T4 templates. To avoid compile warnings and errors, I used a file extension of .genusql and then used a post-build event to copy the files over to the scripts project.
I also ran TextTransform.exe in the post-build event to perform the transformations, so there is as little manual work as possible.
Post-Build
if $(ConfigurationName) == Debug (
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll" "$(ProjectDir)TypeRemover.tt"
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll" "$(ProjectDir)TypeGenerator.tt"
"C:\Program Files (x86)\Common Files\Microsoft Shared\TextTemplating\14.0\TextTransform.exe" -r "$(TargetDir)Newtonsoft.Json.dll" "$(ProjectDir)FunctionGenerator.tt"
xcopy "$(ProjectDir)TypeRemover.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\TypeRemover.usql" /Y /D
xcopy "$(ProjectDir)TypeGenerator.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\TypeGenerator.usql" /Y /D
xcopy "$(ProjectDir)FunctionGenerator.genusql" "$(SolutionDir)CodeGenerationSample.Scripts\FunctionGenerator.usql" /Y /D
)
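The actual templates are in the sample project, but to give an idea of the shape of the approach, here is a minimal sketch of what a template along the lines of TypeGenerator.tt could look like. Everything in it is illustrative: it assumes a hypothetical SampleMarketLocations.json file sitting next to the template for discovering the categories and their properties, it only emits the per-category types, and it relies on Newtonsoft.Json being supplied to the transform (which the -r switch above takes care of).
<#@ template debug="false" hostspecific="true" language="C#" #>
<#@ import namespace="System.IO" #>
<#@ import namespace="Newtonsoft.Json.Linq" #>
<#@ output extension=".genusql" #>
//This is generated code, please do not change.
<#
    // Hypothetical sample document used to discover the categories and their properties.
    var json = JObject.Parse(File.ReadAllText(Host.ResolvePath("SampleMarketLocations.json")));
    var categories = (JObject)json["MarketLocations"][0]["SalesCategories"];
    foreach (var category in categories.Properties())
    {
        // Use the first item in each category array as the shape of the type.
        var item = (JObject)((JArray)category.Value)[0];
#>
CREATE TYPE IF NOT EXISTS <#= category.Name #> AS TABLE
(
<#
        foreach (var property in item.Properties())
        {
            // "Type" collides with a keyword, so it gets a generated prefix.
            var column = property.Name == "Type" ? "_GEN__Type" : property.Name;
#>
    [<#= column #>] string,
<#
        }
#>
    MarketId string,
    MarketName string
);
<#
    }
#>
However the real templates discover the schema, the point is the same: a representative document drives the emission of U-SQL that would otherwise have to be maintained by hand.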
Making it Manageable
Take the following tree structure.
{
"MarketLocations": [
{
"MarketName": "Central",
"MarketId": 1,
"SalesCategories": {
"Produce": [
{
"Type": "Apple",
"Name": "Granny Smith",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 24,
"Calories": 45,
"Price": 0.29
}
],
"BakedGoods": [
{
"DateMade": "2016-11-08T14:14:33.712Z",
"Name": "Apple Pie",
"Price": 14.25
}
],
"RestaurantItems": [
{
"Name": "Turkey Sandwich",
"Price": 4.85,
"PreparationTimeInMinutes": 20
}
],
"NonPerishable": [
{
"Name": "Honey Mustard",
"Type": "Condiments"
}
]
}
},
{
"MarketName": "Southern",
"MarketId": 2,
"SalesCategories": {
"Produce": [
{
"Type": "Apple",
"Name": "Granny Smith",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 24,
"Calories": 45,
"Price": 0.29
},
{
"Type": "Plums",
"Name": "Red Plums",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 12,
"Calories": 21,
"Price": 0.33
},
{
"Type": "Pears",
"Name": "Golden Nature",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 20,
"Calories": 40,
"Price": 0.45
}
],
"BakedGoods": [
{
"DateMade": "2016-11-08T14:14:33.712Z",
"Name": "Apple Pie",
"Price": 14.25
}
],
"RestaurantItems": [
{
"Name": "Turkey Sandwich",
"Price": 4.85,
"PreparationTimeInMinutes": 20
}
],
"NonPerishable": [
{
"Name": "Honey Mustard",
"Type": "Condiments"
}
]
}
},
{
"MarketName": "Western",
"MarketId": 3,
"SalesCategories": {
"Produce": [
{
"Type": "Plums",
"Name": "Red Plums",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 12,
"Calories": 21,
"Price": 0.33
},
{
"Type": "Pears",
"Name": "Golden Nature",
"DatePicked": "2016-11-08T14:14:33.712Z",
"ShelfLifeInDays": 20,
"Calories": 40,
"Price": 0.45
}
],
"BakedGoods": [
{
"DateMade": "2016-11-08T14:14:33.712Z",
"Name": "Plum Pie",
"Price": 18.25
}
],
"RestaurantItems": [
{
"Name": "Ham Sandwich",
"Price": 4.85,
"PreparationTimeInMinutes": 20
},
{
"Name": "Chicken Soup",
"Price": 2.25,
"PreparationTimeInMinutes": 5
}
],
"NonPerishable": [
{
"Name": "Mayo",
"Type": "Condiments"
},
{
"Name": "Syrup",
"Type": "Condiments"
},
{
"Name": "Ginger",
"Type": "Spices"
}
]
}
}
]
}
This is where it got interesting. I experimented with a number of techniques for passing around the table valued parameters, or complex types, once they were pulled out of the JSON files. One of the tricky things about the JSON extractor is that it extracts one level of the JSON document at a time, so you get JSON strings representing the parts of the document that you want to work with. So I would build U-SQL table types to represent those.
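To make that one-level-at-a-time behavior concrete, here is a conceptual C# sketch of what a JsonTuple-style helper does with a single JSON string. It is not the implementation of CodeGenerationSample.Json.JsonFunctions.JsonTuple used in the scripts below (the real helper returns a U-SQL SqlMap of strings); a Dictionary is used here only to keep the sketch self-contained.
using System.Collections.Generic;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonTupleSketch
{
    // Flattens one level of a JSON value into name/value pairs.
    // Nested objects and arrays stay as raw JSON strings so that a later
    // query can parse the next level down.
    public static Dictionary<string, string> JsonTuple(string json)
    {
        var result = new Dictionary<string, string>();
        var token = JToken.Parse(json);

        var obj = token as JObject;
        if (obj != null)
        {
            foreach (var property in obj.Properties())
            {
                result[property.Name] = property.Value.Type == JTokenType.String
                    ? (string)property.Value
                    : property.Value.ToString(Formatting.None);
            }
            return result;
        }

        var array = token as JArray;
        if (array != null)
        {
            // Array items come back keyed by index so they can be exploded into rows.
            for (var i = 0; i < array.Count; i++)
            {
                result[i.ToString()] = array[i].ToString(Formatting.None);
            }
        }

        return result;
    }
}
With that behavior in mind, the two JsonTuple calls in the generated functions below read naturally: the first call turns a category column (a JSON array string) into index/item pairs, EXPLODE flattens those items into rows, and the second call parses each item into its named properties.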
I also built a multi-level JSON extractor that would allow me to get something from a parent path and a child path. I will create another blog entry for that at another time but it allowed me to put some of the parts together without concrete connections between the levels of data in the JSON document.
So for the tree above I would create types like the following:
//This is generated code, please do not change.
CREATE TYPE IF NOT EXISTS CategoryType AS TABLE
(
[Produce] string,
[BakedGoods] string,
[RestaurantItems] string,
MarketId string,
MarketName string
);
CREATE TYPE IF NOT EXISTS Produce AS TABLE
(
[_GEN__Type] string,
[Name] string,
[DatePicked] string,
[ShelfLifeInDays] string,
[Calories] string,
[Price] string,
MarketId string,
MarketName string
);
CREATE TYPE IF NOT EXISTS BakedGoods AS TABLE
(
[DateMade] string,
[Name] string,
[Price] string,
MarketId string,
MarketName string
);
CREATE TYPE IF NOT EXISTS RestaurantItems AS TABLE
(
[Name] string,
[Price] string,
[PreparationTimeInMinutes] string,
MarketId string,
MarketName string
);
I did find that you cannot have the DROP and CREATE statements for the types in the same script. I'm not sure why, but I worked around it by creating a script only for dropping the types, which is also generated.
//This is generated code, please do not change.
DROP TYPE IF EXISTS Produce;
DROP TYPE IF EXISTS BakedGoods;
DROP TYPE IF EXISTS RestaurantItems;
DROP TYPE IF EXISTS CategoryType;
Then come the table valued functions. See how the @table parameter is a table type that I can select from? It is a table containing mostly rows of JSON strings.
//This is generated code, please do not change.
/* ------------------------------------ Function Produce ------------------------------------- */
DROP FUNCTION IF EXISTS ProduceFunction;
CREATE FUNCTION IF NOT EXISTS ProduceFunction(@table CategoryType)
RETURNS @formatted Produce
AS BEGIN
REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(Produce) AS Result,
MarketId,
MarketName
FROM @table;
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
MarketId,
MarketName
FROM @staging
CROSS APPLY
EXPLODE(Result.Values) AS dp(Exploded);
@formatted =
SELECT
Result["Type"] AS _GEN__Type,
Result["Name"] AS Name,
Result["DatePicked"] AS DatePicked,
Result["ShelfLifeInDays"] AS ShelfLifeInDays,
Result["Calories"] AS Calories,
Result["Price"] AS Price,
MarketId,
MarketName
FROM @staging;
RETURN;
END;
/* ------------------------------------ Function BakedGoods ------------------------------------- */
DROP FUNCTION IF EXISTS BakedGoodsFunction;
CREATE FUNCTION IF NOT EXISTS BakedGoodsFunction(@table CategoryType)
RETURNS @formatted BakedGoods
AS BEGIN
REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(BakedGoods) AS Result,
MarketId,
MarketName
FROM @table;
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
MarketId,
MarketName
FROM @staging
CROSS APPLY
EXPLODE(Result.Values) AS dp(Exploded);
@formatted =
SELECT
Result["DateMade"] AS DateMade,
Result["Name"] AS Name,
Result["Price"] AS Price,
MarketId,
MarketName
FROM @staging;
RETURN;
END;
/* ------------------------------------ Function RestaurantItems ------------------------------------- */
DROP FUNCTION IF EXISTS RestaurantItemsFunction;
CREATE FUNCTION IF NOT EXISTS RestaurantItemsFunction(@table CategoryType)
RETURNS @formatted RestaurantItems
AS BEGIN
REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(RestaurantItems) AS Result,
MarketId,
MarketName
FROM @table;
@staging =
SELECT
CodeGenerationSample.Json.JsonFunctions.JsonTuple(Exploded) AS Result,
MarketId,
MarketName
FROM @staging
CROSS APPLY
EXPLODE(Result.Values) AS dp(Exploded);
@formatted =
SELECT
Result["Name"] AS Name,
Result["Price"] AS Price,
Result["PreparationTimeInMinutes"] AS PreparationTimeInMinutes,
MarketId,
MarketName
FROM @staging;
RETURN;
END;
Now what we have are three queries:
- Pull the JSON into a SqlMap from the original @table parameter.
- Explode the array and parse each JSON item in the array.
- Create an instance of the resulting table type.
Now, to use this, you create a script such as the following to get the produce.
REFERENCE ASSEMBLY master.[CodeGenerationSample.Json];
REFERENCE ASSEMBLY master.[Newtonsoft.Json];
@categories =
EXTRACT Produce string,
BakedGoods string,
RestaurantItems string,
MarketId string,
MarketName string
FROM "/Samples/Data/MarketLocations.json"
USING new CodeGenerationSample.Json.JsonExtractor("MarketLocations[*]", "SalesCategories.Produce", "SalesCategories.BakedGoods", "SalesCategories.RestaurantItems", "MarketId", "MarketName");
@produce =
SELECT *
FROM ProduceFunction
(
@categories
) AS V;
OUTPUT @produce
TO "/Samples/Output/produce.csv"
USING Outputters.Csv(outputHeader : true);
So we first need to create the categories rowset that is fed into ProduceFunction in the second query. Then we can use SELECT * from the function to avoid specifying all the column names. This is valuable because it avoids listing the columns over and over again; although this model doesn't have a ton of properties, one could envision a huge data set that did, and there code generation would save us a lot of work.
Sample Code
I have included sample code that provides everything you need to observe the code generation. That really isn't the hard part here; the difficulty is putting it all together. The example code is in the following GitHub repository: https://github.com/kotvisbj/UsqlCodeGenerationSample.