Data validation with JSON Schema

According to the principles of Data-Oriented Programming, we should represent data with generic and immutable data structures, like immutable hash maps and immutable vectors. At first sight, it might seem that it means to live in the wild.

In fact, it is possible – and advised – to maintain a data schema in Data-Oriented Programming.

The major difference between this kind of data validation and the way data is validated with static types is that data schema should be separated from data representation.

We should separate data schema from data representation.

The purpose of this article is to explain the rationale behind this data validation approach.

This article is made of 4 parts:

How to express a data schema using JSON Schema
How to validate data against a JSON Schema
The benefits of separating between data schema and data representation
The costs of separating between data schema and data representation

This article is an interactive version of the article published on JavaScript Works.

Freedom

JSON Schema

Think about handling a request in a library management system for the addition of an author to the system. To keep things simple, imagine that such a request contains only basic information about the author:

Their first name
Their last name
Optionally, the number of books they have written

In Data-Oriented Programming, we represent the request data in our program as a string map that is expected to have three fields:

firstName - a string
lastName - a string
books - a number (optional)

Using JSON Schema, we represent the data schema of the request with the following map:

var addAuthorRequestSchema = {
  "type": "object",
  "required": ["firstName", "lastName"], 
  "properties": {
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "books": {"type": "integer"}
  }
};

A couple of remarks regarding the syntax of this JSON Schema:

Data is expected to be a map (in JSON, a map is called an object)
Only firstName and lastName fields are required
firstName must be a string
lastName must be a string
books must be an integer (when it is provided)

Data validation against a schema

In order to check whether a piece of data conforms to a data schema, we use a data validation library for our preferred programming language.

Language	Data validation Library
JavaScript	Ajv
Clojure	jinx
Java	Snow
C#	JSON.net Schema
Python	jschon
Ruby	JSONSchemer

The complete list of data validation libraries is available here.

For instance, in JavaScript, using Ajv JSON Schema validator, we validate a piece of data using the validate function. As you might expect, when a piece of data is valid, validate returns true:

var ajv = new Ajv({allErrors: true}); 

var validAuthorData = {
  firstName: "Isaac",
  lastName: "Asimov",
  books: 500
};

ajv.validate(addAuthorRequestSchema, validAuthorData);

When a piece of data is invalid (e.g. using lastNam instead of lastName), validate returns false:

var invalidAuthorData = {
  firstName: "Isaac",
  lastNam: "Asimov",
  books: "five hundred"
};

ajv.validate(addAuthorRequestSchema, invalidAuthorData);

When a piece of data is invalid, we can easily get details about data validation failures in a human readable format:

var invalidAuthorData = {
  firstName: "Isaac",
  lastNam: "Asimov",
  books: "five hundred"
};

ajv.validate(addAuthorRequestSchema, invalidAuthorData);
ajv.errorsText(ajv.errors);

A couple of remarks regarding validation with Ajv:

By default, Ajv stores only the first data validation error. We use allErrors: true to store all errors.
Data validation errors are stored internally as an array. In order to get a human readable string, we use errorsText function.

The benefits of separating between data schema and data representation

When we separate data schema from data representation in our programs, our programs benefit from:

Freedom to choose what data should be validated
Optional fields
Advanced data validation conditions
Automatic generation of data model visualization
Automatic generation of unit tests

Benefit #1: Freedom to choose what data should be validated

When data schema is separated from data representation we are free to instantiate data without specifying its expected shape. Such a freedom is useful in various situations. For example:

We want to experiment with code quickly
Data has already been validated

Rapid prototyping

In classic Object-Oriented Programming and in some statically typed Functional Programming, each and every piece of data must have a predefined shape (either a class or a data type). During the exploration phase of coding, where we don’t know yet what is the exact shape of our data, being forced to update the type definition each time we update our data model slows us down. In Data-Oriented Programming, we can develop at a fast pace during the exploration phase, by delaying the data schema definition to a later phase.

Code refactoring

One common refactoring pattern is the split phase refactoring where you basically split a single large function into multiple smaller functions, with a private scope. Those functions are called with data that has already been validated by the large function. In Data-Oriented Programming, we are free to not specify the shape of the arguments of the inner functions, relying on the data validation that has already occurred.

Suppose we want to display some information about an author, like their full name and whether they are considered as prolific or not.

First, we define the data schema for the author data:

var authorSchema = {
  "type": "object",
  "required": ["firstName", "lastName"],
  "properties": {
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "books": {"type": "integer"}
  }
};

Then, we write a displayAuthorInfo function that first check whether data is valid and then displays the information about he author:

function displayAuthorInfo(authorData) {
  if(!ajv.validate(authorSchema, authorData)) {
    throw "displayAuthorInfo called with invalid data";
  };
  console.log("Author full name is: ", authorData.firstName + " " + authorData.lastName);
  if(authorData.books == null) {
    console.log("Author has not written any book");
  } else {
    if (authorData.books > 100) {
      console.log("Author is prolific");
    } else {
      console.log("Author is not prolific");
    }
  }
}

Notice that the first thing we do inside the body of displayAuthorInfo is to validate that the argument passed to the function is valid.

Now, let’s apply the split phase refactoring pattern to this simplistic example and split the body of displayAuthorInfo in two inner functions:

displayFullName: Display the author full name
displayProlificity: Display whether the author is prolific or not


function displayFullName(authorData) {
  console.log("Author full name is: ", authorData.firstName + " " + authorData.lastName);
}

function displayProlificity(authorData) {
  if(authorData.books == null) {
    console.log("Author has not written any book");
  } else {
    if (authorData.books > 100) {
      console.log("Author is prolific");
    } else {
      console.log("Author is not prolific");
    }
  }
}

function displayAuthorInfo(authorData) {
  if(!ajv.validate(authorSchema, authorData)) {
    throw "displayAuthorInfo called with invalid data";
  };
  displayFullName(authorData);
  displayProlificity(authorData);
}

Having the data schema separated from the data representation allows us not to specify a data schema for the arguments of the inner functions displayFullName and displayProlificity. It makes the refactoring process a bit smoother.

In some cases, the inner functions are more complicated and it makes sense to specify a data schema for their arguments. Data-Oriented Programming gives us the freedom to choose!

Benefit #2: Optional fields

In Object-Oriented Programming, allowing a class member to be optional is not easy. For instance, in Java one needs a special construct like the Optional class introduced in Java 8.

In Data-Oriented Programming, it is natural to declare a field as optional in a map. In fact in JSON Schema, by default every field is optional. In order to make a field non-optional, we have to include its name in the required array as for instance in the author schema in the following code snippet where only firstName and lastName are required while books is optional.

var authorSchema = {
  "type": "object",
  "required": ["firstName", "lastName"], // `books` is not included in `required`, as it is an optional field
  "properties": {
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "books": {"type": "integer"} // when present, `books` must be an integer
  }
};

Let’s illustrate how the validation function deals with optional fields: A map without a books field is considered to be valid:

var authorDataNoBooks = {
  "firstName": "Yehonathan",
  "lastName": "Sharvit"
};

ajv.validate(authorSchema, authorDataNoBooks)

However, a map with a books field where the value is not an interger is considered to be invalid:

var authorDataInvalidBooks = {
  "firstName": "Albert",
  "lastName": "Einstein",
  "books": "Five"
};

ajv.validate(authorSchema, authorDataInvalidBooks)

Benefit #3: Advanced data validation conditions

In Data-Oriented Programming, data validation occurs at run time. It allows us to define data validation conditions that go beyond the type of a field. For instance, we might want to make sure that a field is not only a string but a string with a maximal number of characters or a number comprised in a range of numbers.

For instance, here is a JSON Schema that expects firstName and lastName to be strings of less than 100 characters and books to be a number between 0 and 10,000:

var authorComplexSchema = {
  "type": "object",
  "required": ["firstName", "lastName"],
  "properties": {
    "firstName": {
      "type": "string",
      "maxLength": 100
    },
    "lastName": {
      "type": "string",
      "maxLength": 100
    },
    "books": {
      "type": "integer",
      "minimum": 0,
      "maximum": 10000
    }
  }
};

JSON Schema supports many other advanced data validation conditions, like regular expression validation for string fields or number fields that should be a multiple of a given number.

Benefit #4: Automatic generation of data model visualization

When the data schema is defined as data, we can leverage tools that generate data model visualization: with tools like JSON Schema Viewer and Malli we can generate a UML diagram out of a JSON Schema. For instance, the following JSON Schema defines the shape of a bookList field that is an array of books where each book is a map.

{
  "type": "object",
  "required": ["firstName", "lastName"],
  "properties": {
    "firstName": {"type": "string"},
    "lastName": {"type": "string"},
    "bookList": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "publicationYear": {"type": "integer"}
        }
      }
    }
  }
}

The tools we just mentioned can generate the following UML diagram from the JSON Schema:

author schema

Benefit #5: Automatic generation of unit tests

An interesting way to leverage data schema for function arguments is to automatically generate unit tests. Let’s change a bit the displayAuthorInfo function into a function called calculateAuthorInfo that – instead of displaying the author info – returns the author info as a string.

function calculateAuthorInfo(authorData) {
  if(!ajv.validate(authorSchema, authorData)) {
    throw "calculateAuthorInfo called with invalid data";
  };
  var fullName = "Author full name is: " + authorData.firstName + " " + authorData.lastName;
  var prolificity = "";
  if(authorData.books == null) {
    prolificity = "Author has not written any book";
  } else {
    if (authorData.books > 100) {
      prolificity = "Author is prolific";
    } else {
      prolificity = "Author is not prolific";
    }
  }
  return fullName + "\n" + prolificity;
}

We are going to generate a unit test for calculateAuthorInfo by generating random input data that conforms to authorSchema. For that purpose, we use a library like JSON Schema Faker.

JSONSchemaFaker(authorSchema)

Then we call calculateAuthorInfo with the random data:

calculateAuthorInfo(JSONSchemaFaker(authorSchema))

Depending on what the function does, we might expect different things. In the case of calculateAuthorInfo, we expect the output to be a string that starts with the word Author. Let’s create a schema for the return value of calculateAuthorInfo:

var calculateAuthorInfoReturnValueSchema = {
"type": "string",
"pattern": "Author.*"
};

Here is the code of our unit test:

var res = calculateAuthorInfo(JSONSchemaFaker(authorSchema));
ajv.validate(calculateAuthorInfoReturnValueSchema, res)

The costs of separating between data schema and data representation

There is no such thing as a free lunch. Separating between data schema and data representation comes at a cost:

Loose connection between data and its schema
Light performance hit

Cost #1: Loose connection between data and its schema

By definition, when we separate between data schema and data representation, the connection between data and its schema is looser that when we represent data with classes. Moreover, the schema definition language (e.g. JSON Schema) is not part of the programming language. It is up to the developer to decide where data validation is necessary and where it is superfluous.

As the idiom says, with great power comes great responsibility.

Cost #2: Light performance hit

As we mentioned earlier, there exist implementations of JSON Schema validation in most programming languages. When data validation occurs at run time it takes some time to run the data validation while in Object-Oriented programming, data validation occurs usually at compile time.

This drawback is mitigated by the fact that even in Object-Oriented programming some parts of the data validation occur at run time. For instance, the conversion of a request JSON payload into an object occurs at run time. Moreover, in Data-Oriented Programming, it is quite common to have some data validation parts enabled only during development and to disable them when the system runs in production.

As a consequence, the performance hit is not significant.

Wrapping up

In Data-Oriented Programming, data is represented with immutable generic data structures. When additional information about the shape of the data is required, we are free to define a data schema (e.g. in JSON Schema).

Keeping the data schema separate from the data representation leaves the developer free to decide where and when data should be validated. Moreover, data validation occurs at run-time. As a consequence, we can express data validation conditions that go beyond the static data types (e.g. the string length).

However, as the idiom says, with great power comes great responsibility. It’s up to the developer to decide what data should be validated.

Yehonathan Sharvit

Data validation with JSON Schema

JSON Schema

Data validation against a schema

The benefits of separating between data schema and data representation

Benefit #1: Freedom to choose what data should be validated

Rapid prototyping

Code refactoring

Benefit #2: Optional fields

Benefit #3: Advanced data validation conditions

Benefit #4: Automatic generation of data model visualization

Benefit #5: Automatic generation of unit tests

The costs of separating between data schema and data representation

Cost #1: Loose connection between data and its schema

Cost #2: Light performance hit

Wrapping up

Read Next

A new way of blogging about Golang

Polymorphism without objects via multimethods

JSON Schema

Data validation against a schema

The benefits of separating between data schema and data representation

Benefit #1: Freedom to choose what data should be validated

Rapid prototyping

Code refactoring

Benefit #2: Optional fields

Benefit #3: Advanced data validation conditions

Benefit #4: Automatic generation of data model visualization

Benefit #5: Automatic generation of unit tests

The costs of separating between data schema and data representation

Cost #1: Loose connection between data and its schema

Cost #2: Light performance hit

Wrapping up

Subscribe to Yehonathan Sharvit newsletter

Read Next

Tags