On Generation of Test Data, or a Comfort Zone for your Tests

Pavel Yutish

Imagine the following situation: one morning you come to the office, check the results of the nightly test run, and find a completely red report. You dig a little deeper and discover the reason: all the test data you had carefully created is gone without a trace, along with the rest of the database contents. The developers urgently needed to clear old data from the test environment, and the news somehow never reached you.

So as not to fall into despair preparing the same set of test data every month, you begin to look for ways to make your life easier. The most logical solution is to have the test data generate itself while you attend to more interesting tasks. Why not? All we need now is to make the idea work, and this article describes one way to do that.


To start

So, we have decided to implement automatic data generation for our autotests. Where do we start?  

To avoid reinventing the wheel, it’s best to choose a suitable approach and equip yourself with existing best practices. What kind of best practices? You are quite unlikely to find a “Free Super Tool” that someone has already created with you in mind and that will work like a silver bullet for your project. Instead, we turn to existing design patterns that will help us implement our data generation module correctly, so that it can then be easily integrated into the test framework. But before we discuss patterns, it is worth getting to know the basic concepts.

  • Data – based on the object-relational model, we interpret data as a collection of objects and relations between them.
  • Objects, in turn, are entities with a set of fields of different types, and those fields can carry restrictions.
  • Relations define dependencies between objects and can also impose restrictions on manipulating them (for example, a child object cannot be created without specifying its parent object).
  • Data representation is a description of the structure of objects, their fields, and relations. It can take different formats: XML, JSON, YAML, etc. We will generate our data based on this representation.

Now we are approaching the most interesting part: let’s assume we have to implement data generation in code, based on the data representation, a tree-like hierarchical structure that describes elementary objects and their interrelations. We should be able to get the entire tree and to access its individual parts later on. We will also need not only creation and deletion, but also modification of one or several objects at a time. This is where design patterns come to our aid, namely the Builder and Composite patterns.


Build and Compose

  • Builder pattern is a creational design pattern that provides an interface for creating a composite (complex) object and separates the logic of constructing an object from its representation.
  • Composite pattern is a structural design pattern that combines objects into a tree-like structure to represent a part-whole hierarchy. It allows clients to address both a single object and a group of objects via a single interface.

In a nutshell, by using the Builder pattern we implement generation, "construction", of a complex object, while the Composite pattern provides us with an interface to manipulate the elements of an object tree.


How does it work?

Suppose we have to generate the following set of objects and relations:

  • Market - the root object, from which all others are derived. There may be a lot of markets, and they never overlap. It contains such fields as market_id, name, country, is_active.
  • Company can exist in only one market, whereas a market can contain numerous companies, names of which must be unique within one market. It contains the fields company_id, market_id, name.
  • Channel can exist in only one company, whereas a company can contain numerous channels whose names must be unique. A channel can also be given the option of publishing posts both to itself and to any other channel in the same market. It contains the fields channel_id, company_id, state, is_secret, secret_phrase, can_share, share_to, main_image, name.
  • Post is published in a channel, and one channel can contain numerous posts. The post title does not have to be unique. It contains the fields post_id, channel_id, title, description, state, main_image.
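As a rough illustration, the four objects listed above could be written down as plain Python dataclasses. The field names come from the list; the field types, the default values, and the helper duplicate_company_names are assumptions made for this sketch only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Market:
    market_id: int
    name: str
    country: str
    is_active: bool


@dataclass
class Company:
    company_id: int
    market_id: int
    name: str


@dataclass
class Channel:
    channel_id: int
    company_id: int
    name: str
    state: str
    is_secret: bool = False
    secret_phrase: Optional[str] = None
    can_share: bool = False
    share_to: Optional[int] = None    # assumed: id of the target channel
    main_image: Optional[int] = None  # assumed: id of an uploaded image


@dataclass
class Post:
    post_id: int
    channel_id: int
    title: str
    description: str
    state: str
    main_image: Optional[int] = None


def duplicate_company_names(companies):
    """Return the (market_id, name) pairs that occur more than once."""
    counts = Counter((c.market_id, c.name) for c in companies)
    return sorted(key for key, n in counts.items() if n > 1)
```

A check like duplicate_company_names can be run over a prepared data set before anything is sent to the application, so that a violated uniqueness restriction shows up early rather than as a failed request.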

Below is a graph of the data model, in which vertex 0 is a market that contains two companies, 1 and 2. Company 1 contains two published channels, 10 and 4, with posts 5-9 inside them. Channel 10 has the option of publishing its posts to channel 4. Company 2 contains no channels.

[Figure: Data model graph]
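One possible way to write this graph down as a data representation is a nested Python structure. The sketch below is only an assumption about the layout: the helper names node, iter_nodes and find_node, the sample field values, and the way posts 5-9 are split between the two channels are all invented for illustration, and the sharing option of channel 10 is left out.

```python
def node(kind, node_id, children=(), **fields):
    """Build one entry of the representation tree."""
    return {"type": kind, "node": node_id, "fields": fields,
            "children": list(children)}


# Vertex numbers follow the graph; all field values are sample data.
DATA_MODEL = node("market", 0, name="Test market", country="SE", is_active=True, children=[
    node("company", 1, name="Company A", children=[
        node("channel", 10, name="News", children=[
            node("post", 5, title="Post 5"),
            node("post", 6, title="Post 6"),
            node("post", 7, title="Post 7"),
        ]),
        node("channel", 4, name="Sports", children=[
            node("post", 8, title="Post 8"),
            node("post", 9, title="Post 9"),
        ]),
    ]),
    node("company", 2, name="Company B"),
])


def iter_nodes(root):
    """Yield the root and every descendant, depth first."""
    yield root
    for child in root["children"]:
        yield from iter_nodes(child)


def find_node(root, node_id):
    """Return the subtree whose vertex number matches node_id, or None."""
    return next((n for n in iter_nodes(root) if n["node"] == node_id), None)
```

With this layout, the whole tree is available as DATA_MODEL, and an individual part such as channel 10 can be picked out with find_node(DATA_MODEL, 10).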

To generate a set of live data based on this model, we have to write a builder class that will create necessary objects starting from the root of the graph and going down to the child objects. We also have to prepare a data representation model that will be used as a basis by our builder.

[Figure: Builder class]

Thus, if we have to create, for example, a new channel in the second company, we will have something like:

builder_instance.build_channel(data_model, name='News', company_id=company_id)

Our builder_instance will then prepare a data model for the new channel: fill in all the necessary fields, upload the picture to the backend, get its id and insert it into the main_image field, indicate which company the channel belongs to, and send a request to create the entity. We do not have to worry about the implementation details; the main thing is that we get back the channel_id / channel_name of a channel generated from the data model we described.
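The steps just described might look roughly like the sketch below. It is not the actual builder from the project: FakeDataAccess, its upload_image and create methods, the image_path key, and the shape of data_model are all hypothetical, and the backend is replaced by an in-memory stand-in so that the example runs offline.

```python
import copy


class FakeDataAccess:
    """In-memory stand-in for the Data Access Library (hypothetical API)."""

    def __init__(self):
        self._last_id = 0
        self.created = []  # (entity_type, entity_id, payload) tuples

    def _new_id(self):
        self._last_id += 1
        return self._last_id

    def upload_image(self, path):
        # A real implementation would send the file to the backend.
        return self._new_id()

    def create(self, entity_type, payload):
        entity_id = self._new_id()
        self.created.append((entity_type, entity_id, payload))
        return entity_id


class Builder:
    def __init__(self, data_access):
        self.data_access = data_access

    def build_channel(self, data_model, company_id, **overrides):
        # 1. Start from the channel template and apply per-call overrides.
        payload = copy.deepcopy(data_model["channel"])
        payload.update(overrides)
        # 2. Upload the picture and store its id in main_image.
        image_path = payload.pop("image_path")
        payload["main_image"] = self.data_access.upload_image(image_path)
        # 3. Link the channel to its company.
        payload["company_id"] = company_id
        # 4. Ask the backend to create the entity.
        channel_id = self.data_access.create("channel", payload)
        return channel_id, payload["name"]


# Sample template; the field values are made up for the example.
data_model = {"channel": {"name": "Default", "state": "draft",
                          "is_secret": False, "image_path": "logo.png"}}
data_access = FakeDataAccess()
builder_instance = Builder(data_access)
channel_id, channel_name = builder_instance.build_channel(
    data_model, name="News", company_id=2)
```

The template is deep-copied before the overrides are applied, so the same data_model can be reused for any number of channels without one call leaking values into the next.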

An outline of the model assembly: Market - Company - Channel - Post within the builder class

[Figure: Model assembly]
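The top-down assembly Market - Company - Channel - Post can be sketched as a short recursive walk over the representation, where each created entity passes its new id down to its children. RecordingBackend, the build method, and the dictionary layout are assumptions made for this illustration; only the parent field names come from the field lists earlier in the article.

```python
# Field that links each entity type to its parent.
PARENT_FIELD = {"company": "market_id", "channel": "company_id",
                "post": "channel_id"}


class RecordingBackend:
    """Hypothetical stand-in for the backend: hands out ids and logs requests."""

    def __init__(self):
        self.log = []

    def create(self, entity_type, payload):
        entity_id = len(self.log) + 1
        self.log.append((entity_type, entity_id, payload))
        return entity_id


class Builder:
    def __init__(self, backend):
        self.backend = backend

    def build(self, node, parent_id=None):
        """Create the node, then its children, passing the new id down."""
        payload = dict(node["fields"])
        parent_field = PARENT_FIELD.get(node["type"])
        if parent_field is not None:
            payload[parent_field] = parent_id
        entity_id = self.backend.create(node["type"], payload)
        return {
            "type": node["type"],
            "id": entity_id,
            "children": [self.build(child, entity_id)
                         for child in node.get("children", [])],
        }


model = {"type": "market", "fields": {"name": "Test market"}, "children": [
    {"type": "company", "fields": {"name": "Company A"}, "children": [
        {"type": "channel", "fields": {"name": "News"}, "children": [
            {"type": "post", "fields": {"title": "Hello"}},
        ]},
    ]},
]}
backend = RecordingBackend()
tree = Builder(backend).build(model)
```

The returned tree mirrors the representation but carries the ids assigned by the backend, which is what later operations on individual objects need.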



Now it's time to highlight the role of the Composite and its functions.

So, we can create a new test channel with one line. But what do we do when we have to create not one but, say, forty channels and set is_published = True on all of them?

This is when we need the Composite - to avoid doing the same thing forty times. Using the model described above, let’s suppose we now have to publish all posts in a channel with a certain channel_id.


The Composite will select the channel with the specified id from the list created by the Builder and will loop through the list of posts in that channel, calling the following on each of them:


[Figure: Composite]
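A minimal sketch of this idea is shown below, under several assumptions: the class names, the update_post method, and the state value "published" are hypothetical, and the builder is replaced by a stub that only records calls. A post (leaf) and a channel (composite) expose the same publish method, so the client code does not care whether it addresses one object or a group.

```python
class StubBuilder:
    """Hypothetical builder stand-in that records calls instead of hitting a backend."""

    def __init__(self):
        self.calls = []

    def update_post(self, post_id, **fields):
        self.calls.append((post_id, fields))


class Post:
    """Leaf: a single post."""

    def __init__(self, builder, post_id):
        self.builder = builder
        self.post_id = post_id

    def publish(self):
        self.builder.update_post(self.post_id, state="published")


class Channel:
    """Composite: a channel and the posts inside it."""

    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.children = []

    def add(self, child):
        self.children.append(child)

    def publish(self):
        # Same call as on a leaf; the composite forwards it to every child.
        for child in self.children:
            child.publish()


class DataTree:
    """Entry point for client code: finds a channel and delegates to it."""

    def __init__(self, channels):
        self.channels = list(channels)

    def publish_posts(self, channel_id):
        channel = next(c for c in self.channels if c.channel_id == channel_id)
        channel.publish()


builder = StubBuilder()
news, sports = Channel(10), Channel(4)
for post_id in (5, 6, 7):
    news.add(Post(builder, post_id))
for post_id in (8, 9):
    sports.add(Post(builder, post_id))

# One call from the client publishes every post in channel 10.
DataTree([news, sports]).publish_posts(10)
```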


Thus, we avoid duplication, and in this case the Composite acts as an intermediary between the client code and the Builder, independently calling the necessary methods of the builder object depending on our needs.

[Figure: Method calls]


Some of you may find it excessive to add one more level of abstraction with the Composite, as all the logic can be implemented directly in the Builder. However, if you are planning to use the Builder as a reusable component later on, for example, to write API or DB tests on its basis, the absence of an additional layer will create unnecessary obstacles in the future.

The diagrams above contain one more component, the Data Access Library. This is the layer directly responsible for communication with the application: creating and saving data, for example through a web API or directly in MS SQL. In this particular case, data generation is implemented through the web API of the application. I used the Python Requests library, a handy module built on urllib3 that lets you build a request and get the response in just a couple of lines. The Requests documentation describes what else it can do.

You may also need to generate random field values in a convenient way. Here the Faker library comes in handy. Versions of it exist for several languages, including PHP, Python, and Java. Faker can generate random first names, last names, addresses, emails, phone numbers, individual words, etc. The documentation of the Python package and of the Java port covers the details.
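As a rough sketch of such a layer, the class below wraps an HTTP session and exposes a single create call. The endpoint layout, the base URL, and the assumption that the API answers with a JSON body containing an id are all invented for the example. To keep it runnable without a network or extra packages, a small fake session stands in for requests.Session, and random names come from the standard library; in a real project Faker could supply more realistic values.

```python
import uuid


class DataAccess:
    """Thin wrapper around an HTTP session; the endpoint layout is hypothetical."""

    def __init__(self, session, base_url):
        self.session = session            # e.g. requests.Session() in real use
        self.base_url = base_url.rstrip("/")

    def create(self, resource, payload):
        response = self.session.post(f"{self.base_url}/{resource}", json=payload)
        response.raise_for_status()       # fail loudly on 4xx / 5xx answers
        return response.json()["id"]      # assumes the API answers {"id": ...}


def random_name(prefix):
    """Unique-enough name built from a prefix and eight random hex digits."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


class _FakeResponse:
    def __init__(self, body):
        self._body = body

    def raise_for_status(self):
        pass

    def json(self):
        return self._body


class FakeSession:
    """Offline stand-in exposing the part of requests.Session used above."""

    def __init__(self):
        self.sent = []

    def post(self, url, json=None):
        self.sent.append((url, json))
        return _FakeResponse({"id": len(self.sent)})


session = FakeSession()
api = DataAccess(session, "https://test-env.example/api/")
company_id = api.create("companies",
                        {"market_id": 1, "name": random_name("company")})
```

Because the session is passed in rather than created inside DataAccess, the same class can be exercised offline with the fake and pointed at the real application by handing it a real session.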


Advantages and Disadvantages of Approach


Advantages:

  • Independence of tests
  • Time saving
  • Easy support
  • Re-usability

Disadvantages:

  • How easy the support is depends greatly on the implementation



Of course, the described approach cannot claim to be the only true and universal one - it all depends on the specifics of the project and your needs. However, I believe you can use this model as a basis if you are pondering how you can optimize the process of preparing your tests and how you can make the tests more independent of external factors.

This way, if you invest time in design and implementation once, you will spare yourself unnecessary costs in the future, reducing all further work to small corrections and, if necessary, extending the set of data models.
