Testing applications around big data – why you should plan ahead

To misquote Douglas Adams’ Hitchhiker’s Guide to the Galaxy, big data is big. Really big. You might think you have lots of data, but that’s just peanuts to big data. To quote another Doug – Doug Laney of Gartner – big data involves having so much data that it becomes hard to manage via traditional methods due to the variety, volume, and velocity of data being created. Since this second definition was created in 2001, data technologies have evolved rapidly alongside the volumes of data that companies need to handle.

Big data was once the domain of only the largest enterprises and web-scale organisations, but it has now become more popular and useful for businesses of all sizes. This has led to more companies building applications and services that make use of big data, and a need for companies to adequately test these setups. So how can you approach this testing requirement correctly?

Planning your big data approach

It’s essential to think bigger than just implementing a big data cluster. New technologies for handling unstructured data are worth considering, but it’s equally necessary to look at them in context. This means that – from a testing perspective – you will have to think beyond the performance of individual elements. Instead, testing should start from the requirements that led your organisation to implement a service requiring big data in the first place.

Alongside this, it’s equally important to consider how you will support your data over time before you take the plunge into implementing any new technology. This involves working through several stages.

The first question should be whether your data will be structured, unstructured, or both. Many big data implementations start off by collecting large volumes of unstructured data. In practice, this often equates to deploying Apache Hadoop as a way to store that huge volume of data over time. However, while Hadoop is popular for reducing the cost of storing data, it has historically been difficult to get value out of the data after it is stored.

Consequently, there has been a swing toward making more use of structured data and combinations of structured and unstructured data. This approach requires advance planning in order to define data schemas and determine how this information will be captured; however, it offers you an opportunity to clarify how and why this data will be used in the future.

Relational data management

Defining the use case for data in this instance involves looking at how much data will be coming in, how it will be organised, and what relationships will exist between schemas. This can determine whether you should be using a relational or a non-relational approach to storing data.

Relational data management approaches normally involve using relational databases such as MySQL or PostgreSQL. These databases are best suited to managing data in a standard table format based on rows and columns, and support standard query languages such as SQL. MongoDB, by contrast, is a non-relational solution. It is a document store – where a document is a JSON-style record – and benefits from features such as sharding, which is used to scale out both data size and write volume when handling large amounts of data.
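
To make the contrast concrete, the sketch below stores the same hypothetical order record both ways: as a relational row (Python's built-in sqlite3 stands in for MySQL or PostgreSQL here) and as a JSON document of the kind MongoDB would hold. The table, field names, and values are all illustrative.

```python
import json
import sqlite3

# Relational: rows and columns with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'acme', 99.50)")
row = conn.execute("SELECT customer, total FROM orders WHERE id = 1").fetchone()

# Document store: the same record as a self-describing JSON document,
# including nested data that would need an ALTER TABLE in the relational model.
document = json.dumps({
    "_id": 1,
    "customer": "acme",
    "total": 99.50,
    "items": [{"sku": "A-1", "qty": 2}],  # nested structure, no schema change
})

print(row)                               # ('acme', 99.5)
print(json.loads(document)["customer"])  # acme
```

The trade-off shows up immediately: the relational row enforces its schema up front, while the document absorbs new nested fields without one.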

An efficient model that’s commonly used in big data implementations is for data to initially be published to an eventing pipeline such as Apache Kafka, with different components in the pipeline determining, based on data context, where the data should go. In most cases, all data ends up in Apache Hadoop, but along the way, data that is likely to be queried is routed into Online Transactional Processing (OLTP) or Online Analytical Processing (OLAP) databases.
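
The routing decision described above can be sketched as a small content-based router. This is illustrative pseudologic rather than Kafka code: the function name, event fields, and destination labels are invented, and in a real pipeline the decisions would live in Kafka consumers or stream-processing components.

```python
def route_event(event: dict) -> list:
    """Return the destinations an event should be copied to."""
    destinations = ["hadoop"]          # everything is archived in Hadoop
    if event.get("kind") == "transaction":
        destinations.append("oltp")    # likely to be queried individually
    elif event.get("kind") == "metric":
        destinations.append("olap")    # aggregated, analytical queries
    return destinations

print(route_event({"kind": "transaction", "amount": 10}))  # ['hadoop', 'oltp']
print(route_event({"kind": "metric", "cpu": 0.7}))         # ['hadoop', 'olap']
print(route_event({"kind": "debug"}))                      # ['hadoop']
```

Note that every branch keeps the Hadoop copy, matching the "all data ends up in Hadoop" pattern above.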

A basic example of this is queryable logs for audit reporting. Logs are generated via a syslog endpoint; rsyslog is used to write them to an Apache Flume endpoint. Flume uses a splitting sink: log data that needs to be immediately searchable is sent to ElasticSearch, while everything goes on to Apache Kafka. The side of the sink dealing with immediately searchable log data leads to the ElasticSearchLogStashSerializer (ElasticSearch uses date-based purging to keep its indices truncated).

The other side of the sink leads to Apache Kafka, and because the logs are tokenizable they can easily be used for event triggers; the pipeline terminates at a Hadoop cluster. This is a common setup that incorporates ELK and big data.
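
A toy version of that splitting-sink behaviour is sketched below: searchable logs are copied to an ElasticSearch-bound channel, and every log flows on to Kafka regardless. The function name, the `searchable` flag, and the in-memory lists standing in for the two channels are all illustrative, not Flume APIs.

```python
def split_log(log: dict, to_elasticsearch: list, to_kafka: list) -> None:
    """Copy searchable logs to the ES channel; all logs continue to Kafka."""
    if log.get("searchable"):
        to_elasticsearch.append(log)   # immediately queryable side
    to_kafka.append(log)               # always flows on toward Hadoop

es, kafka = [], []
split_log({"msg": "login failed", "searchable": True}, es, kafka)
split_log({"msg": "heartbeat", "searchable": False}, es, kafka)
print(len(es), len(kafka))  # 1 2
```

The key property to test in the real setup is the same one asserted here: splitting never drops a log from the Kafka side.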

Designing for big data instances

Alongside the technology choices, there is another set of criteria by which big data implementations will be judged – the performance boundaries of the application itself. These are defined by the components of the application, from the code through to the data access layer, on to the data management and storage elements, and the requirements of the application itself.

Big data applications fall into one of two categories – operational and analytical. Operational applications cover those systems that have to handle data at the moment it is created. Examples of this include an eCommerce site recommending another product to a shopper while they are browsing the site, or a fraud detection algorithm using data to spot potentially bad transactions. These actions must occur while a transaction is taking place.

Conversely, analytical applications use data to provide in-depth insight into data sets and to find patterns in behaviour. This insight is normally created by linking different sets of data from multiple databases or data sources, and then using this overall set of information for analysis. This takes place on a scheduled basis, and unlike an operational transaction, real-time results are not as important. While an operational application may have to return results in less than a second, an analytical application can take five or ten seconds to crunch the data and provide results.

By understanding the performance requirement, you can understand the expected behaviour. To improve performance, it can be worth looking at the technologies involved, from caching data through to making use of an in-memory database like Redis to store information for faster use. This can also help your big data implementation scale up over time, as better performance allows you to handle more queries in the same timeframe.
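
The caching idea above is usually implemented as a cache-aside pattern. In the sketch below, a plain dict with a TTL stands in for Redis so the example is self-contained; with redis-py you would replace the dict with `GET`/`SETEX` calls against a Redis server. The function names, key format, and 60-second TTL are illustrative assumptions.

```python
import time

cache = {}            # key -> (timestamp, value); stand-in for Redis
TTL_SECONDS = 60.0    # illustrative expiry, as SETEX would enforce

def slow_db_lookup(user_id: str) -> str:
    # Stand-in for an expensive query against the primary database.
    return f"profile-for-{user_id}"

def get_profile(user_id: str) -> str:
    now = time.monotonic()
    hit = cache.get(user_id)
    if hit is not None and now - hit[0] < TTL_SECONDS:
        return hit[1]                    # served from the cache
    value = slow_db_lookup(user_id)      # cache miss: hit the database
    cache[user_id] = (now, value)
    return value

print(get_profile("42"))  # first call misses and fills the cache
print(get_profile("42"))  # second call is served from memory
```

The scaling benefit comes from the second call: repeated queries in the same timeframe no longer touch the database at all.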

Time for testing

Once the preparation work has been carried out, it is time for you to implement a proof of concept (POC) project to demonstrate the model and how it meets requirements. Building this infrastructure is essential, as testing a big data implementation involves looking at a mix of different moving parts from software and code through to the database and data management side. These different elements can all affect each other, so it is impossible to look at them out of context. Equally, the POC deployment can demonstrate where an assumption for the requirements is inaccurate.

This upfront work is essential to ensure big data deployments are successful, as it is hard to test individual elements on their own and get an accurate read on how well these components will perform in production. The sheer volume of data that can be involved in big data deployments can make testing hard, while the use of multiple tools and technologies to integrate databases and data analytics tools together further complicates the scenario.

It is therefore crucial to spend time on requirements-gathering and turn these into functional end-to-end test scenarios. It is only by implementing end-to-end testing approaches that you can check an application meets its requirements.
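
An end-to-end scenario of this kind can be expressed directly as a test: feed events in at the front of the pipeline and assert on what each downstream store received. The sketch below simulates the pipeline with in-memory lists standing in for the OLTP database and the Hadoop archive; all names are hypothetical stand-ins, not a real test harness.

```python
def run_pipeline(events, oltp, archive):
    """Simulated pipeline: archive everything, copy transactions to OLTP."""
    for event in events:
        archive.append(event)                  # everything is archived
        if event.get("kind") == "transaction":
            oltp.append(event)                 # queryable copy

def test_transactions_reach_both_stores():
    oltp, archive = [], []
    run_pipeline(
        [{"kind": "transaction", "id": 1}, {"kind": "metric", "cpu": 0.5}],
        oltp, archive,
    )
    assert len(archive) == 2                   # nothing lost end to end
    assert [e["id"] for e in oltp] == [1]      # the right data is queryable

test_transactions_reach_both_stores()
print("end-to-end scenario passed")
```

The point is that the assertion spans the whole path – input at one end, contents of every store at the other – rather than exercising a single component in isolation.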

To make this work in practice, you must consider your approach to observability and how you will scale up over time. Taking an approach like canary deployments – where a smaller subset of customers and/or data is moved into production to check that performance and results come through as expected before a full production roll-out – can help massively here, as this should prove that the environment is working as expected. Similarly, taking data from across the application can provide more granularity on service performance.
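
One common way to pick the canary subset is deterministic hashing: a fixed percentage of customers goes to the new deployment, and the same customer always gets the same answer, so results can be compared run over run. The 5% threshold and function name below are illustrative assumptions.

```python
import hashlib

CANARY_PERCENT = 5  # illustrative: 5% of customers see the new deployment

def in_canary(customer_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically bucket a customer into [0, 100) and compare."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# Roughly 5% of a synthetic population ends up in the canary group.
routed = sum(in_canary(f"customer-{i}") for i in range(10_000))
print(f"{routed} of 10000 customers routed to the canary")
```

Because the hash is stable, a customer never flips between old and new environments mid-experiment, which keeps the performance comparison clean.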

Observability is particularly important for big data applications, as these systems tend to be distributed rather than held in one place. Implementing a big data system in one data centre is possible, but many developers manage distributed systems that either run across multiple locations or across a mix of cloud services. From a toolset perspective, it is worth looking at DevOps tools that cover applications as well as storage or database instances. DevOps teams have a plethora of options for observability – from Java-focused tools like New Relic through to log management options like Splunk and Sumo Logic. On the database side, there are fewer options available, but Percona Monitoring and Management can monitor many open source databases.

Performance should also not be the only goal. It may seem strange to think that the goal for big data services shouldn’t solely be the volume of data handled, but the consistency of performance is equally important. Being able to deliver consistency at scale is a critical benchmark for big data, as it demonstrates that you can scale up in a predictable way across all the characteristics of a given workload.
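
A simple way to quantify consistency is to compare a high percentile against the median: two services can handle the same volume while one has a much wider latency spread. The sketch below does this with invented sample latencies and a hypothetical `p95` helper.

```python
import statistics

# Invented per-query latencies (ms) for two hypothetical services
# handling similar volume: one steady, one with occasional spikes.
steady = [98, 101, 99, 102, 100, 97, 103, 100]
spiky = [40, 45, 50, 42, 300, 48, 41, 280]

def p95(samples):
    """Nearest-rank 95th percentile of a list of samples."""
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, "median:", statistics.median(samples), "p95:", p95(samples))
```

The spiky service has the lower median, but its tail latency is roughly triple the steady one's – exactly the kind of inconsistency that a throughput-only benchmark would hide.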

Testing and big data – thinking bigger

Testing plays an essential role in services built on big data. Compared with the specific goals and checks associated with more traditional software testing of code, testing big data involves far more preparation and planning. Getting involved early in looking at requirements and process design – essentially, knowing what the key goals are and what the business needs from them – can help testing teams guide deployments and achieve a successful outcome.

This emphasis on integrating functional testing and proof of concept design together means that more preparation is needed upfront compared to traditional software development timelines. It is worth taking the time to work with business stakeholders and mutually agree on the requirements which determine how the whole application will meet their goals in practice. It is far better to spend this time in advance, preparing and testing at the POC stage, rather than rushing into production deployments. Getting this right is necessary, as it is much harder to migrate services once they are in place, or when the wrong technology choice has been made. While it is possible to integrate existing solutions into an updated process and work around previous choices, this can take far longer than starting from scratch.

By putting in the dedicated groundwork, your big data approach is more likely to successfully deliver what your business demands. Testing also often means you avoid costly mistakes based on bad assumptions. Testing can bring together the whole team, ensuring that all stakeholders – from developers and IT operations teams through to business users – are aligned on the same goals, and can deliver applications designed and built to handle the huge volumes of data that companies expect to see in the future.

Written by Dimitri Vanoverbeke, Senior Solutions Engineer & Tyler Duzan, Product Manager at Percona