Data in the cloud: little has changed, much has improved

By: Paul Muller, Tech Lead Data

Of all disciplines in IT, few have been affected by the public cloud as much as ‘Data Analytics’. With the vast amounts of on-demand and hyper-scalable computing power and storage available, things are possible with data today that nobody could dream of before. At the same time, many of the questions that need answering remain the same.

A typical Data Analytics pipeline starts with collecting data and storing it. Data is then prepared and stored in different formats so it can be put to use for end-users:

Data Analytics starts by collecting data from various sources, from batch-oriented to (near) real-time. Data is transformed, cleansed and stored. In the preparation phase, data is organized into the various forms needed for its actual consumption: loaded into Data Warehouses, NoSQL databases, Machine Learning or AI models, or just plain flat files for Big Data analysis using specialized tools. By now, data has been transformed into information, and it can be applied by analysts and scientists using their tools of choice. Applications can serve data, and dashboards and reports can be refreshed with the latest insights. With cloud, this has not really changed. What has changed is the way we perform these actions, and the scale at which the cloud infrastructure allows us to do our Data Analytics.

Collect: Unlimited storage allows for hoarding of data without the need to throw anything away. Compared to on-premises storage, expanding storage is fast and easy. But as with almost everything that brings you more flexibility, it forces you to make more choices too. With cloud, you have to choose between ‘hot’ and ‘cold’ storage, local and interregional replication, and more (and don’t forget to check whether your bucket or container is set to ‘public’). All this flexibility shows in the costs for data in the cloud too; predicting costs is a challenge. All the choices you make have an impact on those costs, and so do usage patterns: if you have millions of small files and you access them frequently, the API calls involved start to add up. Special options are available to store more data at a lower price, but if you delete data before the agreed term ends, there is a penalty. In other words, data lifecycle management, data security and data usage optimization are still something to consider, even though the cloud allows you to postpone many decisions.
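To make that trade-off a bit more tangible, here is a minimal sketch of a storage cost estimate in Python. All prices, tier names and the 90-day minimum retention are made-up assumptions for illustration, not the rates of any actual cloud provider; always check the current price list of the provider you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTier:
    """Hypothetical storage tier; every number below is an illustrative assumption."""
    name: str
    price_per_gb_month: float   # assumed storage price per GB per month
    price_per_10k_reads: float  # assumed price per 10,000 read requests
    min_retention_days: int     # 0 means no early-deletion penalty


def monthly_cost(tier: StorageTier, stored_gb: float, reads_per_month: int) -> float:
    """Storage cost plus request cost for one month."""
    storage = stored_gb * tier.price_per_gb_month
    requests = reads_per_month / 10_000 * tier.price_per_10k_reads
    return storage + requests


def early_deletion_penalty(tier: StorageTier, deleted_gb: float, age_days: int) -> float:
    """Charge for the days that remain until the minimum retention is reached."""
    remaining_days = max(tier.min_retention_days - age_days, 0)
    return deleted_gb * tier.price_per_gb_month * remaining_days / 30


# Assumed example tiers: cold is cheap to store, expensive to read.
HOT = StorageTier("hot", 0.020, 0.004, 0)
COLD = StorageTier("cold", 0.004, 0.100, 90)

if __name__ == "__main__":
    for tier in (HOT, COLD):
        rarely = monthly_cost(tier, stored_gb=1_000, reads_per_month=10_000)
        often = monthly_cost(tier, stored_gb=1_000, reads_per_month=50_000_000)
        print(f"{tier.name}: rarely read {rarely:.2f}, frequently read {often:.2f}")
```

With these assumed numbers, cold storage wins as long as the data is rarely read, but loses by a wide margin once millions of small files are read every month. Deleting data from the cold tier after 30 days would still be billed for the remaining 60 days.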

Most often part of this ‘collect’ phase, there are many tools that allow you to apply (real-time) transformation and cleansing to your data. Many of them are cloud native. They integrate nicely with different data sources and data sinks, and offer great flexibility. Setting them up using the browser, or even using Infrastructure as Code, as we prefer to do at Digital Survival Company, is fast and easy, while pay per use gives you great flexibility. But this is where the actual work starts: once the (cloud) infrastructure is there, you have to start thinking about optimal ways to get data from source to destination. Don’t forget to apply GDPR/AVG regulations to your data; there is no difference between cloud and on-premises here. Especially when the scale goes up, you start to notice (and start being billed for) suboptimal configuration. Inefficient code can cause slower and more expensive processing of data, so optimizing logic is as important as it ever was.
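As an illustration of such a transformation and cleansing step, here is a minimal sketch in Python. The record layout (an `email` and an `amount` field), the function names and the secret key are hypothetical and only serve the example. Note that replacing an e-mail address with a keyed hash is pseudonymisation, which is one possible measure under GDPR/AVG and not a guarantee of compliance.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace a personal value with a keyed SHA-256 hash (HMAC)."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


def cleanse(records, secret_key: bytes):
    """Yield cleaned records one by one, so the full dataset never sits in memory.

    Records without an e-mail address or with an unparsable amount are dropped.
    """
    for record in records:
        email = record.get("email")
        amount = record.get("amount")
        if not email or amount is None:
            continue
        try:
            amount = float(amount)
        except (TypeError, ValueError):
            continue
        yield {"customer_id": pseudonymise(email, secret_key), "amount": amount}


if __name__ == "__main__":
    raw = [
        {"email": "A@Example.com ", "amount": "12.50"},
        {"email": "", "amount": "3"},                    # dropped: no e-mail
        {"email": "a@example.com", "amount": "oops"},    # dropped: bad amount
    ]
    for row in cleanse(raw, secret_key=b"assumed-demo-key"):
        print(row)
```

Because `cleanse` is a generator, it processes one record at a time, which fits both batch files and (near) real-time streams. In a real pipeline the key would come from a secret store rather than from the code.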

Prepare: Scaling up, scaling out and decommissioning unused resources are all made incredibly easy using serverless compute. This is where cloud really shines. It is possible to normalize and aggregate the data that was collected and transformed into a relational database where you can use your favorite dashboarding tool to access it. With the growth of the dataset, the performance tier of the database can go up accordingly (or only during the busy hours). All this without having to invest in a server based on the predicted peak usage. But when data becomes really big, different technologies can easily be added into one big (but not monolithic) platform. Optimized subsets of data are stored in the aforementioned relational database, and other sets in a NoSQL database for API access through a webapp, or split into many smaller files for massive parallel processing by tools like Spark. The possibilities are endless. But again a warning: all these specialized tools may be very easy to start up and get going, but picking the right tool for the job is as important as ever, and optimizing logic and rightsizing of compute can save a lot of time and money. Just like in the on-premises days, when computing power was limited, it is wise not to spend too much of the unlimited cloud resources by simply brute-forcing your way to results.
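The difference between sizing for peak usage and scaling only during the busy hours can be sketched with a few lines of Python. The two tier names, the hourly prices and the office hours from 08:00 to 18:00 are assumptions chosen for the example, not actual cloud prices.

```python
# Assumed hourly prices for a low and a high database performance tier.
HOURLY_PRICE = {"low": 0.10, "high": 0.80}

# Assumed busy hours: 08:00 up to (not including) 18:00.
BUSY_HOURS = range(8, 18)


def tier_for_hour(hour: int, busy_hours=BUSY_HOURS) -> str:
    """Pick the high tier during busy hours and the low tier otherwise."""
    return "high" if hour in busy_hours else "low"


def daily_cost_scheduled(busy_hours=BUSY_HOURS) -> float:
    """Cost of one day when the tier follows the busy-hours schedule."""
    return sum(HOURLY_PRICE[tier_for_hour(hour, busy_hours)] for hour in range(24))


def daily_cost_fixed_peak() -> float:
    """Cost of one day when the database is sized for peak usage all day."""
    return 24 * HOURLY_PRICE["high"]


if __name__ == "__main__":
    print(f"scheduled:  {daily_cost_scheduled():.2f} per day")
    print(f"fixed peak: {daily_cost_fixed_peak():.2f} per day")
```

Under these assumptions the scheduled setup costs roughly half of the always-at-peak setup. The actual saving depends entirely on your own usage pattern and on the prices of the service you pick, which is exactly why rightsizing deserves attention.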

Apply: After all these preparations, the actual information that was hidden in all that data can be used to deliver some value. So far, everything has been remarkably similar to earlier days of data processing. But now the sheer volume of data, stored in various formats and combined with the power of readily available and easy-to-use cloud-based applications, allows for improved and faster workflows. Information that was hidden in vast seas of data becomes visible because of all the computing power that can process it. Machine Learning and Artificial Intelligence models create new insights into the past, and better predictions for the future. With streaming analytics, businesses can even do real-time predictions, optimizing processes really fast so they can minimize waste and optimize results.

With cloud, a lot in Data Analytics has not really changed. From a technical perspective this is good. It means that engineers and developers can still apply much of the knowledge they already have, while at the same time having the opportunity to learn new things: things that were just not possible before. Analysts and scientists have many more options for unlocking the information for data consumers, and more possibilities at hand than ever before.

So I guess that even though little has changed, much more has improved. If you would like to discuss your data challenges, or want to know what I or Digital Survival Company can do for you, feel free to leave me a message via the Let’s Connect contact form in the top menu. We can help you set up your cloud resources, fully automated, integrating data and tools to provide you with your own custom ‘Data Analytics as a Service’.

Kind regards,
Paul Muller

Tech Lead of Data
Digital Survival Company