Analysing Data

From GEST-S482 Digital Business

Introduction

Corporate analytics (or business intelligence) is about taking the data that you produce and the data that you ingest from outside, and making it useful by transforming it and presenting it to business analysts in a way that makes sense for them (and that is easy to use).

Dimensional Models

The whole field of traditional business intelligence created a number of concepts and techniques so that businesspeople can access data in a format they intuitively understand. Traditionally, the process was organized around three phases:

  • The ETL, for Extract-Transform-Load,
  • The Semantic Layer, usually a data warehouse,
  • And the Presentation Layer, usually data cubes.

ETL Layer

This phase involves extracting data from a source (often so-called transactional systems, as opposed to analytical systems, or even the web), transforming it to standardize it, and finally loading and storing it in the data infrastructure.
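
To make these three steps concrete, here is a minimal ETL sketch in Python with pandas; the file, table and column names (orders.csv, fact_sales, unit_price, and so on) are invented for illustration and are not part of the course material.

  import sqlite3
  import pandas as pd

  # Extract: pull raw records from a transactional source (here a CSV export).
  raw = pd.read_csv("orders.csv")

  # Transform: standardize types and units and derive the measures we need.
  raw["order_date"] = pd.to_datetime(raw["order_date"])
  raw["revenue"] = raw["quantity"] * raw["unit_price"]
  clean = raw.dropna(subset=["customer_id", "product_id"])

  # Load: store the result in the analytical infrastructure (here SQLite).
  with sqlite3.connect("warehouse.db") as conn:
      clean.to_sql("fact_sales", conn, if_exists="replace", index=False)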

Semantic layer

The goal of an analytics system is to answer business questions. In order to do so, your data must have meaning, and meaning is usually built by providing context to data points. A semantic layer is the part of a database where data is embedded in its context. Here, data is usually organized in a dimensional model: numerical values (the measures) are gathered in tables called fact tables, and besides these values, fact tables only contain foreign keys pointing to entries in dimension tables.

There are different database structures in a dimensional model. A star schema is a simple schema where each dimension is rolled up into a single table, meaning that the fact table is connected through its foreign keys to dimensions that are one level deep. This kind of schema is very easy to query, because at most one join per dimension is needed in the SQL query, but it contains a lot of repetition. It is very good for querying, yet maintaining it is a challenge (cf. updates). A snowflake schema is a middle ground between a pure star schema and a completely normalized database (cf. transactional systems). In a snowflake, there is still one fact table (if there are several, the schema is called a constellation), but the dimensions are organized as a hierarchy of tables.
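
As a small sketch, assuming an invented retail example (sales facts with date and product dimensions; all table and column names are made up for illustration), a star schema could be declared with Python's built-in sqlite3 module like this:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
  -- Dimension tables: one denormalized table per dimension (star schema).
  CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
  CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

  -- Fact table: numerical measures plus foreign keys into the dimensions.
  CREATE TABLE fact_sales (
      date_id    INTEGER REFERENCES dim_date(date_id),
      product_id INTEGER REFERENCES dim_product(product_id),
      quantity   INTEGER,
      revenue    REAL
  );
  """)

In a snowflake variant, the category column would move to its own dim_category table referenced from dim_product, trading some query simplicity for less repetition.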

The structure of a query in a dimensional model will always be the same (a sketch follows the list below):

  • The SELECT part will always be a mix of unaggregated dimension fields and aggregated measures.
  • The FROM part will always join the fact table to the dimension tables and sometimes (in a snowflake) dimension tables within the same hierarchy.
  • The WHERE part is a filter on one or more dimensions.
  • The GROUP BY will always list the unaggregated dimension fields from the SELECT.
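
For instance, against the hypothetical star schema sketched earlier (the table and column names are still the invented ones), such a query could look like this:

  # Typical dimensional query: unaggregated dimension fields plus aggregated
  # measures, joins from the fact table to its dimensions, a filter on a
  # dimension, and a GROUP BY repeating the unaggregated dimension fields.
  query = """
  SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
  FROM fact_sales f
  JOIN dim_date    d ON f.date_id    = d.date_id
  JOIN dim_product p ON f.product_id = p.product_id
  WHERE d.year = 2023
  GROUP BY d.year, p.category;
  """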

Presentation Layer

This layer can take many forms: dashboards in Tableau, SQL terminals, or OLAP cubes. OLAP, for On-Line Analytical Processing, refers to a set of pre-computed dimensional queries, aggregated at each level of the hierarchy and served to the analysts' spreadsheets.
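
As a rough sketch of what "pre-computed at each level of the hierarchy" can mean, here is a pandas version with an invented year > month hierarchy and invented numbers:

  import pandas as pd

  # Toy fact data with a small date hierarchy (year > month); values invented.
  sales = pd.DataFrame({
      "year":    [2023, 2023, 2023, 2024],
      "month":   ["Jan", "Jan", "Feb", "Jan"],
      "product": ["A", "B", "A", "A"],
      "revenue": [100.0, 80.0, 120.0, 90.0],
  })

  # Pre-compute the aggregates at each level, the way an OLAP cube would,
  # so analysts can drill up and down without re-querying the raw facts.
  by_month = sales.groupby(["year", "month", "product"])["revenue"].sum()
  by_year = sales.groupby(["year", "product"])["revenue"].sum()
  grand_total = sales["revenue"].sum()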

Collective Intelligence

Collective Intelligence is the name given to the technical side of mass customization. It comes in many forms and shapes. The huge amount of data collected by companies, however questionable the practice might be, is not all bad. For instance, it allows companies such as Netflix to recommend personalized films to customers.

Imagine a dataset containing movie ratings from many people (about 6,000). Arrange the ratings in a matrix where every row represents an individual and every column a movie, and apply a Singular Value Decomposition (SVD). The intuition is to compress the large space in which people are spread out into a small volume, just like compressing something, concentrating the variance (the information) so that people who end up close to each other give similar ratings. However, not every person rated every film, so the empty cells of the matrix have to be filled in using look-alike models.
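
A minimal sketch of this idea with numpy, on a tiny invented ratings matrix where 0 marks a missing rating (a real system would handle missing values and centering much more carefully):

  import numpy as np

  # Tiny invented ratings matrix: rows are people, columns are movies,
  # 0 means "no rating given".
  R = np.array([
      [5.0, 4.0, 0.0, 1.0],
      [4.0, 0.0, 0.0, 1.0],
      [1.0, 1.0, 0.0, 5.0],
      [0.0, 1.0, 5.0, 4.0],
  ])

  # Singular Value Decomposition of the ratings matrix.
  U, s, Vt = np.linalg.svd(R, full_matrices=False)

  # Keep only the k largest singular values: this "compresses" the people
  # into a low-dimensional space that concentrates most of the variance.
  k = 2
  R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

  # The reconstruction also puts values in the empty cells: each missing
  # rating is extrapolated from look-alike rows and columns.
  print(np.round(R_hat, 1))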

The issue with singular vectors is that they make strong assumptions. They extrapolate values into the empty cells, making every value revolve around a baseline, so the results will be similar for people who appear to like the same movies, without taking individual characteristics into account. A recommender system can nevertheless be built by assuming that people with similar tastes will keep liking the same films. In the example shown, the five top-rated movies of a viewer were compared with the five recommendations given to that same viewer: 3 out of 5 films were the same.
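
Continuing the numpy sketch above (with the same invented data), a naive recommender would simply suggest, for each person, the unseen movies with the highest reconstructed rating; real systems such as Netflix's are of course far more elaborate:

  import numpy as np

  # Same invented ratings matrix and rank-2 reconstruction as in the
  # previous sketch.
  R = np.array([[5.0, 4.0, 0.0, 1.0],
                [4.0, 0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0, 5.0],
                [0.0, 1.0, 5.0, 4.0]])
  U, s, Vt = np.linalg.svd(R, full_matrices=False)
  R_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

  # Recommend, for person 0, the movies they have not rated yet,
  # ordered by their reconstructed score.
  person = 0
  unseen = np.where(R[person] == 0)[0]
  recommendations = unseen[np.argsort(-R_hat[person, unseen])]
  print(recommendations)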

This is where statistics and computer science come into the picture, in what is known as the era of modern analytics and AI.

Modern Analytics - The Next Frontier

In the 1970s, the approach to artificial intelligence was based on the manipulation of symbols or concepts. Researchers were mostly trying to make machines “learn” the way one learns mathematics: start with axioms, build a theorem, use the result of that theorem to build other theorems, rinse and repeat. As it turns out, this did not work so well. Despite early successes, artificial intelligence was considered old news for a long time (the so-called AI Winter) and interest faded. But interest eventually returned, and AI came back in the form of statistical machine learning.

While statistics had existed for a couple of centuries, the specific type of statistics useful for the tasks AI was asked to solve, and some of the required techniques, were out of reach for mathematicians. Indeed, they required fast computers and, in many cases, the ability to perform matrix computations very fast. Bizarrely, one of the reasons we now have those tools is the gaming community. Since one of the most important requirements for good 3D graphics is also matrix computation, manufacturers started to build increasingly fast Graphics Processing Units at prices that were accessible even to university researchers, who in turn used them to implement algorithms and concepts that had existed for a long time but were rarely used because they were impractical.

While the traditional techniques based on symbolic computation do not require data, those new algorithms need a lot of data to “learn”. But what do machines learn? They learn mathematical relations, and in order to have such relations, you need models.

Much like modern programming languages such as Python are built on older, lower-level languages such as assembler but work at a higher level of abstraction, modern analytics algorithms are able to manipulate concepts and represent relations that you will typically find in today's mental models. However, they depend in large part on the concepts created in the last iteration of the information economy: if one does not have an ERP or some Management Information System, such as a CRM, one won't have the data required to figure out which product one can sell to a specific consumer to maximize one's profit. If you are not able to retrieve, join and aggregate data in a large, structured database, having killer algorithms won't do you much good. If your business processes are not somehow modelled and abstracted, you won't be able to improve them or expose them to your business partners and customers to make them nimble and adaptive. And if you're not on the internet today, you're not in business at all.

Overview of techniques and goals

In order to do modern analytics and take part in the Machine Learning revolution, one can leverage the fact that the technical parts are already done and made available by a community in the form of libraries, packages, etc. Today, the ability to perform complex analytical tasks is not conditioned on one's ability to code them oneself (because it has already been done by someone), but on one's ability to formulate the problem in a way that makes it easy to apply the available tools to it.

So, where are we in the analytics journey? To illustrate that, we are going to use what is called the ladder of causation, a concept developed by the AI researcher Judea Pearl (if you want to read more about it, it is explained in his book "The Book of Why"). Conceptually, Pearl explains that there are three steps towards having a truly understanding machine or AI:

  • Observation: right now, with the advent of neural networks and similar algorithms, we more or less know how to predict things that we can observe, so we can find correlations and represent them pretty accurately. In terms of techniques, the neural network sits at one end of the current data analytics toolbox because it is very advanced and opaque. Basically, a neural network is a mathematical model where you have your input (the input layer) and a series of transformations through which the input is passed (you can have several layers of neurons). At the other end of the toolbox, we have models that are very well understood and very interpretable, for instance decision trees (see the sketch after this list). And in the middle we have probabilistic models that are partly understandable but also flexible enough to represent very complex correlations between sets of variables. This is the general toolbox right now. These models are currently geared towards understanding what we can observe, but we do not manipulate anything yet.
  • Intervention: this layer can be summarized by the sentence: "If I change this, what is going to happen to that?". Basically, you change something, you see what happens, and you try to understand and identify the model behind the problem you have. Right now, this is at the frontier of research in machine learning: "If I change this, how can I figure out how that is going to react?"
  • Counterfactual: "If I had done this, what would have happened to that?". Once we can answer these kinds of questions, we will have machines that can imagine situations. This is not a very established field right now.
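
As a tiny illustration of the interpretable and opaque ends of the toolbox described in the observation step, here is a sketch with scikit-learn (assumed to be available) on an invented toy dataset: a decision tree whose rules can be printed, and a neural network whose weights cannot be read as rules.

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier, export_text
  from sklearn.neural_network import MLPClassifier

  # Invented data: two input features, one binary outcome to predict.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 2))
  y = (X[:, 0] + X[:, 1] > 0).astype(int)

  # Interpretable end of the toolbox: the fitted tree prints as readable rules.
  tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
  print(export_text(tree))

  # Opaque end of the toolbox: the network predicts, but its weights are not
  # directly readable as rules.
  net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000).fit(X, y)
  print(net.predict(X[:5]))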

Right now, we are able to understand and implement observation and intervention, but we are still some way from implementing the counterfactual layer. We are at the beginning of the journey: we do not yet know how to represent all the causal links and causal mechanisms that lead to a variation.


