What is Data Science and why do we need it?

Introduction

This article reviews what data science is, why businesses need it, and why now is the time to adopt it.

What is data science?

Data science (see also this) is the science of working with data to extract patterns, which are eventually used to present information and make decisions. The following picture is a popular Venn diagram that defines data science:

Image result for data science


While the picture is quite telling, many of its details need clarification. Clearly, a data scientist needs Computer Science, Mathematics, and Domain knowledge. But each of these, on its own, is a large and complicated field of study. I personally believe that the more you know in these areas, the better you will be at solving data science-related problems. However, "what do we need from each of these to get started?" is a fair question.

Mathematics provides the methodologies required to design solutions to problems. Techniques like classification, clustering, statistical tests, and even neural networks have deep mathematical foundations that make them effective at solving the relevant problems. Computer science, in turn, provides the tools to implement these techniques: programming languages, algorithm design, complexity theory, software design and architecture, databases, and the cloud, among others. Together, skills from Computer Science and Mathematics tell us what we can solve and how to solve it. These tools, however, say nothing about what is worth solving, that is, what needs to be solved and why. That information comes from domain knowledge (or subject matter experts, SMEs).
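To make the split between the two roles concrete, here is a minimal sketch in Python. The mathematics supplies the least-squares formulas for fitting a straight line; the code merely implements them. The function name `fit_line` and the numbers in `ad_spend` and `sales` are made up for illustration and do not come from any real dataset.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Mathematics: slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for this example only.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales is roughly {slope:.2f} * ad_spend + {intercept:.2f}")
```

Whether a relationship like this one is worth modelling at all is not something the code or the formula can answer; that is where the domain knowledge comes in.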

Why is data science needed?

In the traditional model, data was analyzed by data analysts, who spent months manually analyzing data and extracting information, using tools like Excel (or SPSS if they were more skilled).


As the amount of data and the variety of modalities collected grow, this model can only be scaled by hiring more data analysts, which is expensive and ineffective. With more data and more people involved, human error goes up, collaboration becomes more difficult, and the extracted information becomes messier.

An alternative is for the business to focus on providing better tools that extract higher-level information from the data, rather than scaling its workforce with the amount of raw data collected. This lets data analysts and SMEs spend less time on "raw" data and more time on "processed" data, which takes less time to work with. This is where data scientists come in. Data scientists have the skills and tools to extract higher-level information from the raw data without losing much important information. Data science, at this stage, is the only feasible and scalable way to deal with a huge amount of data while still making sure the most informative knowledge is extracted and provided to the SMEs.

Why now?

Pure Mathematics is always well ahead of other sciences, simply because it does not require much infrastructure (labs, equipment, etc.). When it is applied, however, infrastructure becomes an issue. For example, while the "state-of-the-art" convolutional neural network is nearly 25 years old (see this), it only became successful around 2012, when it could be implemented effectively on GPUs and showed great potential in vision tasks. Of course, new advances have been made since then; still, the most widely used methods in data science are linear regression and support vector machines, among others. So, why has data science become so popular only recently if the knowledge already existed? Here are some of the reasons.


  • Maturity of methods: With the internet, the methods invented by mathematicians are shared more rapidly and publicized better, which makes them more accessible to anyone interested. In the meantime, many people have simplified the methodologies to make them easier to understand.
  • Cheap data collection: Sensors and electronic devices have become much more affordable than before. The cost of sensors is dropping drastically: in 2004 the average cost of a sensor was $1.30, and by 2020 it is expected to come down to $0.38 (see this).
  • Cheap data storage: The price of hard-drive storage per GB has dropped significantly over the years. This makes data storage easier, so more data can be collected.
    Size of hard drives over the years. See this
    Cost of data storage per GB over time. See this
  • Accessibility of data: The internet is one of the most important factors behind the significant increase in access to information. This access is essential for a data scientist.
  • Much more compute power: Convolutional neural networks were not widely publicized until 2012, when this massively parallelizable idea was finally implemented on suitable hardware, the GPU. Mainframes have been replaced by very accessible clusters and clouds. Processing power per CPU core has increased significantly (see Moore's law). The Apollo Guidance Computer, which guided the spacecraft to the moon, had a CPU with a 2 MHz clock and 4 KB of RAM! That is far less power than our phones have today. See this.

And we are still far behind the processing, memory, and storage power of the human brain:

Computers versus the human brain at exascale: how are we doing compared to nature? See this.
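To put the sensor prices quoted under "Cheap data collection" in perspective, the short Python sketch below works out the total drop and an average yearly rate of change. The constant-yearly-rate (geometric) model is an assumption made here for illustration, not something the cited source states, and the helper names `total_drop` and `annual_change` are likewise made up for this example.

```python
def total_drop(start, end):
    """Fraction by which the price fell between start and end."""
    return (start - end) / start


def annual_change(start, end, years):
    """Average yearly rate of change, assuming a constant rate every year."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the article: average sensor cost in USD.
cost_2004 = 1.30
cost_2020 = 0.38  # an expected value at the time of writing

drop = total_drop(cost_2004, cost_2020)
rate = annual_change(cost_2004, cost_2020, 2020 - 2004)

print(f"Total drop: {drop:.1%}")
print(f"Average yearly change: {rate:.1%}")
```

Under that assumption, the quoted figures correspond to a total drop of roughly 71%, or a little over 7% per year.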




Summary:


  1. What is data science: Data science solves problems using tools based on Mathematics and Computer Science, while the need for a solution comes from business domain knowledge. This is what makes it effective in adding value to the industry.
  2. Why data science is needed: Lots of data is collected every day, and it may contain valuable information. Hiring more data analysts is neither economical nor scalable. Data scientists can provide solutions that abstract the data and extract higher-level information from the raw data, which the SMEs can then use more effectively.
  3. Why now is the time: Compute power (processors, GPUs, and clusters), the maturity of methods (Mathematics), storage (hard drives, clouds), accessibility of data (the internet), and access to cheap sensors are the main drivers. None of these has ever been as readily available as it is today. So, now is the time to use data science.

I would be glad to discuss any topic in the area of data science. Please leave your opinion below.



By R.B.
