What is data science and how to become a data scientist for free

Introduction


Data science is meant to combine techniques from computer science (programming, algorithm design, and machine learning) and mathematics (statistics, probability, and algebra) to extract insights and information from data. It is, however, arguable which method belongs to which field (e.g., is machine learning a field of mathematics or of computer science? It certainly has a fair bit of math in it). There are three main steps in most data science projects:

  1. Exploration: Data science projects usually start with an exploratory phase, in which the data are explored to understand which variables are related to one another and how. In addition to providing insights to the business, this step gives the data scientist a basic understanding of which approach to use in the next step, which is usually mathematical modeling. This phase may require heavy input from subject matter experts to understand why the data are the way they are.
  2. Modeling: There are two main ways to follow up the exploratory phase: deploying the data analytics findings through visualization tools, or extending the findings by mathematical modeling. A model is an approximation of reality, built from the historical data and designed based on what was learned in the exploratory phase. A model can be used to understand states that have not been experienced before (that is, states that do not exist in the historical data). While this covers many data science projects, some other types (e.g., clustering) might be a bit different.
  3. Deployment: The final step is, of course, deploying the models/insights and running them on a real-time stream of data, which is a very challenging step. A detailed discussion of this is out of the scope of this essay.


What is so special about data science?

There are benefits to each step of the process mentioned above.

Benefits of exploration and data analytics: One point I personally see in using data analytics in any industry is the difference between "common sense" and "common practice". Common practice is the practice currently in place in an organization to accomplish a task. Such practices are formed over many years of experience through trial and error. Whether such practices make sense is another question, meaning that common sense might actually be very different from common practice. Questionable common practices are difficult to spot and change because of the psychological attachment people develop toward them. Data, however, does not lie. It is just an emotionless reflection of reality. Data science, and data analytics in particular, provides tools to extract insights from the data that support or reject common practices.

Benefits of modeling: Models provide an approximate, computerized version of reality. Because a model is computerized, one can use it to estimate what was not in the data (e.g., what-if analysis). This is very useful for gaining knowledge about things we don't know (e.g., what will happen in the future, or what exists in a region we did not sample). Predicting what is going to happen in the future helps the business become proactive rather than reactive to events (see figure).


Left: An event happens, a monitoring system reports it, and a reaction is launched. Right: The event is predicted and an action is launched before it happens.



One attractive point about data science is its current salary range, which is quite high compared to other jobs.

What do you need to become a data scientist


There are two ways to become a data scientist: education and experience. Even with education, experience is still of significant importance. Let's focus on how to become a data scientist by experience.

To establish anything, from a science to a construction, you need infrastructure. To become a data scientist, as defined earlier, you also need infrastructure.

  1. Good knowledge of at least one programming language: A programming language is essentially the platform where you implement your data science knowledge to solve a problem. Python, R, Scala, Java, C, and even Matlab are viable choices; however, Python and R are the most popular ones in this area. They provide many libraries that make life easier for a data scientist.
  2. A bit of statistics and mathematics background: Although you don't need to be a mathematician, you need a basic understanding of topics like statistical tests, vectors, and matrices.
  3. Understanding essentials: Data science problems are usually related to one of the following topics: supervised learning (from regression to NLP), unsupervised learning (from k-means to Adversarial Neural Nets), dimensionality reduction (from PCA and ICA to Autoencoders), data characterization (from feature engineering to feature formation), and statistical differences. You don't need to know all existing methods and topics under each of these headings, but you need to know something in each. More importantly, you need to know how to program these and use them.
  4. Tools: Cloud computing has become a trend these days (a useful one, to be honest), and experience with it is a plus.
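
As an illustration of what "knowing how to program these" can look like, here is a minimal sketch of k-means, one of the unsupervised methods named in the list above, written in plain Python. The k_means function, its parameters, and the toy points are assumptions made for this example. In practice you would normally use a library implementation such as the one in sklearn.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with plain assignment/update iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers

# Hypothetical toy data (invented for this sketch): two visibly separate groups.
points = [(1.0, 1.2), (0.8, 1.0), (1.3, 0.9), (1.1, 1.4),
          (9.0, 9.2), (8.7, 9.1), (9.3, 8.8), (9.1, 9.4)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

The first four points should end up sharing one label and the last four the other. Which group gets label 0 depends on the random starting centers, so only the grouping is meaningful, not the label numbers.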


Domain knowledge is another topic you may find mentioned on many websites as a requirement for a data science job. While it is important, I personally believe that a data scientist should be able to pick up the domain knowledge rather quickly on the job.

If you can demonstrate these, you already have what it takes to be called a "data scientist".

What does a hiring manager look for in your CV

As a hiring manager, I look for some standard points when short-listing applicants for a data science job:

  1. Programming skills: This is the first point I usually look for when hiring a data scientist. If you don't know how to program, you cannot develop your ideas, which is essential to delivering projects.
  2. Education: Education is very important; list it in the CV and try to relate it to the job and to data science at the same time. Note that you do not need an education in mathematics, statistics, or computer science to be a good data scientist. You just need to demonstrate that you have taken some courses related to data science. If you have a Bachelor's degree in psychology, highlight the understanding of statistics that comes with a psychology education. However, the further your field is from data science-related fields, the more you need in your "showcases".
  3. Showcases: I look for data science projects you have done before. They can be small (done out of curiosity), competitions (Kaggle), or large projects done as part of another job.
  4. Benefits: It is very important that, for each project in the showcase, you also demonstrate the benefit of that project. The benefit could be as simple as having learned a specific method, or as substantial as adding value to a business, ideally with a clear way of calculating the monetary value of the deliverables. This is particularly important in businesses, as they usually prioritize their projects, and your skill in understanding project benefits comes in handy.
  5. What you say should make sense: If a candidate claims their education is a Bachelor's in Chemistry and that they designed a two-dimensional LSTM to model time and space at the same time, I would not believe it. It is not that they must be lying, but it is unlikely that the project was done properly if it was really done by that person alone. But if that candidate claims they solved this or that problem using a method in sklearn, I would believe it and trust the person.
  6. Domain knowledge: If you have some knowledge of the domain of the job you are applying for, make sure you include it in your resume. Subject matter knowledge is very useful for delivering projects successfully.

Note that I would never search for an education in data science or mathematics in a CV when hiring a data scientist. While that can be a factor, showcases are far more important.

How to become a data scientist for free
You can become a data scientist for free: no DataCamp, no Coursera, and no expensive education needed. You just need to strengthen the abilities mentioned above.

1) Install Python: install the Anaconda distribution.
2) To learn the relevant programming skills, I suggest using this channel and completing the courses "NumPy Data Science Essential Training With Python 3" and "Python for data science training".
3) Learn machine learning with Python: this channel has what you want; see "Machine learning with python".
4) Use this cheat sheet and this page actively.


My last, and probably most important, point: do not start with deep learning. Deep learning is the state of the art in machine learning, but it has limitations and can be very deceptive. It is very useful, but starting with it may lead to the misconception that it can solve everything. If an applicant's CV says they are an expert in some deep learning method, while there is no mention of linear regression, random forests, feature engineering, or other classic methods, I would not read that CV any further.

Note, however, that once the basics are understood, the next natural step is, of course, deep learning (including Autoencoders, Convolutional Neural Networks, and Generative Adversarial Networks, to name but a few).

As always, this is open to discussion, so comments are welcome.



By R.B.








