The 10 Statistical Techniques Data Scientists Must Master
Source: 网络大数据 Author: 张曹 (editor) Category: 聚焦物联

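Whatever your industry or academic background, the importance of data technology deserves serious consideration. So how should we go about acquiring data science knowledge? Programming ability certainly matters, but data scientists in fact need to combine programming, statistics, and critical thinking. More specifically, statistical skills are indispensable for anyone who wants to be competent in a data science role.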

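1. Linear Regression

2. Classification

3. Resampling Methods

4. Subset Selection

5. Shrinkage

6. Dimension Reduction

7. Nonlinear Models

8. Tree-Based Methods

9. Support Vector Machines

10. Unsupervised Learning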


Original title: The 10 Statistical Techniques Data Scientists Need to Master

Regardless of where you stand on the matter of Data Science sexiness, it’s simply impossible to ignore the continuing importance of data, and our ability to analyze, organize, and contextualize it. Drawing on their vast stores of employment data and employee feedback, Glassdoor ranked Data Scientist #1 in their 25 Best Jobs in America list. So the role is here to stay, but unquestionably, the specifics of what a Data Scientist does will evolve. With technologies like Machine Learning becoming ever more commonplace, and emerging fields like Deep Learning gaining significant traction amongst researchers and engineers (and the companies that hire them), Data Scientists continue to ride the crest of an incredible wave of innovation and technological progress.

While having a strong coding ability is important, data science isn’t all about software engineering (in fact, have a good familiarity with Python and you’re good to go). Data scientists live at the intersection of coding, statistics, and critical thinking. As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” I personally know too many software engineers who are looking to transition into data science and who blindly apply machine learning frameworks such as TensorFlow or Apache Spark to their data without a thorough understanding of the statistical theory behind them. This is where the study of statistical learning comes in: a theoretical framework for machine learning drawing from the fields of statistics and functional analysis.

Why study Statistical Learning? It is important to understand the ideas behind the various techniques, in order to know how and when to use them. One has to understand the simpler methods first, in order to grasp the more sophisticated ones. It is important to accurately assess the performance of a method, to know how well or how badly it is working. Additionally, this is an exciting research area, having important applications in science, industry, and finance. Ultimately, statistical learning is a fundamental ingredient in the training of a modern data scientist. Examples of Statistical Learning problems include:

Identify the risk factors for prostate cancer.

Classify a recorded phoneme based on a log-periodogram.

Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.

Customize an email spam detection system.

Identify the numbers in a handwritten zip code.

Classify a tissue sample into one of several cancer classes.

Establish the relationship between salary and demographic variables in population survey data.

In my last semester in college, I did an Independent Study on Data Mining. The class covered extensive material from 3 books: Intro to Statistical Learning (Hastie, Tibshirani, Witten, James), Doing Bayesian Data Analysis (Kruschke), and Time Series Analysis and Applications (Shumway, Stoffer). We did a lot of exercises on Bayesian Analysis, Markov Chain Monte Carlo, Hierarchical Modeling, Supervised and Unsupervised Learning. This experience deepened my interest in Data Mining as an academic field and convinced me to specialize further in it. Recently, I completed the Statistical Learning online course on Stanford Lagunita, which covers all the material in the Intro to Statistical Learning book I read in my Independent Study. Having now been exposed to the content twice, I want to share the 10 statistical techniques from the book that I believe any data scientist should learn in order to handle big datasets more effectively.

Before moving on with these 10 techniques, I want to differentiate between statistical learning and machine learning. I wrote one of the most popular Medium posts on machine learning before, so I am confident I have the expertise to justify these differences:

Machine learning arose as a subfield of Artificial Intelligence.

Statistical learning arose as a subfield of Statistics.

Machine learning has a greater emphasis on large scale applications and prediction accuracy.

Statistical learning emphasizes models and their interpretability, and precision and uncertainty.

But the distinction has become more and more blurred, and there is a great deal of “cross-fertilization.”

Machine learning has the upper hand in Marketing!

1 — Linear Regression:
