@sambodhi 2017-03-20T02:11:23.000000Z

The Most Popular Language For Machine Learning Is ...


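Dr. Jean-François Puget shares his view on which languages are popular in machine learning and data science, and on the principles for choosing a language in those fields.

Dr. Jean-François Puget lives in Saint Raphael, France. He is an IBM Distinguished Engineer working on machine learning and optimization.

Dr. Jean-François Puget wrote the article "The Most Popular Language For Machine Learning Is ...". With the author's permission, InfoQ has translated it and shares it here. The article follows.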

What programming language should one learn to get a machine learning or data science job? That's the silver bullet question. It is debated in many forums. I could provide here my own answer to it and explain why, but I'd rather look at some data first. After all, this is what machine learners and data scientists should do: look at data, not opinions.


So, let's look at some data. I will use the trend search available on indeed.com. It looks for occurrences over time of selected terms in job offers. It gives an indication of what skills employers are seeking. Note, however, that it is not a poll on which skills are effectively in use. It is rather a leading indicator of how skill popularity evolves (more formally, it is probably close to the first-order derivative of popularity, since popularity changes by the skills hired plus the skills gained through retraining, minus the skills lost to retirement and departures).
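
One way to read the parenthetical above, as a rough sketch: write $P(t)$ for the number of practitioners using a skill at time $t$, and $h(t)$, $r(t)$, $\ell(t)$ for the rates at which people with the skill are hired, retrain into it, and retire or leave. These symbols are assumptions introduced here for illustration and do not come from the original post.

$$\frac{dP}{dt} \approx h(t) + r(t) - \ell(t)$$

Job offers mainly track $h(t)$, so the trend data approximates $dP/dt$ only to the extent that retraining and departures are small or stable.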



Enough talk, let's get the data. I searched for skills used in conjunction with "machine learning" and "data science", where the skills are the prominent programming languages Java, C, C++, and JavaScript. I also included Python and R, which we know are popular for machine learning and data science, as well as Scala, given its link to Spark, and Julia, which some think is the next big thing here. Running this query, we get the data we are looking for:

When we focus on machine learning, we get similar data:


What can we derive from this data?
First of all, we see that one size does not fit all. A number of languages are fairly popular in this context.


Second, there is a sharp increase of popularity for all these, reflecting the increased interest in machine learning and data science over the last few years.


Third, Python is the clear leader, followed by Java, then R, then C++. Python's lead over Java is increasing, while Java's lead over R is decreasing. I must admit I was surprised to see Java in second place; I was expecting R instead.

Fourth, Scala's growth is impressive. It was almost nonexistent 3 years ago, and it is now in the same ballpark as more established languages. This is easier to spot when we switch to the relative view of the data on indeed.com:



Fifth, Julia's popularity is nowhere near that of the others, but there is definitely an uptick in recent months. Will Julia turn into one of the popular languages for machine learning and data science? The future will tell.


If we ignore Scala and Julia in order to zoom in on the growth of the other languages, then we confirm that Python and R grow faster than general-purpose languages.

It may be that R's popularity will pass that of Java soon, given the difference in growth rate.

When we focus on deep learning with this query, the data is quite different:

There, Python is still the leader, but C++ is now second, followed by Java, with C in fourth place. R ranks only fifth. There is clearly an emphasis on high-performance computing languages here. Java is growing fast, though. It could reach second place soon, as it has for machine learning in general. R isn't going to be near the top anytime soon. What surprises me is the absence of Lua, although it is used in one of the major deep learning frameworks (Torch). Julia isn't present either.

The answer to the original question should now be clear. Python, Java, and R are the most popular skills when it comes to machine learning and data science jobs. If you want to focus on deep learning rather than machine learning in general, then C++, and to a lesser extent C, are also worth considering. Remember, however, that this is only one way of looking at the problem. You may get a different answer if you are looking for a job in academia, or if you just want to have fun learning about machine learning and data science in your spare time.

What about my personal answer? I answered earlier this year in this blog. Besides having support from many top machine learning frameworks, Python is a good fit for me because I have a computer science background. I would also feel comfortable with C++ for developing new algorithms, given that I've programmed in that language for most of my professional life. But this is me, and people with different backgrounds may feel better with another language. A statistician with limited programming skills will certainly prefer R. A strong Java developer can stay with his favorite language, as there are significant open source projects with a Java API. And a case can certainly be made for any of the languages on these charts.

Therefore, my advice would be to read other blogs discussing the same question before investing significant time in learning a language.


Original author: Jean-François Puget
Original link: The Most Popular Language For Machine Learning Is …


Every time a "what's the best language for Machine Learning/Data Science?" thread pops up, it devolves into a flame war between "real data analysts use R" and "Python has thousands more libraries!" (most recent example, 16 days ago: https://news.ycombinator.com/item?id=13110230 ).

My response is always use-the-most-appropriate-tool-for-the-job-dammit, and don't pigeonhole yourself into one language, since each language has its pros and cons. I am very tempted to write an HN autocomment bot at this point (in Python instead of R, of course, since that's the most appropriate tool for this job).

You can write and test classifiers in about 5 lines of Python using scikit-learn.
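
As a rough illustration of that claim, here is a minimal sketch of training and testing a classifier with scikit-learn. The iris dataset, the k-nearest-neighbors model, and the 75/25 split are assumptions chosen for the example; the comment above does not say which data or model it had in mind.

```python
# Illustrative choices (not from the original comment): iris data, k-NN, 75/25 split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and labels, then hold out a quarter of the samples for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the classifier on the training split and score it on the held-out split.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Swapping in another estimator usually means changing only the import and the constructor line, since scikit-learn estimators share the same `fit` and `score` interface.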

The other advantage of Python is that, as a scripting language, it is very powerful for data wrangling and pre-processing, without needing all the boilerplate that a language such as C++ would require.
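
To make the wrangling point concrete, here is a small sketch that uses only the Python standard library. The input text, the column names, and the `clean_rows` helper are all hypothetical, made up for this illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export: inconsistent casing, stray whitespace, one missing count.
RAW = """label, count
 Apple ,120
banana,95
cherry,
apple,30
"""

def clean_rows(text):
    """Yield (label, count) pairs, normalizing labels and skipping rows without a count."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        label = row["label"].strip().lower()
        count = row["count"].strip()
        if count:  # drop rows whose count field is empty
            yield label, int(count)

# Aggregate counts per normalized label.
totals = defaultdict(int)
for label, count in clean_rows(RAW):
    totals[label] += count

print(dict(totals))  # {'apple': 150, 'banana': 95}
```

Parsing, cleaning, and aggregation fit in a couple dozen lines with no build step, which is the kind of low-ceremony scripting the comment contrasts with C++.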

After a year of experiments I realized that machine learning and big data pipelines are inseparable. So at first I was thinking R/Python is the greatest. And it might be, until you need to do more than a few isolated models. At that point I reverted to building the pipeline parts with Spring Java + in-memory data grids, because there are so many options.

Torch is a great library, and Lua a fine language, but they are competing against ecosystems built on the world's largest languages.



I use a tiny bit of Python, a little more Lua, and a TON of C++ in the machine learning work I do. Things like opencv, fbthrift, folly, boost, fblualib, and thpp make writing this sort of code in C++ very time-efficient, and if you know what you are doing it will end up performing much better than the alternatives. I only use Python for some light scripting, data collating, and reformatting tasks, and Lua because Torch is my neural network framework of choice.