Is pandas now faster than data.table?

https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping

The data.table benchmarks haven't been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before but would consider switching if pandas can beat data.table.

edited Nov 1 '18 at 15:11

oW_


asked Oct 25 '17 at 2:43

xiaodai


That's a really bad reason to switch to Python.
– Matthew Drury
Oct 25 '17 at 3:47

@MatthewDrury How so? Data and the manipulation of it is 80% of my job. Only 20% is fitting models and presentation. Why shouldn't I choose the one that gives me the results the quickest?
– xiaodai
Oct 25 '17 at 4:31

Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just a single concern among many even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; both communities have a different focus, which lends the languages different strengths.
– Matthew Drury
Oct 25 '17 at 4:52

You have a huge forest that is good for 20% of the work? So don't make a choice that affects 80% of your work? Nothing is stopping me from using pandas to do data prep and then modeling in R, Python or Julia. I think my thinking is sound: if pandas is faster, then I should choose it as my main tool.
– xiaodai
Oct 25 '17 at 6:46

You might find the reticulate package in R of interest/use. Also, increasingly a lot of effort has been put into getting R to work/play with databases (see efforts such as dbplyr).
– slackline
Apr 25 '18 at 13:04


python r pandas data data.table


3 Answers

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).

We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and an SSD. Software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.
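To make the four scenarios concrete, here is a minimal sketch of what each one could look like on the pandas side. This is not the code from the study: the column names (`group`, `x`, `y`), the data size, and the helper names `simulate` and `timed` are assumptions chosen for illustration only.

```python
import time

import numpy as np
import pandas as pd


def simulate(n_rows=10_000, seed=0):
    """Build a small simulated data set (hypothetical layout, not the study's)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "group": rng.integers(0, 10, n_rows),
        "x": rng.normal(size=n_rows),
        "y": rng.normal(size=n_rows),
    })


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


df = simulate()

# One illustrative operation per scenario named in the setup.
scenarios = {
    "select": lambda: df[["group", "x"]],                  # select-like retrieval
    "filter": lambda: df[df["x"] > 0],                     # conditional select
    "sort": lambda: df.sort_values("x"),                   # sort
    "aggregate": lambda: df.groupby("group")["x"].mean(),  # aggregation
}

results = {}
for name, fn in scenarios.items():
    results[name], elapsed = timed(fn)
    print(f"{name:>9}: {elapsed * 1e3:.3f} ms")
```

A single timing per operation is noisy; a fair comparison would repeat each run several times and apply the same procedure to the data.table equivalents.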

Results in a nutshell

data.table seems to be faster when selecting columns (pandas on average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregation results are completely mixed

Please note that I tried to simplify the results as much as possible to not bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel


A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
– Stephen Rauch
Apr 25 '18 at 13:30

As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
– Tobias Krabel
Apr 25 '18 at 18:23

"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
– xiaodai
Apr 25 '18 at 22:18

I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
– Tobias Krabel
Apr 26 '18 at 7:29

"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
– smci
Aug 2 '18 at 18:15

Has anyone done any benchmarks?

Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas. Additionally, other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

Unfortunately, it runs on a machine with 125GB of memory (not 244GB as the original one did). As a result, pandas and dask are unable to attempt the groupby on the 1e9 rows (50GB csv) data because they run out of memory when reading the data. So for pandas vs data.table you have to look at the 1e8 rows (5GB) data.

So as not to just link the content you are asking for, I am pasting recent timings for those solutions.

| in_rows|question              | data.table| pandas|
|-------:|:---------------------|----------:|------:|
|   1e+07|sum v1 by id1         |      0.140|  0.414|
|   1e+07|sum v1 by id1:id2     |      0.411|  1.171|
|   1e+07|sum v1 mean v3 by id3 |      0.574|  1.327|
|   1e+07|mean v1:v3 by id4     |      0.252|  0.189|
|   1e+07|sum v1:v3 by id6      |      0.595|  0.893|
|   1e+08|sum v1 by id1         |      1.551|  4.091|
|   1e+08|sum v1 by id1:id2     |      4.200| 11.557|
|   1e+08|sum v1 mean v3 by id3 |     10.634| 24.590|
|   1e+08|mean v1:v3 by id4     |      2.683|  2.133|
|   1e+08|sum v1:v3 by id6      |      6.963| 16.451|
|   1e+09|sum v1 by id1         |     15.063|     NA|
|   1e+09|sum v1 by id1:id2     |     44.240|     NA|
|   1e+09|sum v1 mean v3 by id3 |    157.430|     NA|
|   1e+09|mean v1:v3 by id4     |     26.855|     NA|
|   1e+09|sum v1:v3 by id6      |    120.376|     NA|

In 4 out of 5 questions data.table is faster, and we can see it scales better.
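For readers unfamiliar with the question labels in the table, the following sketch shows how the five groupby questions might be written in pandas. It is an illustration only, not the benchmark's actual script: the helper names `make_data` and `run_questions`, the tiny data size, and the exact column types (string `id1` to `id3`, integer `id4` and `id6`, numeric `v1` to `v3`) are assumptions based on the column names in the table.

```python
import numpy as np
import pandas as pd


def make_data(n_rows=1_000, n_groups=10, seed=0):
    """Small stand-in data set; the layout is assumed, not taken from the benchmark."""
    rng = np.random.default_rng(seed)

    def labels():
        # String keys such as "id3"; character fields as described in the answer.
        return [f"id{i}" for i in rng.integers(1, n_groups + 1, n_rows)]

    return pd.DataFrame({
        "id1": labels(),
        "id2": labels(),
        "id3": labels(),
        "id4": rng.integers(1, n_groups + 1, n_rows),
        "id6": rng.integers(1, n_groups + 1, n_rows),
        "v1": rng.integers(1, 6, n_rows),
        "v2": rng.integers(1, 6, n_rows),
        "v3": rng.uniform(0, 100, n_rows),
    })


def run_questions(df):
    """One pandas expression per question label used in the timings table."""
    value_cols = ["v1", "v2", "v3"]  # "v1:v3" read as the column range v1..v3
    return {
        "sum v1 by id1": df.groupby("id1")["v1"].sum(),
        "sum v1 by id1:id2": df.groupby(["id1", "id2"])["v1"].sum(),
        "sum v1 mean v3 by id3": df.groupby("id3").agg({"v1": "sum", "v3": "mean"}),
        "mean v1:v3 by id4": df.groupby("id4")[value_cols].mean(),
        "sum v1:v3 by id6": df.groupby("id6")[value_cols].sum(),
    }


data = make_data()
answers = run_questions(data)
for question, result in answers.items():
    print(question, "->", result.shape)
```

On the data.table side, the first question would roughly correspond to `DT[, sum(v1), by = id1]`.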

Note that these timings are as of now, with id1, id2 and id3 as character fields. Those will be changed soon to categorical. Besides, there are other factors that are likely to impact those timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data with NAs and various cardinalities.

Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check it later.

And of course you are welcome to provide feedback in project repo!

edited Nov 1 '18 at 14:37

answered Oct 31 '18 at 21:53

jangorecki


What about JuliaDB?
– skan
Dec 16 '18 at 0:09

@skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
– jangorecki
Dec 17 '18 at 5:17

I know this is an older post, but I figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results through feather.

See feather's GitHub page

answered 6 mins ago

DonQuixote



