Is pandas now faster than data.table?












5












$begingroup$


https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping



The data.table benchmarks haven't been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but I would consider switching if pandas can beat data.table.





















$endgroup$








  • 5




    $begingroup$
    That's a really bad reason to switch to python.
    $endgroup$
    – Matthew Drury
    Oct 25 '17 at 3:47






  • 1




    $begingroup$
    @MatthewDrury how so? Data manipulation is 80% of my job; only 20% is fitting models and presentation. Why shouldn't I choose the tool that gives me results the quickest?
    $endgroup$
    – xiaodai
    Oct 25 '17 at 4:31






  • 1




    $begingroup$
    Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just one concern among many, even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; the two communities have different focuses, which lends the languages different strengths.
    $endgroup$
    – Matthew Drury
    Oct 25 '17 at 4:52








  • 1




    $begingroup$
    You have a huge forest that is good for 20% of the work? So don't make a choice that affects 80% of your work? Nothing is stopping me from using pandas to do data prep and then modeling in R, Python, or Julia. I think my thinking is sound: if pandas is faster, then I should choose it as my main tool.
    $endgroup$
    – xiaodai
    Oct 25 '17 at 6:46










  • $begingroup$
    You might find the reticulate package in R of interest/use. Also, increasingly a lot of effort has been put into getting R to work/play with databases (see efforts such as dbplyr).
    $endgroup$
    – slackline
    Apr 25 '18 at 13:04
















python r pandas data data.table














edited Nov 1 '18 at 15:11









oW_











asked Oct 25 '17 at 2:43









xiaodai



















3 Answers


















8












$begingroup$

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog.



We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.



EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:



Setup



We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.




  • Data retrieval with a select-like operation

  • Data filtering with a conditional select operation

  • Data sort operations

  • Data aggregation operations


The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and an SSD. Software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.
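The study's own benchmark code is not reproduced in this answer. As a rough illustration of how a scenario like the ones listed above can be timed, here is a minimal sketch using only the Python standard library. The helper `best_time`, the stand-in data and the two plain-Python operations are hypothetical and are not the pandas or data.table calls used in the study.

```python
import random
import time


def best_time(operation, repeats=5):
    """Run operation() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)


# Hypothetical stand-in data: 100,000 rows of (group key, value).
random.seed(0)
rows = [(random.randrange(100), random.random()) for _ in range(100_000)]


def filter_rows():
    # Stand-in for the "filtering" scenario: keep rows whose value exceeds 0.5.
    return [row for row in rows if row[1] > 0.5]


def sort_rows():
    # Stand-in for the "sort" scenario: order rows by value.
    return sorted(rows, key=lambda row: row[1])


results = {
    "filter": best_time(filter_rows),
    "sort": best_time(sort_rows),
}
for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```

To compare libraries in this way, one would replace the two stand-in functions with the equivalent library calls and run the same harness over each data set size.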



Results in a nutshell





  • data.table seems to be faster when selecting columns (pandas on average takes 50% more time)


  • pandas is faster at filtering rows (roughly 50% on average)


  • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

  • adding a new column appears faster with pandas

  • aggregation results are completely mixed


Please note that I tried to simplify the results as much as possible to not bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.

















$endgroup$













  • $begingroup$
    A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
    $endgroup$
    – Stephen Rauch
    Apr 25 '18 at 13:30










  • $begingroup$
    As you may have read in my answer, I already said that the results are mixed. Please let me know if I should be more specific in my answer, for example by elaborating on some numbers.
    $endgroup$
    – Tobias Krabel
    Apr 25 '18 at 18:23










  • $begingroup$
    "Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
    $endgroup$
    – xiaodai
    Apr 25 '18 at 22:18










  • $begingroup$
    I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are connecting from?
    $endgroup$
    – Tobias Krabel
    Apr 26 '18 at 7:29










  • $begingroup$
    "4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
    $endgroup$
    – smci
    Aug 2 '18 at 18:15





















4












$begingroup$


Has anyone done any benchmarks?




Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas. Additionally, other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

Unfortunately, it runs on a machine with 125GB of memory (not 244GB as in the original benchmark). As a result, pandas and dask are unable to attempt the groupby on the 1e9-row (50GB csv) data because they run out of memory when reading the data. So for pandas vs data.table you have to look at the 1e8-row (5GB) data.



To not just link the content you are asking for, I am pasting recent timings for those solutions.



| in_rows | question              | data.table | pandas |
|--------:|:----------------------|-----------:|-------:|
|   1e+07 | sum v1 by id1         |      0.140 |  0.414 |
|   1e+07 | sum v1 by id1:id2     |      0.411 |  1.171 |
|   1e+07 | sum v1 mean v3 by id3 |      0.574 |  1.327 |
|   1e+07 | mean v1:v3 by id4     |      0.252 |  0.189 |
|   1e+07 | sum v1:v3 by id6      |      0.595 |  0.893 |
|   1e+08 | sum v1 by id1         |      1.551 |  4.091 |
|   1e+08 | sum v1 by id1:id2     |      4.200 | 11.557 |
|   1e+08 | sum v1 mean v3 by id3 |     10.634 | 24.590 |
|   1e+08 | mean v1:v3 by id4     |      2.683 |  2.133 |
|   1e+08 | sum v1:v3 by id6      |      6.963 | 16.451 |
|   1e+09 | sum v1 by id1         |     15.063 |     NA |
|   1e+09 | sum v1 by id1:id2     |     44.240 |     NA |
|   1e+09 | sum v1 mean v3 by id3 |    157.430 |     NA |
|   1e+09 | mean v1:v3 by id4     |     26.855 |     NA |
|   1e+09 | sum v1:v3 by id6      |    120.376 |     NA |


In 4 out of 5 questions data.table is faster, and we can see that it scales better.
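To make the comparison easier to read, the ratio of pandas time to data.table time can be computed for each question. The sketch below copies the 1e7 and 1e8 timings from the table; the 1e9 rows are left out because pandas has no timing there. The variable names are only for illustration.

```python
# Timings copied from the table: (in_rows, question) -> (data.table, pandas).
timings = {
    ("1e7", "sum v1 by id1"): (0.140, 0.414),
    ("1e7", "sum v1 by id1:id2"): (0.411, 1.171),
    ("1e7", "sum v1 mean v3 by id3"): (0.574, 1.327),
    ("1e7", "mean v1:v3 by id4"): (0.252, 0.189),
    ("1e7", "sum v1:v3 by id6"): (0.595, 0.893),
    ("1e8", "sum v1 by id1"): (1.551, 4.091),
    ("1e8", "sum v1 by id1:id2"): (4.200, 11.557),
    ("1e8", "sum v1 mean v3 by id3"): (10.634, 24.590),
    ("1e8", "mean v1:v3 by id4"): (2.683, 2.133),
    ("1e8", "sum v1:v3 by id6"): (6.963, 16.451),
}

# A ratio above 1 means pandas took longer than data.table on that question.
ratios = {
    key: pandas_time / datatable_time
    for key, (datatable_time, pandas_time) in timings.items()
}

# Count, per data size, the questions on which data.table was faster.
datatable_wins = {}
for (in_rows, _question), ratio in ratios.items():
    datatable_wins[in_rows] = datatable_wins.get(in_rows, 0) + (ratio > 1)

for (in_rows, question), ratio in ratios.items():
    print(f"{in_rows:>4}  {question:<22} pandas/data.table = {ratio:.2f}")
print(datatable_wins)
```

At both sizes the count comes out to 4 of 5, with `mean v1:v3 by id4` being the one question on which pandas is faster.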

Note that these timings are as of now, where id1, id2 and id3 are character fields. Those will soon be changed to categorical. Besides, there are other factors that are likely to impact these timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data with NAs and for various cardinalities.
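For readers unfamiliar with the distinction: a character field stores a string in every row, whereas a categorical field stores each distinct label once plus an integer code per row, so grouping can work on small integers. The following is a minimal stdlib sketch of that idea only. It is not how data.table or pandas implement it, and the function `encode_categorical` and the sample values are hypothetical.

```python
def encode_categorical(values):
    """Return (categories, codes): each distinct label once, plus one integer code per row."""
    categories = []
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            # First time this label is seen: assign the next free code.
            code_of[value] = len(categories)
            categories.append(value)
        codes.append(code_of[value])
    return categories, codes


# Hypothetical sample resembling an id column and a value column.
id1 = ["id016", "id039", "id016", "id047", "id039"]
v1 = [5, 3, 1, 2, 4]

categories, codes = encode_categorical(id1)

# "sum v1 by id1" using integer codes as direct indexes into the result.
sums = [0] * len(categories)
for code, value in zip(codes, v1):
    sums[code] += value

for label, total in zip(categories, sums):
    print(label, total)
```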



Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check back later.

And of course you are welcome to provide feedback in the project repo!

















$endgroup$













  • $begingroup$
    What about JuliaDB?
    $endgroup$
    – skan
    Dec 16 '18 at 0:09










  • $begingroup$
    @skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
    $endgroup$
    – jangorecki
    Dec 17 '18 at 5:17



















0












$begingroup$

I know this is an older post, but I figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results through feather.



See feather's GitHub page.





DonQuixote is a new contributor to this site.






$endgroup$






































    edited Apr 26 '18 at 7:45

























    answered Apr 25 '18 at 12:41









    Tobias Krabel













    • $begingroup$
      A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
      $endgroup$
      – Stephen Rauch
      Apr 25 '18 at 13:30










    • $begingroup$
      As you may have read in my answer, I already say that the results are mixed. Please clarify whether I should be more specific in my answer, potentially elaborating on some numbers.
      $endgroup$
      – Tobias Krabel
      Apr 25 '18 at 18:23










    • $begingroup$
      "Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
      $endgroup$
      – xiaodai
      Apr 25 '18 at 22:18










    • $begingroup$
      I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are trying to connect from?
      $endgroup$
      – Tobias Krabel
      Apr 26 '18 at 7:29










    • $begingroup$
      "4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
      $endgroup$
      – smci
      Aug 2 '18 at 18:15

































    4












    $begingroup$


    Has anyone done any benchmarks?




    Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas. Other software has been added as well. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

    Unfortunately, it currently runs on a machine with 125GB of memory (not 244GB like the original one). As a result, pandas and dask cannot even attempt the groupby on the 1e9-row (50GB csv) data, because they run out of memory when reading the data. So for pandas vs data.table you have to look at the 1e8-row (5GB) data.

    So as not to just link to the content you are asking for, I am pasting recent timings for those two solutions.



    | in_rows|question              | data.table| pandas|
    |-------:|:---------------------|----------:|------:|
    |   1e+07|sum v1 by id1         |      0.140|  0.414|
    |   1e+07|sum v1 by id1:id2     |      0.411|  1.171|
    |   1e+07|sum v1 mean v3 by id3 |      0.574|  1.327|
    |   1e+07|mean v1:v3 by id4     |      0.252|  0.189|
    |   1e+07|sum v1:v3 by id6      |      0.595|  0.893|
    |   1e+08|sum v1 by id1         |      1.551|  4.091|
    |   1e+08|sum v1 by id1:id2     |      4.200| 11.557|
    |   1e+08|sum v1 mean v3 by id3 |     10.634| 24.590|
    |   1e+08|mean v1:v3 by id4     |      2.683|  2.133|
    |   1e+08|sum v1:v3 by id6      |      6.963| 16.451|
    |   1e+09|sum v1 by id1         |     15.063|     NA|
    |   1e+09|sum v1 by id1:id2     |     44.240|     NA|
    |   1e+09|sum v1 mean v3 by id3 |    157.430|     NA|
    |   1e+09|mean v1:v3 by id4     |     26.855|     NA|
    |   1e+09|sum v1:v3 by id6      |    120.376|     NA|


    In 4 out of 5 questions data.table is faster, and we can see it scales better.
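To make the "4 out of 5" count concrete, the following sketch computes the pandas / data.table time ratio for the 1e8-row timings pasted in the table above. The numbers are copied from that table; the unit is assumed to be seconds, which the answer does not state, and the variable names are only illustrative.

```python
# Timings for 1e8 rows, copied from the table above (unit assumed to be seconds):
# question -> (data.table, pandas)
timings_1e8 = {
    "sum v1 by id1": (1.551, 4.091),
    "sum v1 by id1:id2": (4.200, 11.557),
    "sum v1 mean v3 by id3": (10.634, 24.590),
    "mean v1:v3 by id4": (2.683, 2.133),
    "sum v1:v3 by id6": (6.963, 16.451),
}

# A ratio above 1 means pandas took longer than data.table on that question.
ratios = {q: pandas_s / dt_s for q, (dt_s, pandas_s) in timings_1e8.items()}
data_table_wins = sum(r > 1 for r in ratios.values())

for question, ratio in ratios.items():
    print(f"{question:<24}pandas / data.table = {ratio:.2f}")
print(f"data.table faster in {data_table_wins} of {len(ratios)} questions")
```

On these numbers pandas is only ahead on "mean v1:v3 by id4"; on the other four questions it takes roughly 2.3 to 2.8 times as long as data.table.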

    Note that these timings are as of now, when id1, id2 and id3 are character fields. Those will be changed to categorical soon. Besides, there are other factors that are likely to affect those timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data containing NAs and for various cardinalities.
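For readers unfamiliar with the character vs categorical distinction on the pandas side, here is a small sketch of the first question ("sum v1 by id1") with both key types. The miniature data set is hypothetical and much smaller than the benchmark's; only the column names `id1` and `v1` follow the table above. Whether the categorical version is actually faster depends on the data and the pandas version, so it should be measured rather than assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the benchmark data: id1 is a string key, v1 an integer value.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "id1": rng.choice([f"id{i:03d}" for i in range(1, 101)], size=n_rows),
    "v1": rng.integers(1, 6, size=n_rows),
})

# "sum v1 by id1" on the original character column.
sum_by_string = df.groupby("id1")["v1"].sum()

# The same question after converting the key to a categorical column.
# observed=True keeps only categories that actually occur in the data.
df["id1_cat"] = df["id1"].astype("category")
sum_by_category = df.groupby("id1_cat", observed=True)["v1"].sum()

# Both approaches must give identical sums per group.
assert (sum_by_string.to_numpy() == sum_by_category.to_numpy()).all()
```

Both groupbys return the same sums in the same key order; only the internal representation of the key differs.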



    Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check back later.

    And of course you are welcome to provide feedback in the project repo!

















    $endgroup$













    • $begingroup$
      What about JuliaDB?
      $endgroup$
      – skan
      Dec 16 '18 at 0:09










    • $begingroup$
      @skan you can track the status of that at github.com/h2oai/db-benchmark/issues/63
      $endgroup$
      – jangorecki
      Dec 17 '18 at 5:17
























    edited Nov 1 '18 at 14:37

























    answered Oct 31 '18 at 21:53









    jangorecki























    0












    $begingroup$

    I know this is an older post, but I figured it may be worth mentioning: using feather (in R and in Python) lets you operate on data frames / data tables in either language and share the results between the two through feather.



    See feather's GitHub page.











    $endgroup$



























        answered 6 mins ago









        DonQuixote

































