Create new data frames from existing data frame based on unique column values
I have a large data set (4.5 million rows, 35 columns). The columns of interest are `company_id` (string) and `company_score` (float). There are approximately 10,000 unique `company_id`s.
company_id  company_score  date_submitted  company_region
AA          .07            1/1/2017        NW
AB          .08            1/2/2017        NE
CD          .0003          1/18/2017       NW
My goal is to create approximately 10,000 new dataframes, one per unique `company_id`, each containing only that company's rows.
My first idea was to create the collection of empty data frames shown below, then loop through the original data set and append rows to each one based on its matching `company_id`.
company_dictionary = {}
for company in df['company_id']:
    company_dictionary[company] = pd.DataFrame()
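For concreteness, a minimal sketch of this loop-and-filter idea (assuming the `df` described above) might look like:

# Build one sub-DataFrame per unique company id.
# Note: this scans all 4.5M rows once per unique id (~10,000 scans),
# so it is slow at this scale compared to a single groupby pass.
company_dictionary = {}
for company in df['company_id'].unique():
    company_dictionary[company] = df[df['company_id'] == company]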
Is there a better way to do this by leveraging pandas? i.e., is there a way I can use a built-in pandas function to create new filtered dataframes with only the relevant rows?
Edit: I tried a new approach, but I'm now encountering an error message that I don't understand.
[In] unique_company_id = np.unique(df[['ID_BB_GLOBAL']].values)
[In] unique_company_id
[Out] array(['BBG000B9WMF7', 'BBG000B9XBP9', 'BBG000B9ZG58', ..., 'BBG00FWZQ3R9',
'BBG00G4XRQN5', 'BBG00H2MZS56'], dtype=object)
[In] for id in unique_company_id:
[In]     new_df = df[df['id'] == id]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C: in get_loc(self, key, method, tolerance)
   2133         try:
-> 2134             return self._engine.get_loc(key)
   2135         except KeyError:

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-50-dce34398f1e1> in <module>()
      1 for id in unique_company_id:
----> 2     new_df = df[df['id'] == id]

C: in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060
   2061     def _getitem_column(self, key):

C: in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067
   2068         # duplicate columns & possible reduce dimensionality

C: in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

C: in get(self, item, fastpath)
   3541
   3542         if not isnull(item):
-> 3543             loc = self.items.get_loc(item)
   3544         else:
   3545             indexer = np.arange(len(self.items))[isnull(self.items)]

C: in get_loc(self, key, method, tolerance)
   2134                 return self._engine.get_loc(key)
   2135             except KeyError:
-> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2137
   2138             indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)()
KeyError: 'id'
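(As Emre notes in the comments below, the KeyError arises because `df` has no column named `'id'`. A minimal fix, filtering on the column the ids actually came from, would read:)

# The mask column must exist in df; here the ids came from ID_BB_GLOBAL.
# Using `uid` instead of `id` also avoids shadowing Python's built-in id().
for uid in unique_company_id:
    new_df = df[df['ID_BB_GLOBAL'] == uid]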
Tags: python, pandas, dataframe
asked Apr 2 '18 at 18:45 by ForsakenPlague; edited Apr 3 '18 at 16:43 by Aditya
Group by `company_id`, then iterate over the results. Welcome to the site!
– Emre, Apr 2 '18 at 20:32

You try to access `df['id']`, but there is no such column. Did you mean `company_id`?
– Emre, Apr 3 '18 at 16:45
2 Answers
You can group by the `company_id` column and convert the result into a dictionary of DataFrames:
import pandas as pd

df = pd.DataFrame({
    "company_id": ["AA", "AB", "AA", "CD", "AB"],
    "company_score": [.07, .08, .06, .0003, .09],
    "company_region": ["NW", "NE", "NW", "NW", "NE"]})

# Approach 1
dict_of_companies = {k: v for k, v in df.groupby('company_id')}

# Approach 2
dict_of_companies = dict(tuple(df.groupby("company_id")))

import pprint
pprint.pprint(dict_of_companies)
Output:
{'AA': company_id company_region company_score
0 AA NW 0.07
2 AA NW 0.06,
'AB': company_id company_region company_score
1 AB NE 0.08
4 AB NE 0.09,
'CD': company_id company_region company_score
3 CD NW 0.0003}
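Looking up a single company's rows is then a plain dictionary access, for example:

# Fetch the sub-DataFrame for one company id.
aa = dict_of_companies["AA"]
print(aa["company_score"].mean())  # 0.065 for the toy data above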
answered Apr 3 '18 at 5:24 by tuomastik
Can you please explain why/how Approach 1 works? I've had a lot of difficulty with `groupby`, mainly because it returns a GroupBy object on which you have to run another aggregating function. But I tried this for a similar problem I had and it worked. Now I'm very curious how/why! Thanks
– A. K., Sep 28 '18 at 15:47
When you iterate over the `groupby` object, a tuple of length 2 is returned on each loop. The first item of the tuple corresponds to a unique `company_id` and the second item corresponds to a `DataFrame` containing the rows from the original `DataFrame` which are specific to that unique `company_id`.
– tuomastik, Sep 30 '18 at 10:45
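To make that tuple structure concrete, a quick sketch using the answer's toy `df`:

# Each iteration yields a (key, sub_frame) pair.
for company_id, sub_frame in df.groupby("company_id"):
    print(company_id, sub_frame.shape)
# AA (2, 3)
# AB (2, 3)
# CD (1, 3)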
new = old[['A', 'C', 'D']].copy()
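That line copies a column subset; for the question's row-per-company goal, the analogous select-then-copy with a boolean mask would be (a sketch, assuming the question's `df`):

# Same select-then-copy pattern, but filtering rows instead of columns,
# so later edits don't touch (or warn about) the original frame.
aa_rows = df[df['company_id'] == 'AA'].copy()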
answered 1 hour ago by Coddy (new contributor)