Fix first two levels of decision tree?

I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals:

1. Build a separate tree for each combination of country and product type, train each on the corresponding subset of the data, and route each observation to its tree for prediction. I saw this suggested here in the comments. I have 88 levels in country and 3 levels in product type, so this will generate 264 trees.

2. Build a basic tree with only two variables, country and product type, with an appropriate cp value so that all 264 combinations appear as leaf nodes. Build a second tree with all the remaining variables, and stack tree one on top of tree two to form a single decision tree.

I don't think the first one is the right way to do it. I am also stuck on how to stack the trees in the second approach; even if it is not the right way, I would love to know how to achieve it.

Please guide me on how to approach the problem. Thanks.

asked Nov 1 '16 at 12:03 by Aravind
edited May 23 '17 at 12:38 by Community♦

Why do you not like the first method?
– Hobbes, Nov 1 '16 at 14:58

@Hobbes It will be hard to monitor and tune the performance of each tree.
– Aravind, Nov 2 '16 at 0:46

What is the business problem? I had a similar case. We wanted the best set of prospects to target for each country/product group. The business felt that prospects in say South Africa for product A are very different from prospects in South Korea for product B. I could argue the merits of different marketing campaigns/messages/etc but that is the business's decision. I did not look at it as fixing the first 2 levels of the tree or any unnatural adjustments to an algorithm. I looked at it as how to find the best set of prospects for each country/product combination. Where I did not have enough d
– Craig, Mar 3 '17 at 10:09

@Aravind If you are worried about the tuning of each tree in Approach 1 then I would caution you that you might not be on the right track. Your decision to, essentially, hard-code the first two levels should be based on some business rules. If your intent is to keep the algorithm fixed then, are you really writing an algorithm? Are you not introducing a form of bias into your overall model? I would only be comfortable in proceeding if these choices were hard-coded and would rarely change. Otherwise you need to push back on the business and make them aware of the potential bias.
– I_Play_With_Data, Oct 25 '18 at 18:02

machine-learning r predictive-modeling decision-trees

2 Answers

Depending on which tree algorithm you want to use, you could construct the first two levels of the tree manually. You can follow the pseudocode explained, for example, here for the C4.5 tree. Once you have done this, you can remove the two features from the data set and build trees for the remaining part. If you want to create an rpart object, you would need to reuse parts of the rpart source, which may be a bit more demanding. Depending on the tree algorithm, you may only get a binary split at each of the two levels, in which case you would need to build only 4 separate trees rather than 264. Note that the result may not be the optimal decision tree: after the first two levels, country and product type may still be variables that would cause a split. Without seeing the data it is impossible to tell.
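The idea of hard-coding the top levels and fitting ordinary models underneath can be sketched as a single predictor object. This is a minimal illustration in plain Python, not rpart or C4.5: the class names `FixedTopTree` and `MeanLeaf`, the feature names, and the toy data are all hypothetical, and `MeanLeaf` is only a stand-in for whatever tree learner you would actually fit on each partition.

```python
from collections import defaultdict
from statistics import mean


class MeanLeaf:
    """Placeholder sub-model (assumption): predicts the mean target of its partition."""

    def fit(self, rows, targets):
        self.value = mean(targets)
        return self

    def predict(self, row):
        return self.value


class FixedTopTree:
    """Routes on fixed key features first, then delegates to a per-partition sub-model."""

    def __init__(self, key_features, make_submodel=MeanLeaf):
        self.key_features = key_features
        self.make_submodel = make_submodel

    def _key(self, row):
        return tuple(row[name] for name in self.key_features)

    def _strip(self, row):
        # Drop the fixed features so sub-models only see the remaining attributes.
        return {k: v for k, v in row.items() if k not in self.key_features}

    def fit(self, rows, targets):
        groups = defaultdict(lambda: ([], []))
        for row, y in zip(rows, targets):
            xs, ys = groups[self._key(row)]
            xs.append(self._strip(row))
            ys.append(y)
        self.submodels = {
            key: self.make_submodel().fit(xs, ys) for key, (xs, ys) in groups.items()
        }
        # Fallback for key combinations never seen in training.
        self.fallback = mean(targets)
        return self

    def predict(self, row):
        model = self.submodels.get(self._key(row))
        if model is None:
            return self.fallback
        return model.predict(self._strip(row))


# Toy data with hypothetical column names.
rows = [
    {"country": "ZA", "product_type": "A", "age": 30},
    {"country": "ZA", "product_type": "A", "age": 50},
    {"country": "KR", "product_type": "B", "age": 40},
]
targets = [10.0, 20.0, 7.0]

tree = FixedTopTree(["country", "product_type"]).fit(rows, targets)
print(tree.predict({"country": "ZA", "product_type": "A", "age": 35}))
print(len(tree.submodels))
```

From the caller's side this behaves like one model with the first two levels fixed, while each partition can still be tuned or inspected separately through `tree.submodels`.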

As a side note, it may be valuable to explain to the business that country and product type are not the most sensible variables to have at the top of the decision tree. Sometimes it is better to educate the end users than to force machine learning to do something inaccurate. In my experience, end users prefer a correct solution over one that merely matches their gut feeling about how it should work.

answered Nov 2 '16 at 11:47 by Stereo

I have 88 levels in country and 3 levels in product type, so it will be 264 trees. If there were only 4 separate trees, I would take the easy option, namely the first choice. I feel it will be easier to convince the end users when I have results for both what they want and the correct way of solving the problem. Can you help me find reference material for stacking two trees after they have been built completely?
– Aravind, Nov 3 '16 at 1:05

What you could do is calculate the entropy or Gini for country, product type, and the first feature that gets selected by CHAID and C4.5. Educate users on these metrics. If that fails, you can always go back. Additionally, when you run a binary decision tree, the first splits will lump countries and/or products together, so you get at a minimum 4 subtrees.
– Stereo, Nov 3 '16 at 10:12
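The comparison suggested in the comment above could look roughly like the following sketch. It computes the impurity decrease of a multiway split on a categorical feature. Gini applies to class labels; because the question is about a regression tree, variance is included as a swapped-in impurity for numeric targets, which is an assumption on my part and not something the comment states. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def variance(values):
    """Population variance, a common impurity measure for regression targets."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def impurity_decrease(feature_values, targets, impurity):
    """Parent impurity minus the size-weighted impurity of a multiway split."""
    groups = defaultdict(list)
    for value, target in zip(feature_values, targets):
        groups[value].append(target)
    n = len(targets)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(targets) - weighted


# Toy data with hypothetical columns.
country = ["ZA", "ZA", "KR", "KR"]
product = ["A", "B", "A", "B"]
label = ["high", "high", "low", "low"]
amount = [10.0, 12.0, 2.0, 4.0]

print(impurity_decrease(country, label, gini))
print(impurity_decrease(product, label, gini))
print(impurity_decrease(country, amount, variance))
```

In this toy data country separates the labels perfectly and product type does not, so the decrease is larger for country. On real data, keep in mind that a feature with many levels (such as 88 countries) tends to score well on this kind of measure simply because it produces many small groups.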

I think you could do this fairly automatically if you're open to using Python. A library called auto_ml* has a feature called categorical ensembling, where you can explicitly say "I want a model built for each level of this feature". If you made a feature that was country-product type and used that as your category, the rest should be pretty easy.

*Disclosure: I've made minor contributions to auto_ml. It is FOSS under the MIT license.
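Building the combined country-product type feature does not need any particular library. The sketch below uses plain Python and deliberately does not call auto_ml, whose exact API is not shown in this answer. The column names, the `country_product` feature name, and the `min_rows` threshold are assumptions for illustration; the size check is an extra step I am adding, since with 264 combinations some groups may have very few rows.

```python
from collections import Counter


def add_combined_key(rows, left="country", right="product_type", name="country_product"):
    """Return copies of the rows with one extra categorical feature joining two columns."""
    return [{**row, name: f"{row[left]}-{row[right]}"} for row in rows]


def small_groups(rows, name="country_product", min_rows=2):
    """List category levels that have fewer than min_rows training rows."""
    counts = Counter(row[name] for row in rows)
    return sorted(level for level, n in counts.items() if n < min_rows)


# Toy data with hypothetical column names.
rows = [
    {"country": "ZA", "product_type": "A"},
    {"country": "ZA", "product_type": "A"},
    {"country": "KR", "product_type": "B"},
]
keyed = add_combined_key(rows)
print(keyed[0]["country_product"])
print(small_groups(keyed))
```

The combined column can then be handed to whichever per-category modelling feature you use, and the list of small groups shows where a separate model may not have enough data to be fitted reliably.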

answered Jun 1 '17 at 12:26 by CalZ
