Question

Topic: Research/Metrics

Regression Output

Posted by Anonymous (1955 points)
Can someone sanity check these regression results for me? The results seem counter-intuitive to me.

Given the low p-value, I'm rejecting the null that the x variable has no effect on the dependent variable. But the low R-squared leads me to believe the results aren't helpful anyway...? Yet the overall Significance F seems to say that the model is valid.

So, in other words, the equation is saying that with every unit increase in x we would expect an increase of $99,542 (the y variable being measured in dollars). Yet when I look at the data I see an inverse relationship (observations with few units have higher dollars, and observations with more units (5 is the limit) have far fewer dollars)... Not sure what the problem is or what's happening.

Regression Statistics

Multiple R = 0.125976465
R Square = 0.01587007
Adjusted R Square = 0.015694458
Standard Error = 214664.046
Observations = 5606
ANOVA
              df     SS           MS           F            Significance F
Regression    1      4.16431E+12  4.16431E+12  90.37004933  2.85523E-21
Residual      5604   2.58236E+14  46080652625
Total         5605   2.624E+14

              Coefficients  Standard Error  t Stat        P-value      Lower 95%     Upper 95%
Intercept     -41406.89864  11501.13051     -3.600245958  0.000320666  -63953.56927  -18860.22801
X Variable 1   99542.75227  10471.22242      9.506316286  2.85523E-21   79015.10042  120070.4041
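As a sanity check on output like this, the headline numbers (coefficients, t-stats, R-squared) can be reproduced from the raw (x, y) columns with a short script. A minimal sketch, using synthetic stand-in data (not the actual engagement data) with x in 1..5:

```python
# Sketch: reproduce a simple-regression summary (coefficients, t-stats,
# R-squared) from raw (x, y) data. The data here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n = 5606
x = rng.integers(1, 6, size=n).astype(float)          # 1..5 business units
y = -41000.0 + 99500.0 * x + rng.normal(0, 215000, size=n)

X = np.column_stack([np.ones(n), x])                  # intercept + slope
beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS coefficients

resid = y - X @ beta
dof = n - 2
mse = resid @ resid / dof                             # residual mean square
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))   # standard errors
t_stats = beta / se

ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - (resid @ resid) / ss_tot

print("coefficients:", beta)
print("t-stats:", t_stats)
print("R-squared:", r_squared)
```

Comparing these hand-computed values against the spreadsheet's output is a quick way to confirm the tool ran the regression you think it ran.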

THANKS MUCH!

-Brian

RESPONSES

  • Posted by koen.h.pauwels on Accepted
    Hi Brian,

    Your model looks fine (with the available information), but let's interpret the relevant numbers:
    1) careful with the signs: the coefficient on x is positive (+99,542; t = 9.5), while the negative number in your output (-41,406) is the intercept. So the fitted line really does say y rises with x, which is exactly the tension with your eyeballing of the data
    2) this coefficient estimate is significantly different from zero, so you are correct in rejecting the null of no effect
    3) the F-value of the total model is significant, so your model indeed helps you to explain the dependent variable
    4) is the R-squared too low? It really depends on your perspective. In marketing academia, consumer behavior researchers typically get R-squares of less than 0.10 (if they reported them :-)), while quantitative researchers like myself get R-squares of over 0.80. The main reason is that quant researchers often have data on more variables (e.g. prices of the company and its competitors), and thus can explain more variance in sales than models that only have, say, the city market's average demo- and psychographics. By the way, it is easy to increase your R-square in time-series models by including Y(t-1), e.g. past sales. Is your model then necessarily better?

    In short, I do not think your model is garbage. The low R-squared does indicate it is a good idea to add more independent variables in order to better explain (and predict) sales. At the same time, check whether your data are far from normal, in which case transformations may help you get a close-to-normal distribution. You hint at this by saying there is an upper bound of 5 for units, which makes me think other variables may have such bounds too.
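The point in 4) about Y(t-1) inflating R-squared can be demonstrated with a quick simulation; the series and coefficients below are made up purely for illustration:

```python
# Sketch: adding the lagged dependent variable Y(t-1) to a time-series
# regression typically raises R-squared without the model necessarily
# being more useful. All data here are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
T = 500
x = rng.normal(size=T)
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):                         # persistent "sales" series
    y[t] = 0.9 * y[t - 1] + 0.3 * x[t] + rng.normal()

def r_squared(features, target):
    X = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return 1 - resid @ resid / np.sum((target - target.mean()) ** 2)

r2_static = r_squared(x[1:], y[1:])                               # y ~ x
r2_dynamic = r_squared(np.column_stack([x[1:], y[:-1]]), y[1:])   # y ~ x + y(t-1)
print(r2_static, r2_dynamic)
```

The second R-squared is much higher, yet the lag term explains persistence in sales rather than anything a marketer can act on.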
  • Posted by Dawson on Accepted
    Brian,

    Let me ask one question: what type of variable is the dependent? Is it ordinal or continuous? I ask because if the data is ordinal, OLS regression may not be the best way to analyse it.

    On the basis that the data is continuous and represents some measure of "sales", I've not seen R-squared measures this low in this type of model. At the very least, if you are explaining this in a commercial environment (as opposed to an academic one), you will face some challenges explaining the validity of what you have done here. Rightly or wrongly, very few people will feel instinctively comfortable with your results.

    At the very least, the R-squared value should tell you that there are many more important factors which can be used to explain performance.
    John
  • Posted on Accepted
    Hello Brian,

    Is there a way to have a look at your data? E.g. can you post ~10 data points here? To protect your data, it's fine with me if you rescale them. I'm only interested in the visual impression you talk about.

    A counter-intuitive result can appear, e.g., when you try to fit a linear function (y = mx + b) to an inverse function (y = p/x + q): very strange numerical things can happen, usually with large (residual) errors and low R values. The two functions simply don't fit each other well, at least over a wide data range; they can fit well over a short range.

    Which brings me to the question of what you are trying to accomplish. Are you looking for the best possible fit, i.e. the best possible description of your findings? Then you should select a fit function which exhibits the important properties of your observed data (don't fit each and every peak). This can be a data transformation (y -> 1/y or x -> 1/x) as blanalytics states, it can be a shift, a polynomial (better: don't), or other functions. Depending on your purpose, you may even argue that a bad linear fit is still good enough.
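The inverse-law point above can be illustrated numerically; p, q and the noise level below are arbitrary choices:

```python
# Sketch: data generated from an inverse law y = p/x + q fit worse with a
# straight line in x than after the transformation x -> 1/x, which makes
# the relationship linear. p = 100, q = 10 are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 5.0, size=400)
y = 100.0 / x + 10.0 + rng.normal(0, 2.0, size=400)   # y = p/x + q + noise

def r_squared(feature, target):
    X = np.column_stack([np.ones(len(target)), feature])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return 1 - resid @ resid / np.sum((target - target.mean()) ** 2)

r2_linear = r_squared(x, y)            # straight line in x
r2_inverse = r_squared(1.0 / x, y)     # straight line in 1/x
print(r2_linear, r2_inverse)
```

The transformed fit recovers nearly all the variance, while the straight line in x leaves a systematic pattern in the residuals.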

    These are fundamental questions you have to answer more from an engineering perspective. It’s an input, a decision you make. Statistics can’t answer this for you. - Identifying a variable as a significant contributor is a different story. A way to do it is to start with fitting the most prominent variable to your data and fit the residuals to other variables, step-by-step. However, this is not the best way to do it.
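The step-by-step residual approach mentioned above can be sketched as follows (with the same caveat: a joint multiple regression is normally preferable); the data are simulated:

```python
# Sketch: fit the most prominent variable first, then regress the residuals
# on the next variable. Illustrative only -- as noted in the text, this is
# not the best way to do it; a joint multiple regression is usually better.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 5.0 * x1 + 2.0 * x2 + rng.normal(size=n)   # true effects: 5 and 2

def fit(feature, target):
    X = np.column_stack([np.ones(len(target)), feature])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta, target - X @ beta

beta1, resid1 = fit(x1, y)        # step 1: y ~ x1
beta2, resid2 = fit(x2, resid1)   # step 2: residuals ~ x2
print(beta1[1], beta2[1])
```

With independent regressors this recovers both effects reasonably well; with correlated regressors the step-by-step estimates become biased, which is one reason the joint fit is preferred.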

    This brings me more to the research perspective. Is this a new research field? Then you can only look for a best, plausible fit as I described above. Do you have (let’s exaggerate it a little) ‘scientific’ reasoning available for this problem? E.g. in physics we know something about forces and gravity and we can expect a certain ‘law’ in measurements, e.g. y = m * x^2, where y is a distance and x is time and we would try to fit a good m-value (using y = m * x^2 + n * x + o would be a more careful approach). So comparing data to this expectation would be more than reasonable.

    In other words: can you base your intuitive expectation on some, commonly accepted, reasoning and formulate it as some fit-equation? That would be a good starting point for refinements.

    Other questions I have are about the quality of your data and about the reliability of any future predictions you may want to make. If your data are unreliable, statistics will report this to you as large errors of one kind or another. And no matter how good your fit turns out to be today, how will you know how good it will be tomorrow, when applied to a different (data) situation? (I could give you the route to those results, but that is another story.)

    Hope this helps,
    Michael
  • Posted on Author
    Thank you for all of the responses, well, for most of them anyway. As I said, I'm looking for someone (interested in being helpful) who might be able to sanity check the results. I appreciate those of you who had this in mind and took the time to respond.

    The hypothesis I'm testing is whether the expected revenue (y) grows with additional business units sold into an engagement. The results shown above seem to say that the more business units the more revenue (a seemingly logical conclusion, I know).

    HOWEVER, when we look at the actual data, what we're finding is that the engagements that have only one business unit involved have higher per-engagement revenue. The engagements that have five business units involved have resulted in not only lower per-business-unit revenue, but (generally speaking) lower total revenue. So the conclusion that the more business units we involve in an engagement, the higher the engagement revenue (as stated by the equation) doesn't appear to make sense.

    That being said, the argument that the data just doesn't seem to make sense is okay (thanks for putting it so professionally). It is indeed possible that we have not run the right regression (etc...) procedure... hence the posting.

    Again, many thanks to those who took the time to provide an actual response.
  • Posted by steven.alker on Member
    Brian

    Sorry I missed this one – I like the maths questions. The figures don't look bonkers, but they are certainly counter-intuitive.

    Simple question – when you plot your data sets, are they approximately linear or not? If not, you can't use this type of regression, as you'll get different answers for different start points and two answers for every attempt! (Or 3 or 4, depending on the power law.)


    Maybe a bit of visual analysis could help, as you are potentially working in n-dimensional space. Tableau Software is pretty good at this, allowing you to visualise information via cross-tabs in 2, 3 and n dimensions.

    Once you can see how your fixed and contingent variables look, you might try some linearization – only then will a regression make sense.
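The visual-analysis suggestion can start as simply as tabulating mean revenue per unit count (x runs 1 to 5 in Brian's data); the figures below are invented to mimic the inverse pattern Brian describes:

```python
# Sketch: before fitting anything, look at mean revenue at each unit count.
# The data here are invented to illustrate an inverse pattern; they are not
# Brian's actual engagement figures.
import numpy as np

rng = np.random.default_rng(4)
units = rng.integers(1, 6, size=2000)                          # 1..5 units
revenue = 500000.0 / units + rng.normal(0, 50000, size=2000)   # inverse pattern

for u in range(1, 6):
    mean_rev = revenue[units == u].mean()
    print(f"{u} unit(s): mean revenue ~ {mean_rev:,.0f}")
```

If a table like this shows revenue falling as units rise while the fitted slope is positive, that is a strong hint the linear specification (or the data coding) is off.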

    Steve Alker
