6.2.1. Descriptive statistics for model building
Leading players in the US market include International Business Machines Corp. (IBM), Microsoft Corp. and Oracle Corp. (MarketLine, 2018g).
The US market is characterized by high competition and constant technological progress, requiring specific and modifiable knowledge resources (MarketLine, 2018g). Patents are important in the industry, but very complex: patent infringement is often a problem for new market entrants, and copyright wars as well as anti-trust lawsuits are common (MarketLine, 2018d). Furthermore, the industry is characterized by a high disruptive potential and hence an elevated degree of risk (Chitkara, Gloger, & McCaffrey, 2018). This is demonstrated, for example, by recently imposed regulations, such as the EU’s General Data Protection Regulation, which addresses unforeseen effects of the industry’s products (Chitkara et al., 2018). To keep pace with technological changes and new standards, large corporations frequently engage in M&A-related activities to gain access to the technological capabilities of innovative smaller companies (MarketLine, 2018g).
In our dataset, organizations active in the industry with the three-digit SIC code 737 are most prevalent, amounting to 90 out of 161 investors with a total of 1523 investments and an estimated equity of over USD 7 billion. Google is the most active investor with 305 investments, followed by Microsoft Corp. and the CVC unit of SAP SE.
In summary, the industry overview shows that technology plays a major role in all three industries, and that these industries constantly have to adapt to change. As CVC is often regarded as a means to access new technologies, it appears reasonable that these three industries are the most active investors in CVC. Entrepreneurs in these three segments, next to Telecommunication & Networking, have also been most targeted by CVC activity (Cumming, 2012). Furthermore, the importance of patenting activities in these sectors is highlighted, which supports the application of the chosen predictors in the model based on patent information.
we will look at how the variables are correlated, and make necessary changes to address potential multicollinearity problems, which could impede internal validity of the model as described in section 5.3.3.
a) Distributions and summary statistics
Table 15 shows a summary of the main statistics of each of the variables in our analysis. As previously explained, only investors based in the United States are included (161 distinct CVC units). A detailed summary as well as the graphed distribution of each of the variables can be found in Appendix H.
Table 15: Summary statistics of variables
Variable                      Obs.   Mean       Std. Dev.  Median    Min    Max
cum_fc_g                      136    25874.47   133659     1770.5    0      1485383
share_bsc_app_cum             156    .0489736   .0689145   .019913   0      .3261895
cum_patents_app               161    1332       6388.593   58        0      75546
sd_tot_uspc_app               144    39.68134   81.65334   7.59869   0      680.4418
cum_dist_uspc_app             156    32.00641   45.09338   14.5      0      298
num_investments_tot           161    30.82609   130.7358   5         1      1551
equity_est_firmname_tot       161    138.0093   690.5847   21.24     .066   8471.891
same_sic_proportion_mean      159    .5994788   .3508371   .685185   0      1
same_nation_proportion_mean   161    .8927167   .2207646   1         0      1
comp_age_avg_mean             157    4.461598   3.42369    4.2       0      37.25
num_coinvestors_round_mean    161    4.12845    2.031323   4.0625    0      13
num_corpinv_round_mean        161    .569109    .5240646   .5        0      2.666667
Note. Obs. = number of observations.
As previously explained, forward citations were counted by grant date, whereas the number of patents was retrieved by application date. This is the main reason why the number of observations is lower for forward citations (cum_fc_g) than for the number of patents (cum_patents_app): A few organizations do not have any granted patents by the date of their last investment (and hence, forward citations are essentially missing values in these cases), but have applied-for patents (which were later granted). The distributions of both variables, however, look very similar. For both total forward citations and total number of patents, the distribution is highly skewed to the right. The maximum number of forward citations is 1,485,383; these are forward citations of granted patents held by International Business Machines Corp (IBM), a company active in the industry Computer Integrated Systems Design (sic_4=7373). This is an extreme value, however, as 50% of organizations received 1,771 or fewer (median value), and 99% of organizations received 359,559 or fewer forward citations to their granted patents. Similarly, the highest number of applied-for (and later granted) patents, namely 75,546, is also assigned to IBM. The median observation, however, is much lower with only 58 patents, and the 99th percentile amounts to 19,555 patents, which is still far below the maximum value. Only five organizations have no granted patents that were applied for by the maximum year of investment of their CVC unit. Four of these are active in Computer Programming, Data Processing, And Other Computer Related Services (sic_3=737), all of which invested prior to 2003.
Evidently, the distributions of both patent and forward citation counts are skewed. For this reason, the variables will be log-transformed. Unsurprisingly, the data on the number of patents and forward citations further show that firms, but also industries, differ in their patenting activity. Whether these differences are significantly related to the structure of the CVC unit will be examined in our model.
The share of backward self-citations in total backward citations (share_bsc_app_cum) is also positively skewed, which is the reason why its natural logarithm will be employed in the model. While the mean lies around 5%, the organization Merck & Co, active in Pharmaceutical Preparations (sic_4=2834), has the highest share of backward self-citations across its 4,757 applied-for (and later granted) patents, amounting to more than 32%. In plain words, this means that 32% of Merck & Co’s backward citations refer to its own patents. However, it is worth noting that in the Drugs industry (SIC code 283), the mean share (9.3%) is higher than in both 367 and 737, with a mean of 5.7% and 2.5% respectively (see Appendix I for an overview based on 3-digit SIC codes), i.e. the CVC units active in the Drugs industry (from our sample) have a relatively high share of self-citations.
With regard to the standard deviation of the number of patents in different technology classes (sd_tot_uspc_app), the mean of approximately 39.7 is much higher than the median of 7.6, and the distribution is skewed to the right. The organizations with a standard deviation in the top 5% hold a minimum of 3,868 patents and include IBM, Texas Instruments Inc, Advanced Micro Devices, Merck & Co and Pfizer, thus coming from all three industries. IBM is also the company with the highest number of distinct USPC classes (cum_dist_uspc_app). There are nine organizations which only patented in one USPC main class, mostly from software and programming. However, most of them have a very low number of patents (maximum of 5), indicating a correlation between the two variables, which will be examined in sub-section b).
The control variables show a lower degree of skewness (see Appendix H), with the exception of the number of investments (num_investments_tot), total estimated equity invested (equity_est_firmname_tot) and the number of distinct USPC classes (cum_dist_uspc_app), which are all skewed to the right. With a total number of 1,551 investments and an equity estimate of USD 8,472 million, the most active corporate investor is by far Intel Corp, which invests through its external CVC unit Intel Capital Corp. The second biggest investor is Johnson & Johnson, a pharmaceutical company investing through an external CVC unit, with 350 investments and hence far fewer in total. Half of the investors in the sample have made 5 or fewer investments. The median for invested equity is also much lower than its mean, namely USD 21.24 million as opposed to USD 138 million. Interestingly, this implies that most of the observed CVC activity is performed by a few actors. Specifically, almost 75% of all CVC activity in terms of number of investments (and 77% in terms of equity invested) is performed by merely 10% of the investment units.
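The concentration figure quoted above (the share of activity attributable to the top 10% of units) can be reproduced with a few lines of code. The following is only a minimal sketch: the function name `top_share` and the toy investment counts are hypothetical and are not taken from the dataset.

```python
import math


def top_share(values, top_fraction=0.10):
    """Share of the total accounted for by the largest `top_fraction` of observations."""
    ranked = sorted(values, reverse=True)
    # Number of units in the top group; at least one unit is always included.
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical investment counts for 20 CVC units (toy data, not the thesis sample).
toy_investments = [500, 120, 40, 12, 8, 5, 5, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]

share = top_share(toy_investments)
print(f"Top 10% of units account for {share:.1%} of all investments")
```

Applied to num_investments_tot and equity_est_firmname_tot, the same calculation would yield the shares reported in the text, provided the top group is defined in the same way.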
In summary, it becomes evident that in the model, the natural logarithm should be employed for all independent variables, namely cum_fc_g, cum_patents_app, share_bsc_app_cum and sd_tot_uspc_app, due to positive skewness. With regard to the control variables, taking the natural logarithm of num_investments_tot, equity_est_firmname_tot and cum_dist_uspc_app is meaningful. For the remaining variables, however, a mean has already been taken while collapsing the dataset as described in section 5.1.4. Hence, even though they are slightly skewed, transforming those variables would eliminate too much of the variance, and we thus choose not to transform them to maintain their integrity.
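To illustrate why the log transformation helps, the sketch below compares the moment-based skewness of a right-skewed toy variable before and after taking the natural logarithm. The data and the helper names `skewness` and `ln_or_missing` are hypothetical. Since several variables have a minimum of 0 and the text does not state how zeros are handled, the sketch assumes that they are treated as missing values.

```python
import math
from statistics import mean


def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ln_or_missing(value):
    """Natural log; non-positive inputs become None (assumption: treated as missing)."""
    return math.log(value) if value > 0 else None


# Hypothetical right-skewed patent counts (toy data, not the thesis sample).
toy_counts = [0, 3, 12, 58, 140, 900, 75000]

logged = [ln_or_missing(v) for v in toy_counts]
logged_observed = [v for v in logged if v is not None]

print("skewness before log:", round(skewness(toy_counts), 2))
print("skewness after log: ", round(skewness(logged_observed), 2))
```

In this toy example, the single extreme value dominates the raw distribution, whereas the logged values are spread far more evenly.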
In a next step, we will look at how the variables which we want to insert in our model (transformed as previously described) are related. High correlations indicate potential multicollinearity, which could lead to an imprecise estimation of the partial effects in the form of a large sampling variance of the regression coefficients (Stock & Watson, 2015). This implies that it is difficult to disentangle the effects of the different predictors. Changing the set of predictors is a possible solution to multicollinearity problems (Stock & Watson, 2015).
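The link between correlated predictors and a large sampling variance can be made explicit with the standard textbook result for ordinary least squares under homoskedasticity. It is shown here purely as an illustration and is an assumption with respect to our own model, which is estimated by maximum likelihood. The symbols are introduced for this illustration only: \sigma^2 is the error variance and R_j^2 is the R-squared from regressing predictor x_j on all other predictors.

```latex
\operatorname{Var}\left(\hat{\beta}_j\right)
  = \frac{\sigma^2}{\left(1 - R_j^2\right) \sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2}
```

With only two predictors, R_j^2 equals the squared correlation between them. A correlation of 0.93, for instance, gives 1 / (1 - 0.93^2) \approx 7.4, i.e. a sampling variance roughly seven times larger than with uncorrelated predictors.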
Correlations between all of the independent and control variables are shown in Table 16.
Table 16: Full correlations matrix
      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)      (10)     (11)     (12)
(1)    1.0000
(2)    0.6608   1.0000
(3)    0.8194   0.7457   1.0000
(4)    0.7880   0.7169   0.9322   1.0000
(5)    0.7349   0.5880   0.7753   0.6364   1.0000
(6)    0.1787   0.2331   0.3547   0.2769   0.2105   1.0000
(7)    0.1225   0.1992   0.3688   0.2534   0.1920   0.8835   1.0000
(8)   -0.0284  -0.1478   0.0174  -0.1245  -0.1948   0.0189  -0.0128   1.0000
(9)   -0.0152   0.0152   0.0418   0.0019  -0.0244   0.0748  -0.0917   0.1856   1.0000
(10)  -0.0060   0.0268   0.0214   0.0713   0.0264   0.0709   0.0880  -0.1161   0.0077   1.0000
(11)  -0.0232  -0.1215   0.0376  -0.0843   0.0413   0.1455   0.0940  -0.1278   0.1219  -0.1159   1.0000
(12)  -0.0665  -0.0308   0.0468  -0.0330   0.0503   0.0454   0.0837  -0.1325   0.1036  -0.0134   0.5118   1.0000
Note. Variables are denoted as follows: (1) cum_fc_g_ln, (2) share_bsc_app_cum_ln, (3) cum_patents_app_ln, (4) sd_tot_uspc_app_ln, (5) cum_dist_uspc_app_ln, (6) num_investments_tot_ln, (7) equity_est_firmname_tot_ln, (8) same_sic_proportion_mean, (9) same_nation_proportion_mean, (10) comp_age_avg_mean, (11) num_coinvestors_round_mean, (12) num_corpinv_round_mean
As shown, there are several very highly correlated variables. Each of those pairs will be discussed separately below.
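As a cross-check of which pairs stand out, the lower triangle of Table 16 can be scanned programmatically. This is a minimal sketch: the cut-off of 0.7 is an assumption chosen for illustration (no fixed threshold is applied in our analysis), and the names `TABLE_16_LOWER` and `flag_pairs` are hypothetical.

```python
# Lower triangle of Table 16; row i holds the correlations of variable (i)
# with variables (1)..(i), the last entry being the diagonal.
TABLE_16_LOWER = [
    [1.0000],
    [0.6608, 1.0000],
    [0.8194, 0.7457, 1.0000],
    [0.7880, 0.7169, 0.9322, 1.0000],
    [0.7349, 0.5880, 0.7753, 0.6364, 1.0000],
    [0.1787, 0.2331, 0.3547, 0.2769, 0.2105, 1.0000],
    [0.1225, 0.1992, 0.3688, 0.2534, 0.1920, 0.8835, 1.0000],
    [-0.0284, -0.1478, 0.0174, -0.1245, -0.1948, 0.0189, -0.0128, 1.0000],
    [-0.0152, 0.0152, 0.0418, 0.0019, -0.0244, 0.0748, -0.0917, 0.1856, 1.0000],
    [-0.0060, 0.0268, 0.0214, 0.0713, 0.0264, 0.0709, 0.0880, -0.1161, 0.0077, 1.0000],
    [-0.0232, -0.1215, 0.0376, -0.0843, 0.0413, 0.1455, 0.0940, -0.1278, 0.1219,
     -0.1159, 1.0000],
    [-0.0665, -0.0308, 0.0468, -0.0330, 0.0503, 0.0454, 0.0837, -0.1325, 0.1036,
     -0.0134, 0.5118, 1.0000],
]


def flag_pairs(lower, threshold):
    """Return (column, row, rho) for every off-diagonal entry with |rho| > threshold."""
    flagged = []
    for i, row in enumerate(lower, start=1):
        for j, rho in enumerate(row[:-1], start=1):  # row[:-1] skips the diagonal
            if abs(rho) > threshold:
                flagged.append((j, i, rho))
    return flagged


# Assumed cut-off of 0.7, for illustration only.
for j, i, rho in flag_pairs(TABLE_16_LOWER, threshold=0.7):
    print(f"({j}) and ({i}): rho = {rho:.4f}")
```

With this cut-off, the scan returns eight pairs, all of which involve variables (1) to (5) or the pair (6) and (7).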
Firstly, it is evident that (3) cum_patents_app_ln, the cumulative number of applied-for patents, is highly correlated with many variables, namely (1) cum_fc_g_ln (𝜌 = 0.8194), (2) share_bsc_app_cum_ln (𝜌 = 0.7457), (4) sd_tot_uspc_app_ln (𝜌 = 0.9322) and (5) cum_dist_uspc_app_ln (𝜌 = 0.7753). We used that variable to measure absorptive capacity, as explained in section 0. As such high correlations would make it difficult to separate the effects of these predictors in the model, we must omit one or more variables. In this case, we choose to omit variable (3), cum_patents_app_ln, as it has high correlations with all other independent variables. This means that we will be unable to draw conclusions on the theorized effects for absorptive capacity.30
30 While this thesis will not conclude on findings for absorptive capacity, it should be mentioned that the very high correlation its proxy variable has with the proxy for value of innovations (forward citations) essentially is evidence that the two variables measure very similar concepts.
Secondly, (1) cum_fc_g_ln and (4) sd_tot_uspc_app_ln are highly correlated (𝜌 = 0.7880). To offset this issue, we transform variable (4) into a binary variable instead (called sd_tot_uspc_app_bin), taking the value 0 for a low standard deviation and the value 1 for a high standard deviation of USPC classes. We base this distinction on the median instead of the mean value of the untransformed variable sd_tot_uspc_app (7.59869) to account for the previously described positive skewness. This transformation also addresses the high correlation between (2) share_bsc_app_cum_ln and (4) sd_tot_uspc_app_ln (𝜌 = 0.7169).
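The median split described above can be sketched as follows. The toy values and the function name `median_split` are hypothetical. Coding observations exactly at the median as 0 is an assumption, as the text does not specify how ties are treated.

```python
from statistics import median


def median_split(values):
    """Code each value as 1 if it lies above the sample median, else 0.

    Assumption: values exactly equal to the median are coded 0.
    """
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical standard deviations of patents across USPC classes (toy data).
toy_sd = [0.0, 1.2, 3.5, 7.6, 20.4, 81.7, 680.4]

toy_bin = median_split(toy_sd)
print(toy_bin)
```

Because the cut-off is the median rather than the mean, the extreme value in the toy data does not shift the threshold, which is the rationale given above for the skewed variable sd_tot_uspc_app.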
Thirdly, there is a high correlation between the independent variable (1) cum_fc_g_ln and the control variable (5) cum_dist_uspc_app_ln (𝜌 = 0.7349). As a consequence, we decide to exclude the number of distinct USPC classes as a control variable. While we previously argued that the standard deviation of USPC dispersion is only meaningful in conjunction with this variable, the high correlation with cum_fc_g_ln implies that the model effectively remains sufficiently specified.
Fourthly, the variables (6) num_investments_tot_ln and (7) equity_est_firmname_tot_ln show a high correlation (𝜌 = 0.8835). Essentially, both control for the extent to which the magnitude and scope of the CVC activity are related to the set-up as an internal or external unit. Based on running a maximum-likelihood regression with each of the two as the sole independent variable separately (see Appendix J), we deem (6) num_investments_tot_ln the more apt control variable of the two and eliminate (7) equity_est_firmname_tot_ln.
Lastly, even though the correlation between (11) num_coinvestors_round_mean and (12) num_corpinv_round_mean is acceptable (𝜌 = 0.5118), we transformed num_corpinv_round_mean into a binary variable as well. We consider this meaningful because, for the purposes of this analysis, the decision to co-invest with another corporate investor (who also has a strategic interest) matters more than the actual count of corporate co-investors. The variable num_coinvestors_round_mean includes both corporate and other investors and is hence sufficient to control for the number of other investors participating in the investments. Consequently, a binary variable, corp_co_invest, is introduced, which takes the value 1 if the number of corporate co-investors is at least one and the value 0 if the CVC unit does not invest alongside other corporate investors.
The final correlations matrix of model input variables with transformed and adjusted variables is shown in Table 17.
Table 17: Final correlations matrix of model input variables
      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)
(1)    1.0000
(2)    0.6608   1.0000
(3)    0.6642   0.6442   1.0000
(4)    0.1787   0.2331   0.2803   1.0000
(5)   -0.0284  -0.1478  -0.0924   0.0189   1.0000
(6)   -0.0152   0.0152   0.0077   0.0748   0.1856   1.0000
(7)   -0.0060   0.0268   0.0243   0.0709  -0.1161   0.0077   1.0000
(8)   -0.0232  -0.1215  -0.0864   0.1455  -0.1278   0.1219  -0.1159   1.0000
(9)    0.0636   0.1204   0.1404   0.4895  -0.0413   0.1332   0.0639   0.3729   1.0000
Note. Variables are denoted as follows: (1) cum_fc_g_ln, (2) share_bsc_app_cum_ln, (3) sd_tot_uspc_app_bin, (4) num_investments_tot_ln, (5) same_sic_proportion_mean, (6) same_nation_proportion_mean, (7) comp_age_avg_mean, (8) num_coinvestors_round_mean, (9) corp_co_invest
As can be seen, some pairs of variables still have correlation coefficients greater than 0.60. As we do not encounter perfect multicollinearity in the model, and since the correlations are not too high for the separate variables to be meaningful, the correlations are deemed acceptable.31