5. Methodology
5.1. Data collection
- The name of the parent organization (organization)
- The unique ID for each parent organization (idorganization)
- The industry classification of the parent organization (sic_4)
- The source of the parent organization (organization_source)
- The structure of the corporate investor / CVC unit (subsidiary)

Of these data, firmname, firmnation, year_inv_min and year_inv_max were sourced from the Thomson One Banker database; organization and sic_4 were searched manually and matched with Compustat where possible, as indicated in organization_source; the variables idinvestor, idorganization and subsidiary were assigned by the authors.

5.1.2. Patent data
Firstly, as the patent database covers several patent types (there are also design patents, for example), only utility patents were kept. Secondly, we merged in the citation data by collapsing18 the data at the level of the patent ID. The technical details are as follows:
- Count of backward and forward citations: Based on citation information of PatentsView, we counted citing (forward citations) as well as cited (backward citations) patents per patent ID.
More simply: for each patent, the number of other patents that it cites and the number of other patents that cite it were counted.
- Count of backward and forward self-citations: If the assignee ID of the citing patent equals the assignee ID of the cited patent, it was counted as a self-citation for the cited patent (and vice versa). More simply: for any given patent of an organization, we counted the number of citations to this patent (and from this patent, respectively) made by other patents of the same organization. A sketch of these counting steps follows this list.
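As a hedged illustration of the collapse-based counting, the following Stata sketch assumes a citation file with one observation per citation pair; the dataset and variable names (citations.dta, citing_patent_id, cited_patent_id) are placeholders rather than the exact PatentsView field names.

* Sketch: count forward and backward citations per patent (placeholder names).
* citations.dta holds one observation per pair: citing_patent_id cites cited_patent_id.
use citations, clear
gen n = 1
collapse (sum) forward_citations = n, by(cited_patent_id)
rename cited_patent_id patent_id
save fc_counts, replace

use citations, clear
gen n = 1
collapse (sum) backward_citations = n, by(citing_patent_id)
rename citing_patent_id patent_id
merge 1:1 patent_id using fc_counts, nogenerate
* Missing counts after the merge correspond to zero citations of that type.
* Self-citations: repeat the same collapses after keeping only the pairs in
* which the citing and cited assignee IDs coincide.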
In the third iteration of the sample construction, the sample resulting from the second iteration was linked to the patent database through the name of the parent organization. In general, the USPTO does not assign a unique organization ID for each individual firm in patent filings: As organizations use different names or abbreviations, and names frequently contain spelling errors in patent filings, it is difficult to retrieve all patent information belonging to a specific firm – a problem widely recognized in patent-related research (e.g. Hall, Jaffe, & Trajtenberg, 2005). To mitigate this problem, PatentsView uses a disambiguation algorithm to assign unique IDs for each organization (the assignee ID, as described above). As our data sample contains organization names only, we needed to link the sample to the PatentsView database via the organizations’ names in order to derive the unique assignee ID used by PatentsView. This would then enable us to retrieve an exhaustive list of patents assigned to the organization, including those where the assignee’s name was spelled differently.
As there is no unambiguous common identifier between the two datasets, owing to potential spelling differences in organization names, standard merging using merge in Stata is impossible. Therefore, a probabilistic record linkage was performed with the Stata command reclink2 (Wasi & Flaaen, 2015). Generally, and thus in probabilistic linkage as well, Stata matches pairs correctly only if the formatting in both datasets is consistent (i.e. the names must resemble each other to some degree): In a first step, we hence capitalized all organization names in our sample as well as in the established patent database.

18 The citation database shows the cited patent, the citing patent, and a citation date. The Stata command collapse allows us to count citations by reporting frequencies of observations per patent ID. This was employed both for cited and citing patents.
In a second step, the Stata command reclink2 was used to derive the best matches for each parent company (organization) of the 1,089 different CVC units. The reclink2 command computes a score from zero to one based on the degree of similarity between the two names. However, the highest-scoring name is not always the correct one. To capture correct matches that do not yield the highest score whilst keeping the number of incorrect matches to a minimum, we set the number of matches to three, as recommended by Wasi and Flaaen (2015). This means that for each name, a list of three candidate matches was presented, ranked by their degree of similarity with the organization name.
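The following Stata sketch illustrates both steps under stated assumptions: the file and variable names (cvc_sample.dta, patent_assignees.dta, assignee_id) are placeholders, and the options follow the reclink2 syntax documented by Wasi and Flaaen (2015); the command is user-written and must be installed separately.

* Step 1 (sketch): harmonize formatting by capitalizing names in both files.
use patent_assignees, clear            // PatentsView side: assignee_id, organization
replace organization = upper(organization)
save patent_assignees_clean, replace

use cvc_sample, clear                  // sample side: idorganization, organization
replace organization = upper(organization)

* Step 2 (sketch): probabilistic linkage, keeping the three best-scoring
* candidate matches per parent organization.
reclink2 organization using patent_assignees_clean, ///
    idmaster(idorganization) idusing(assignee_id)   ///
    gen(matchscore) npairs(3)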
In a third step, a clerical review of the reported matches was performed. This manual review was used to address and correct four identified issues:
1) Pair-similarity employed by reclink2 is an imperfect metric, as the highest score does not necessarily equal the correct match (Wasi & Flaaen, 2015). Out of the three matches, we manually chose and retained the match that indeed equalled the parent organization. This included, for example, cases in which the organization name contains a common ending such as HOLDING. For instance, "AB Holding" is more likely to be matched with "XY Holding" than "Alpha Beta", which could be the real parent organization.
2) The PatentsView disambiguation algorithm did not capture all versions of the
organization names’ spelling. In case of name ambiguity, i.e. multiple assignee IDs per organization, all matches were kept.
3) For merged, acquired or renamed companies, we adjusted the name to the parent organization that applied within the investment period. Multiple lines of observations were created in case of overlaps.
4) In a few cases, the same assignee ID was used for different organizations (due to minor flaws in the disambiguation algorithm). To eliminate this error source, these were removed from the sample; a sketch of this filter follows the list.
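The filter behind issue 4 can be sketched in Stata as follows, again with placeholder variable names: assignee IDs that map to more than one organization are identified and dropped.

* Sketch: remove assignee IDs shared by different organizations (issue 4).
bysort assignee_id idorganization: gen byte first = (_n == 1)
bysort assignee_id: egen n_orgs = total(first)   // distinct organizations per assignee ID
drop if n_orgs > 1
drop first n_orgs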
After linking both datasets, we replaced PatentsView's assignee ID with the unique organization ID in our sample (idorganization) to clearly identify different organizations in the dataset, including those with multiple assignee IDs. On the organizational level, we created the following variables with regard to granted patents, both as of the application and the grant date (marked in variable names by the suffixes _app and _g, respectively):
- Cumulative number of granted patents in any given year (cum_patents)19
- Cumulative sum of forward citations (cum_fc), forward self-citations (cum_fsc), backward citations (cum_bc) and backward self-citations (cum_bsc) of granted patents in any given year
- Number of distinct USPC classes of patents up to any given year (cum_dist_uspc)
- The standard deviation of the dispersion of patents across different USPC classes up to any given year (sd_tot_uspc). This measure hence takes patent dispersion (i.e. how many patents were filed in each USPC class) into account. It is calculated by counting the total number of patents per distinct USPC main class up to any given year and then computing the standard deviation of that patent count across classes.20 A sketch of the construction of the cumulative measures follows this list.
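A minimal Stata sketch of the cumulative counts, assuming a patent-level file (org_patents.dta, a placeholder) with the variables idorganization, year, fc and bc (the forward and backward citation counts derived above):

* Sketch: organization-year panel with running totals (placeholder names).
use org_patents, clear                 // one observation per granted patent
gen n = 1
collapse (sum) n fc bc, by(idorganization year)

* Cumulative totals up to any given year, within each organization.
bysort idorganization (year): gen cum_patents = sum(n)
bysort idorganization (year): gen cum_fc = sum(fc)
bysort idorganization (year): gen cum_bc = sum(bc)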
This third iteration reduced our sample by the CVC units with parent organizations to which we could not assign patent information and resulted in our final sample. In total, the parent organizations of 706 corporate investors could be matched with the patent database, out of which 34 observations (4.82%) were identified manually (i.e. in the case of merged, acquired or renamed companies, or no correct reclink matches at all). Out of the 706 investors, 496 (70.25%) are internal CVC units and 210 (29.75%) operate as a subsidiary. With regard to validation levels, only 37 observations (5.24%) could not be second-source validated (level 0). A summary of the three iterations to deduce the final data sample can be seen in Figure 3.

19 One might argue that this is a total number up until any given year. However, we use the word cumulative (which is not incorrect) to enhance the understanding of the difference between this variable and the variable used in the analysis later.

20 Suppose an organization has $(x_1, x_2, \ldots, x_N)$ assigned patents in $N$ distinct main USPC classes at a certain point in time, resulting in a mean number of patents per USPC class $\bar{x}$. Then, sd_tot_uspc was calculated as $s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$.
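To make the footnote 20 measure concrete, here is a hedged Stata sketch for a single organization at a given point in time; the input file and variable names (org_patents_upto_year.dta, uspc_main) are placeholders:

* Sketch: sd_tot_uspc for one organization-year (placeholder names).
* Input: one observation per granted patent up to the given year,
* with its USPC main class stored in uspc_main.
use org_patents_upto_year, clear
gen n = 1
collapse (sum) n, by(uspc_main)        // patents per distinct USPC main class
summarize n                            // r(sd) holds the sample standard deviation s
display "sd_tot_uspc = " r(sd)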
Evidently, the share of internal units rose with each iterative step. This can largely be explained by the following: many names suggested external units but, upon validation, proved not to be external (i.e. companies choose "externally-sounding" names even though the unit is, in fact, internal).
Figure 3: Summary of data sample iterations