Identifying couples in administrative data

J Labour Market Res (2017) 50:29–43
DOI 10.1007/s12651-017-0218-4

ARTICLE

Deborah Goldschmidt¹ · Wolfram Klosterhuber² · Johannes F Schmieder³,⁴,⁵,⁶,⁷

Accepted: 10 January 2017 / Published online: 15 May 2017
© The Author(s) 2017. This article is an open access publication.

Johannes F Schmieder
johannes@bu.edu

1 Analysis Group, New York City, USA
2 IAB, Nuremberg, Germany
3 Department of Economics, Boston University College of Arts & Sciences, 270 Bay State Road, Boston, Massachusetts 02215, USA
4 NBER, Cambridge, USA
5 IZA, Bonn, Germany
6 CESifo, Munich, Germany
7 CEPR, London, United Kingdom

Abstract We develop a new method for identifying married couples in administrative data. Using address and name data from the universe of employment records in Germany we find around 3.3 million pairs of individuals who are living at the same location, have a matching last name and are less than 15 years apart in age. We show supporting evidence that around 89 to 94% of these pairs are indeed married couples and provide careful consistency checks. Using information from the German Microcensus, we show that our method identifies about 17% of all married couples in Germany and about 35% of couples where both spouses are in social security covered jobs or unemployed. In ongoing work this couple identifier will be made available to the research community and users of the IAB administrative data. Our method thus opens the door for household level analyses benefiting from the precision and very large number of observations available in administrative data.

Keywords Couples · Geocoding · Administrative data · Household analysis

JEL Code J12

1 Introduction

Recent years have witnessed a dramatic rise in the use of administrative data in economic research, facilitated by increases in computing power and the availability of new administrative data sources. The main advantages of administrative data have been large sample sizes compared to survey data, often covering the entire universe; the ability to follow the units of observation over time; and the high quality of recorded information. This shift has been particularly forceful in Labor and Public Economics, where the availability of individual level employment and tax records has led to the rise of new research designs such as regression discontinuity, regression kink or bunching designs that rely on very large sample sizes. While administrative data offer many advantages, they also come with limitations, and the scope of available variables is often quite limited compared to household surveys. In particular, administrative employment records are typically on the individual level only and it is often not possible to link individuals to other household members.[1] For this reason, administrative data have played a smaller role in studying traditional questions in labor economics, such as household labor supply, household investment decisions in human capital or within household income differences.

[1] While some countries do allow for linking households in their administrative registry data, resulting in exciting and influential work, these countries tend to be relatively small and geographically clustered, such as Austria (Frimmel et al. 2014) or the Scandinavian countries (e.g. Hardoy and Schøne 2014 or Huttunen and Kellokumpu 2016). Expanding the scope of administrative data to other countries will be very valuable to study household behavior in new contexts.

In this project we develop a new method to impute household identifiers in the administrative employment records data in Germany to increase the scope of research questions that can be addressed. Our approach is to identify pairs of individuals who are, with a high probability, married couples using information on addresses, family names and dates of birth. In Germany it is still very common that at the time of marriage one spouse (in the vast majority of cases the wife) adopts the other spouse's last name, either fully or as part of a double name. If two individuals with matching last names are living together at the same address, they are likely related, though they could also be in a sibling or parent-child relationship. To further narrow it down to married couples we take pairs of a woman and a man with matching last names with an age difference of less than 15 years, which should exclude most parent-child relationships. We present a detailed analysis of the likely extent of errors when applying this method. The new identifiers for married couples will be made available to external researchers and users of the IAB administrative datasets, facilitating a broad range of possible research projects that rely on household/couple identifiers. We return to this in the conclusion.

Germany has a long tradition of women taking on their husbands' last name at the time of marriage. The German Civil Code from 1896 unequivocally required that the wife takes on the name of her husband. A reform in 1953 allowed for the wife to keep her birth name as part of a double (or hyphenated) last name, but she was still required to take on her husband's name as the family name. The family name law was revised again in 1970, allowing a couple to decide to take on the wife's name as the family name, but it kept the requirement of a common family name for both spouses. Furthermore, if a couple could not come to an agreement with respect to which name would become the family name, the decision was up to the husband. This only changed with a decision by the German constitutional court in 1991 and a subsequent revision of the family name law in 1994, after which both spouses were allowed to keep their own birth names, while the traditional option of taking on one of the birth names or a hyphenated double name for one of the spouses continued to exist.[2] In practice it appears that it is still the case that the vast majority of women take on their husband's name either fully or at least as part of a double name. While we are not aware of representative surveys or official registry data for Germany that would allow us to calculate the share of couples with matching last names, we found various press reports from city level wedding registries that seem to suggest that even among newly wedded couples around 85 to 90% still have matching last names.[3] Among couples married for longer (and in particular before 1994), the ratio is likely significantly higher.

[2] See Sperling (2012) for a discussion of the legal history of the family name law in Germany.

[3] All-in (2006) report that in Kempten in 2006 around 14% of newly married couples keep separate names. Janisch (2010) reports that a small survey among marriage registries in several German cities found that around 10 to 20% of couples keep separate names. This also seems to refer to newly married couples, which suggests that the ratio of couples with separate names among the pool of existing couples is likely much lower.

We implement the method of identifying likely couples using last names, addresses and age using a cross-section of the administrative data from the Institute for Employment Research (IAB) in Germany spanning the universe of employment and unemployment records for 2008. This data, called Integrated Employment Biographies or IEB, covers all individuals who are employed in employment subject to social security contributions, receive benefits from the unemployment insurance (UI) system, or who are registered as job seekers. This data covers around 80% of employees, in particular excluding public servants and the self-employed. By design we are only able to identify married couples where both spouses are covered in the IEB. While this is certainly not a representative sample and excludes a sizable part of the population of couples, we are still able to identify over 3 million couples who are likely married to each other.

The two main concerns with this approach are the potential for false positives and false negatives. False positives may arise because people with matching last names may live at the same address either purely by chance, or because they are related to each other but not married. Using the distribution of same-sex matching name pairs, as well as information on family status for a subset of individuals, we show that likely around 88–94% of our sample of couples are indeed married to each other. Even if both spouses of a married couple are in our data, false negatives may arise, because we may not match them to each other. Either they do not have matching names or there are more than 2 matching individuals at a location, making it impossible to tell who is married to whom. False negatives will also arise whenever one or both members of a marriage are not covered in the IEB data, which for example would include all self-employed, public servants or individuals not in the labor force, but also all individuals older than age 65. Using information from the Microcensus, we show that we can identify roughly 20% of the 19 million married couples in Germany. Furthermore, we identify about one third of married couples where both individuals are covered by the IEB data (i.e. working in a social security covered job or unemployed). We compare observable characteristics of our matched couples with the official Microcensus data to show how our sample differs from the general population of married couples. While the representativeness of the matched couples is clearly limited, many research questions do not rely on having a representative sample. The large number of observations and the possibility to observe complete employment histories in the IAB data should make this data a valuable tool for many research projects. We will return to a discussion of how this data can be used in the conclusion.

This paper is related to other research that uses the special features of administrative data to impute information that is not directly available. For example, Jacobson, Lalonde and Sullivan (1993) use the combination of individual and firm identifiers in UI records from Pennsylvania to impute plant closings and mass-layoffs by observing when large numbers of individuals are moving away from firm identifiers and are scattered across many other employers. Hethey-Maier and Schmieder (2013) use a similar approach to identify new plant openings in administrative data, relying on worker flow information to distinguish plant openings from spurious changes in firm identifiers. Goldschmidt and Schmieder (2015) identify outsourcing of labor services in large firms employing an algorithm based on a combination of worker flows, industry and occupation codes.

The next section describes the data used in this project. Sect. 3 describes our method for identifying couples and presents the results based on individuals in 2008. In Sect. 4 we show supportive evidence that our method does in fact largely identify married couples and develop bounds on the fraction of false positives. We then present characteristics of the couples that we identify with our method and compare them to the general population in the German employment data, as well as to other data sources. Sect. 5 concludes.

2 Data sources

In this section, the sources of the data are explained in detail. Sect. 2.1 describes the Integrated Employment Biographies (IEB) data, while the geocoded location data and the individual name data are discussed in Sects. 2.2 and 2.3.

2.1 Integrated employment biographies

The IEB of the Institute for Employment Research stems from the notification process of the social security system of the Federal Employment Agency (BA). The IEB is the basis for most of the widely used research datasets provided by the IAB to the research community, such as the SIAB data, the LIAB data, the BHP and many others. The IEB consolidates completed, historicized and edited process data from different data sources, which come from different operative systems.[4] It comprises all persons registered with the Federal Employment Agency due to the following:

● Employment subject to social security or marginal part-time employment.
● Receipt of unemployment insurance benefits in accordance with Social Code Book II or III.
● Job search registered with local employment agencies.
● Planned or actual participation in an employment or training program.

[4] See Schild and Antonio (2014) p. 3.

The IEB includes demographic variables such as nationality, birthdate, gender, and education. Information on employment, benefit receipt and job search includes daily wage, daily benefit rate, occupational and employment status or economic activity. Additionally, location data such as place of residence or work on different aggregated levels are provided. There were around 35 million working individuals in Germany in 2008 (own calculations based on Microcensus data), about 80% of whom have at least one record in the IEB. The biggest groups which are not included in the biographies are self-employed workers and public servants (Beamte).

We also have information on family status (married, living alone, single parent, cohabiting), but only for the subset of individuals who are unemployed and registered as job seekers. We use this information in Sect. 4 for various consistency checks.

2.2 Geocoded data

Our method relies on finding individuals living at the same location. In principle individuals can be matched to other individuals at the same location either by directly comparing addresses, or by first geocoding addresses into latitude/longitude coordinates and then comparing coordinates. Matching addresses directly is complicated by the fact that these can often be written in a variety of ways and need to be carefully cleaned. We instead match individuals on geographic coordinates, where the address processing was done using GIS software, which allows for careful error correction methods. The geocoding was done in a project between the Research Data Centre (FDZ) and the University of Duisburg-Essen for a cross-section of all individuals in the IEB data as of June 30th, 2008.[5] This project used data from the Federal Agency for Cartography and Geodesy, which includes 22 million addresses of German buildings and their geographic coordinates, and it was possible to successfully geocode 94.6% of the IEB records. Individuals whose addresses are not geocoded were dropped from the data and are not used in the further analysis.

[5] See Scholz et al. (2012). That paper is based on geocoded data from 2009, but 2008 was also geocoded as part of the same project. We decided to use 2008 as a baseline to allow for more analysis years after the couples are identified, which seemed useful for many possible research questions. In the future we hope to expand the procedure to more years.

2.3 Names

One of the criteria that we use for determining couples is whether the last names of two people match. We therefore also obtained data on last names covering the universe of individuals who have a record in the IEB as of June 30th, 2008. In order to improve the probability of success in matching, we first clean the names of errors and typos, and ensure consistency in terms of special characters and titles. With the support of the German Record Linkage Centre (GermanRLC) and their algorithm, the names of the individuals were cleaned, taking into account certain patterns and potential discrepancies.[6] Umlauts were substituted (ä → ae and so forth), as well as ß → ss. All blank spaces in the front, middle or end of the name were removed. Professional and nobility titles (such as Dr., Prof., Freiherr von) were removed as well, and special characters (e.g. ~ or %) and non-ASCII characters (e.g. © or ™) were deleted.

[6] See for example Schild and Antonio (2014) p. 4ff.

The only special character that was retained is the hyphen (-), which is used to indicate double names. While the family name law in the civil code, which states that a spouse can add their birth name to the family name, does not specifically mention a hyphen, in practice this appears to be the only option. In fact a court decision from 2013 specifically ruled that a couple was not allowed to combine the birth names of two spouses without a hyphen (Kammergericht Berlin 2013). Furthermore, individuals are not allowed to create last name chains that involve more than one hyphen (for example if at the time of marriage an individual already has a double name from a previous marriage). We thus assume that double names are always separated by a hyphen and we describe below how we use hyphenated names in our name-matching algorithm. At the end of the cleaning process all letters were converted to upper case.

Although individuals have a consistent personal identifier, the Einheitliche Statistische Person (ESP), the last name may vary across different data sources. If, after the name cleaning process was completed, discrepancies persisted in the names across data sources, the individual was dropped. The exception was when an individual had a double last name in one source and an overlapping single last name in another (e.g. MUELLER-MEIER in one source and MEIER in another). In this case, the double last name was kept.

3 Identifying couples

As mentioned previously, although the IEB data consists of a large amount of information on the majority of the German population, it, like many administrative data sets, does not include any information on the household. To circumvent this issue, we combine the IEB data with the geocoded location data and information on names to infer probable married couples. We use the following criteria to ensure that the matches we identify are most likely married couples and not simply two people with some other type of relationship (or no relationship at all):

1. Same home location.
2. Uniquely matching last name.
3. One male, one female, with an age difference of less than 15 years.

We go into more detail on each of these requirements below.

3.1 Location

The first step in identifying potential married couples is finding people who live at the same location, since most married couples live together. We start by looking at the distribution of the number of individuals at a particular location, using each person's geocoded coordinates, for the roughly 33 million people in our data. The second column of Table 1 shows this distribution. Coordinates with a small number of individuals likely represent single-family homes, while coordinates where a larger number of individuals live are likely apartment buildings or other multi-unit residences. About 5 million individuals live alone at a coordinate; we eliminate these people from our set of potential couples, leaving us with about 28 million individuals. About 7.4 million individuals live at a location with exactly 1 other person in the dataset; as the number of people living at a coordinate gets larger, the absolute number of people living in this type of residence decreases.

Table 1 Distribution of the number of individuals at the same coordinate

Number of individuals   Total number     Number of individuals   Percent
at coordinate           of individuals   with matched names      matched (%)
1                       4,956,761        –                       –
2                       7,443,038        5,082,600               68.29
3                       4,911,162        1,024,758               20.87
4                       3,061,944        651,742                 21.29
5                       1,998,695        473,896                 23.71
6                       1,589,814        396,944                 24.97
7                       1,345,134        347,244                 25.81
8                       1,154,712        305,390                 26.45
9                       971,325          259,734                 26.74
10                      807,360          219,600                 27.20
11                      673,090          182,466                 27.11
12                      548,928          147,280                 26.83
13                      451,828          120,658                 26.70
14                      366,646          96,724                  26.38
15                      304,245          79,844                  26.24
16                      254,032          66,272                  26.09
17                      209,984          53,700                  25.57
18                      177,840          45,022                  25.32
19                      151,734          37,638                  24.81
20                      131,940          32,064                  24.30
>20                     1,540,207        372,596                 24.19
Total                   33,050,419       9,996,172               30.25

Second column includes all geocoded data as of June 30th 2008. Third column includes all individuals with geocoded location for whom we were able to match according to our name-matching algorithm, described in the text.

3.2 Names

Next, we look at the cleaned names of the individuals living within any given location. We require that our identified married couples share a last name. In situations where any of the people in the location has a hyphenated name, we consider two names to be a match if at least one part of the hyphenated name is identical to another name at the location. In locations with multiple people, we additionally require that a maximum of two people have matching names. Otherwise, we have no way to determine which two individuals are likely to be a couple and which may be unrelated, or related in other ways. The following examples help to clarify the procedure.

In Example 2.1 (Table 2), there are 5 individuals living at a particular coordinate. Two have the last name COHLE, and there are no others named COHLE at this location, so they are kept as a potential match. Two are named HART, with no others named HART, and so they are also kept as a potential match. Finally, there is a single person named MEIER, who is dropped from our potential group of couples. In Example 2.2 (Table 2), we again have 5 individuals living at the same coordinate: three have the last name COHLE, one has the last name HART, and one has the last name HART-MEIER. Because there are more than 2 individuals at this location with the last name COHLE, we cannot be certain which of these are part of a couple and which are not, so we drop all three. Because HART and HART-MEIER share a partial name, even though one is hyphenated, they are kept as a potential match. In Example 2.3 (Table 2), there are again 5 individuals at the same coordinate. Because COHLE, COHLE and COHLE-MEIER all match in terms of their names, we must eliminate all three, since we have no way of knowing which two could really be a couple. Similarly, MEIER, MEIER-MUELLER and COHLE-MEIER must all be dropped, despite their names matching. Therefore, in this example, there is no match chosen.

Table 2 Examples of the name-matching procedure

Number of individuals   Last name       Potential couple
at coordinate
Example 2.1 (a)
5                       COHLE           Match
5                       HART            Match
5                       COHLE           Match
5                       MEIER           No match
5                       HART            Match
Example 2.2 (b)
5                       COHLE           No match
5                       HART            Match
5                       COHLE           No match
5                       COHLE           No match
5                       HART-MEIER      Match
Example 2.3 (c)
5                       COHLE-MEIER     No match
5                       MEIER           No match
5                       COHLE           No match
5                       COHLE           No match
5                       MEIER-MUELLER   No match

These are provided as examples only, and are not taken from the actual data.
(a) matches HART and COHLE are chosen
(b) match HART(-MEIER) is chosen
(c) no match is chosen

After running this algorithm over the 28 million individuals, we are left with about 5 million pairs (ten million individuals) who share a location and last name. The third and fourth columns of Table 1 show the number and percent of people that were matched through this algorithm, organized by the number of individuals at a location. For coordinates with only 2 individuals, almost 70% had matching names. At coordinates with 3 or more people found at the same location, the match rate is between 20 and 30%.

There are several limitations to this criterion. First, while the majority of married couples in Germany share a last name (or part of a double name), not all women (or men) change their last name upon marriage, and we are certain to miss those couples. Second, in locations with multiple people where more than two share a last name, since we cannot be certain which two members are married (if any) we must drop them all, eliminating more potential matches from our sample. Finally, we may be capturing two people with the same last name living at the same coordinate who are related but not married. In addition, particularly in multi-unit residences, there may be two people who are unrelated but have the same last name, and we may erroneously be including them in our sample. Our next criteria, on gender and age, will eliminate some of these falsely matched people from our sample, but not all.

3.3 Gender and age

Finally, we take our set of potential couples (groups of two people who share a last name and a location) and impose gender and age restrictions. Since we are currently only identifying heterosexual couples, we require that each couple be composed of one male and one female, information that is available in the IAB records. The second column of Table 3 presents the gender composition breakdown for the 5 million identified potential couples. More than 4 million of these pairs consist of one male and one female, while the remainder is made up of either two males or two females. We drop the single-sex households and move on to the age difference requirement.

Table 3 Gender composition of matched potential couples

                 All matches               Age difference <15        Age difference ≥15
Matches          Absolute    Percent (%)   Absolute    Percent (%)   Absolute    Percent (%)
Male/female      4,084,516   81.72         3,281,657   94.65         802,859     52.44
Male/male        482,891     9.66          131,550     3.79          351,341     22.95
Female/female    430,679     8.62          53,763      1.55          376,916     24.62
Total            4,998,086   100.00        3,466,970   100.00        1,531,116   100.00

Includes all individuals with geocoded location for whom we were able to match according to our name-matching algorithm, described in the text.

We first look at the distribution of age differences among matched pairs by gender composition. Fig. 1 graphs the distribution of the age difference between the two members of the couple, defining the difference as the man's age minus the woman's age. The majority of the mass lies between –15 and +15. This likely includes the majority of married couples, although it could also include brother-sister pairs (or other closely-aged family members, such as cousins). It may also include some unrelated people who simply live in the same location and have the same last name. There is a smaller mass for pairs where the female is 20–40 years older than the male, which is likely to include mothers living with their sons, and an even smaller mass for pairs where the male is 20–40 years older than the female, which likely includes father-daughter pairs. These parent-child relationships may either be single parents or families where only one of the parents is working in employment covered in the IEB. The fact that there seem to be more mother-son pairs than father-daughter pairs is likely explained by the fact that there are more single mothers than single fathers.

Fig. 1 Distribution of age differences of matches, male/female. (Note: Includes all male-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as man's age minus woman's age)

Figs. 2 and 3 show the age difference distribution for matched pairs with the same gender, where the age difference is defined as the older age minus the younger age. For both of these, the majority of pairs fall between 15 and 40, which is likely to consist mainly of mother-daughter or father-son pairs. There is also some mass for pairs with an age difference of 0–15 years; these may be siblings or other familial relationships, homosexual couples, or other pairs of people who coincidentally have the same last name in the same location. While homosexual couples have been able to form a civil union in Germany since 2001, which allows them to adopt a common family name, these still seem to be relatively rare, with only 34,000 same sex civil unions in 2011 (Statistisches Bundesamt 2012). Thus while a small part of the same sex matches might be same sex couples, most of them are not. The fact that the number of same sex matched individuals in our sample is quite small suggests that there are relatively few cases where people live together with the same last name for other reasons than being married to each other, and that in turn most matched individuals who are living with each other in this age group are in fact married to each other.

Fig. 2 Distribution of age differences of matches, female/female. (Note: Includes all female-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as older age minus younger age)

Fig. 3 Distribution of age differences of matches, male/male. (Note: Includes all male-male pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as older age minus younger age)

For determining our sample of couples, we require that the difference in age of the matched man and woman be less than 15 years. This should eliminate any mother-son or father-daughter pairs from the set of couples. The remaining pairs, consisting of one man and one woman, with matching last names, who live in the same location and are less than 15 years apart in age, make up our final sample. Columns 4–5 of Table 3 show the results when we impose our age difference restriction. We retain 80% of our male-female couples, leaving us with a final sample of about 3.3 million couples. This sample should be primarily composed of true couples, although some share will be "false positives", made up of male-female siblings or family members who are similar in age, or unrelated people with the same name living at the same coordinates.

4 Consistency checks

Errors in our matching algorithm could occur in two ways. First, we have false positives: two people who are matched to each other by our algorithm, but who are not really a married couple. Second, there are couples that we do not pick up with our matching method, for various reasons. We discuss these two issues, and the steps we take to quantify their magnitude, below.

4.1 False positives

One type of error that could occur is when our algorithm matches two people who are not really married to each other, also known as type I error. Pairs in our sample may be wrongly matched if: (1) they are brother and sister, or have some other family relationship, are close in age, and live in the same location; or (2) they are unrelated, but living in a multi-unit residence, such as an apartment building, and happen to have the same last name and are close in age.

We can try to measure the size of this type of error in our final sample of couples in a few ways. First, we can use the distribution of same-sex matches to give us a sense of what share of our sample is wrongly matched if we make the following two assumptions. The first assumption is that opposite-sex family members who are close in age (i.e. brother and sister) are as likely to live together as same-sex family members (two sisters, for example). The second is that it is as likely for two people of the opposite sex who live in the same building to share a last name as it is for two people of the same sex. Using these assumptions, we can look at the number of same-sex matched pairs that fall within our age difference restriction (ages within 15 years of each other), using the numbers provided in Table 3. These couples are likely either pairs of family members living in the same location, or unrelated people with the same last name in the same building. We find that there are 185,313 male/male and female/female pairs that fall within our age restriction. So, it is likely that approximately 185,000 couples in our sample of matched male-female couples with age difference under 15 years are also wrongly matched. In fact, since there are some same-sex civil unions where partners share a family name,[7] this arguably overestimates the number of false positives by a small amount. Using this methodology, our accuracy rate is around 94% (final sample is 3,281,657; estimated wrongly matched is 185,313; correctly matched = 3,281,657 - 185,313 = 3,096,344; accuracy rate = correctly matched/final sample = 3,096,344/3,281,657 = 94%).[8] So, according to this method, only about 6% of our sample is wrongly matched and our sample does indeed identify couples who with a high degree of certainty are married to each other.

[7] Statistisches Bundesamt (2012) states that there were about 34,000 same sex civil unions in Germany in 2011. We do not know how common it is for same sex couples to adopt a common family name, nor that they would both be employed and covered in our data. It appears that due to the small number of same sex civil unions our method for identifying male-female marriages would not work as well for identifying same sex civil unions.

[8] Here we assumed that two opposite sex individuals with matching last names who are not married are equally likely to live together as two same sex individuals, averaging over male-male and female-female pairs. A more conservative assumption would be to assume that opposite-sex pairs that are not married are as likely to live together as male-male pairs, i.e. 2*131,550 = 263,100, leading to an accuracy rate of 92%. We thank an anonymous referee for pointing this out.

We can also use this approach to get a sense of whether the accuracy of matches varies by the number of individuals living at the same coordinate. Intuitively, in large apartment buildings with many units it is more likely to have two individuals with matching last names who are unrelated. Fig. 4 shows the match accuracy by the number of individuals at the same coordinate. The accuracy rate is clearly the highest at coordinates with just 2 individuals, with a match accuracy rate of 95%. At coordinates with 3 individuals the match accuracy rate appears significantly smaller, which may be because these are still likely single family homes with one or more of the children working, which may lead to more male/male or female/female matches.

Fig. 4 Match accuracy calculated based on same-sex matches by number of individuals living at same coordinate. (Note: In this figure we calculate the likely probability that a matched couple is indeed married to each other. For this we assume that the number of matched male/male and female/female couples obtained by using the same algorithm as for matched male/female couples is a proxy for the number of false matches. The accuracy can then be calculated as: [N(f/m) - N(f/f) - N(m/m)]/N(f/m). The figure plots this accuracy rate as a function of the number of individuals who live at the same coordinate where the couple is matched)

Table 4 Family status of individuals in matched couples sample

Family status    Absolute number    Percent among
                 of individuals     non-missing (%)
Living alone     340,722            21.98
Cohabiting       113,153            7.30
Single parent    109,783            7.08
Married          986,480            63.64
Missing          8,446,034          –
Total            9,996,172          –

Includes all individuals who we were able to match by location and name (according to our name-matching algorithm). Only individuals who are registered as job-seekers have the family status variable filled in.

[...] with age difference under 15 years, who are listed as both married or one married-one missing only 9% of the time. Male-female couples with an age difference of 15 years or more are listed as both married or one married-one missing 25% of the time. This could either indicate that there are some married couples with an age difference larger than 15 years, but could also be because these are indeed parent-child relationships where the spouse is not covered in the data (or does not share a last name).

Using the information in Table 5, we can also estimate the share of matches in our final sample that are likely to be true couples and not wrongly matched people (i.e. our "accuracy rate") using the subsample of couples with at least one family status listed. If we think that the family status variable is accurate, then the set of "true" couples in our sample should be 578,088: the number of couples who are listed as either both married or one mar-
For coordinates with ried, on missing family status. Even within these there may more individuals living at the same location the accuracy be individuals who were mistakenly matched. For exam- rate falls slightly but remains above 90% at least until 50 in- ple, there may be a job-seeking man with the last name dividuals at the same coordinate. Past that the number of MUELLER, whose wife is out of the workforce (and hence observations becomes quite small and the estimated accu- is not included in the IEB data), living at the same coor- racy rate becomes quite noisy, though it continues to hover dinates as a similarly-aged jobseeker woman with the last between 85 and 95%. Future researchers may want to re- name MUELLER whose husband is not in the IEB data ei- strict their analysis sample to couples with fewer number of ther. Our matching algorithm would connect these two job- individuals at the same coordinate if they want to maximize seekers, who are both listed as being married, even though the accuracy rate. they are not actually married to each other. If we think Next, we use the “Family Status” variable to perform an that it is as likely for two individuals of the same gender additional check on the validity of our sample. This vari- to be wrongly matched in this way as it is for two oppo- able is available as part of the Jobseeker-History ((X)ASU) site-gender individuals, then we can use the information dataset, and thus is only filled in for a small subset of people on family status for same-sex pairs for our accuracy esti- – those who are registered as job seekers as of June 30th, mate. Specifically, there are 5173 (637 + 4536) same-sex 2008. From our sample of approximately 10 Mio. matched matched pairs with age difference less than 15 years where individuals, about 1.5 Mio. have the family status variable family status is listed as both married or married-missing. filled in. 
The variable takes on four possible values: living Since we know that these are wrongly matched pairs, we alone, cohabiting, single parent, or married. Table 4 depicts can assume that the same number of opposite-sex pairs was the distribution of family status values across all individuals wrongly matched as well. So, the estimated “true” number with a matched name within their location. Although 85% of couples in the subsample of couples with family sta- are missing the family status variable, of those in the data tus is 572,915 (578,088 matched M-F with age difference with a family status listed, approximately 64% are listed as <15 and family status married-married or married-missing married, 22% are listed as living alone, while the rest are ei- minus 5173 same-sex pairs with age difference <15 and ther cohabiting or are single parents. We investigate further married-married or married-missing status). Since our full by looking at the combination of family status for matched sample of matched couples (with family status) is made up pairs, shown separately by gender composition and age dif- of 649,643 (3,281,657–2,632,014) couples, our estimated ference (Table 5). When we look at male-female pairs with accuracy rate is 88.2% (572, 915 “true” couples/649,643 an age difference under 15 years, we see that, for couples total couples in our final sample of couples with family sta- with at least one family status listed, they are listed as ei- tus filled in for at least one of the members), or 11.8% error ther both married or one married-one missing family status rate. 89% of the time. This is far higher than for same-sex pairs 9 10 These are typically either people who are unemployed (in particu- We are again being conservative here, assuming that among the lar unemployment insurance recipients are required to register as job same-sex matched couples, none are true couples (same-sex civil seekers) or who expect to be unemployed soon. unions). 
As discussed before this is likely a very small group. K 38 D. Goldschmidt et al. Table 5 Family Status Composition, for matched couples Family Status Opposite sex Same sex Combinations Age diff <15 Age diff ≥15 Age diff <15 Age diff ≥15 Absolute Percent (%) Absolute Percent Absolute Percent Absolute Percent (%) (%) (%) Alone-alone 5762 0.89 9073 3.98 9854 17.65 6987 3.51 Alone-missing 26,692 4.11 69,514 30.50 28,148 50.43 61,258 30.76 Alone-cohabit 3124 0.48 6066 2.66 2538 4.55 5197 2.61 Alone-single 1795 0.28 16,050 7.04 594 1.06 14,573 7.32 parent Alone-married 9207 1.42 15,670 6.88 1391 2.49 15,553 7.81 Cohabit-cohabit 3248 0.50 2401 1.05 1337 2.40 2197 1.10 Cohabit-missing 7001 1.08 13,607 5.97 4331 7.76 12,815 6.44 Cohabit-single 757 0.12 9500 4.17 196 0.35 9348 4.69 parent Cohabit-married 5870 0.90 6764 2.97 303 0.54 7370 3.70 Single parent- 85 0.01 58 0.03 219 0.39 399 0.20 single parent Single parent- 5331 0.82 22,240 9.76 1595 2.86 21,261 10.68 missing Single parent- 2683 0.41 1055 0.46 136 0.24 1147 0.58 married Married-married 229,279 35.29 8078 3.54 637 1.14 1111 0.56 Married-missing 348,809 53.69 47,851 20.99 4536 8.13 39,925 20.05 Both Missing 2,632,014 – 574,932 – 129,498 – 529,116 – Total 3,281,657 – 802,859 – 185,313 – 728,257 – The sample includes all couples who we were able to match by location and name (according to our name-matching algorithm). Only individuals who are registered as job-seekers have the family status variable filled in We may expect fewer errors of this type in our matching is not covered in the IEB. In order to get a sense of what algorithm if we restrict our focus to coordinates with ex- share of couples we can identify in our data, we obtained actly two people – in this case, there are likely to be fewer the Scientific Use File of the Microcensus 2008 (see Boehle mismatched pairs of the type described above. 
When we re- 2010), to calculate the number of married couples in 2008 peat the accuracy rate estimation, restricting our sample to overall and the number of married couples that satisfy the matched couples living at coordinates where exactly 2 peo- sample restrictions that we have to apply in the IEB data. ple live, we find that to be the case: our estimated error rate Overall, there were 19,187,000 married couples in 2008; of is likely a bit lower, around 8.6% (see Appendix Table 7). those, about 9.2 Mio. were such that both spouses would While using the job-seeker data is helpful for estimat- live together, would be less than 15 years apart in age, and ing the likely fraction of false positives, it should be kept would be covered in the IEB data, i. e. either working in in mind that neither is this subsample representative, nor a social security covered job or being unemployed. Since, necessarily is family status measured without errors. It may in our final sample, we have 3.2 Mio. couples, we capture well be the case that we are overestimating or underesti- about one third of the total number of married couples that mating the number of false positives here. Overall, based match our baseline restrictions. on the two approaches discussed, we estimate that the frac- If the couple does not share a last name (or part of a hy- tion of false positives lies somewhere in the range of 6% to phenated name), then we would not capture them with our 11.8%. algorithm. Until 1991 it was required by German law that married couples share a last name, and even afterwards 4.2 Missing couples most change or hyphenate their last name upon marriage. Although we were not able to find official statistics on this Given the data we are using and the matching algorithm topic, according to several newspaper articles the share of we have developed, we are likely to have missed many true new couples who share a last name is around 85 to 90%. 
married couples, either among individuals who are in our Couples where one or both members are non-German are dataset (a form of type 2 error) or where at least one spouse the least likely to share a last name. K Identifying couples in administrative data 39 restriction does not exclude many true couples. There are more matched pairs in Fig. 1 where the man is around 25 years older than the woman, but Fig. 5 shows that that is ex- actly where the share of married/married is falling to zero, thus suggesting that here we have mainly pairs who are not matched to each other . Couples not living together on June 30th, 2008 are im- possible for us to identify with our data; however, we be- lieve that this situation is likely to be rare. If the couple lives at a location with more than 2 people with the same last name at the same coordinate, we have no way of knowing which two people are part of a couple, and so all are dropped (about 5.2 Mio.). We drop people who have inconsistent names across data sources, thus potentially omitting more couples from our Fig. 5 Share of matched pairs listed as married-married or married- sample (about 1.8 M). missing. (Note: Includes all male-female pairs of individuals who we We can get a sense of how representative our final sample were able to match by location and name (according to our name- matching algorithm), and where at least one member has the family of couples is by comparing their characteristics to those of status variable filled in. Age difference is calculated as man’s age – a truly representative sample of couples, those in the Mi- wife’s age) crocensus. Table 6 compares individual characteristics of people in our final sample of couples (column 3) to couples Couples where the age difference between the husband in the Microcensus in 2008. 
Column (6) shows all mar- and wife is more than 15 years are omitted from our sample ried couples in 2008, while column (7) shows all couples in an effort to ensure that we do not mistakenly include par- satisfying the restrictions of our algorithm in the IEB. In ent-child pairs in our sample. Although there are certainly terms of the age distribution, our men and women tend to married couples with a 15-year or larger age difference, the be a younger than those of all census couples; this can be number of these types of couples is quite small. For exam- explained by the fact that our sample only includes people ple, in the micro census, a representative survey of German in the workforce, so older workers who are more likely to households, the share of couples with a 16-year or more be retired are excluded. In addition, anyone married to a re- age difference was only 2% in 2008. tired person will be omitted from our final sample, since We also investigated the likely impact of our age restric- their spouse will not be in our original dataset. Comparing tion using the marital status variable available in the job the last column where we apply the same restrictions as in seeker data. For the subsample of couples where we have our matching algorithm, we find that the age distribution is the marital status for at least one of the two individuals, much closer to our matched couples. in Fig. 5 we plotted the share of couples where either both Looking next at the labor force status, we do not have were reported as married or one person was married and the the full range of labor force status options that are available other person’s marital status was missing. Matched couples in the micro census, since the IAB data only includes peo- where both are married seem to be very rare when the ple in the labor force but omits self-employed and public woman is older than 15 years than the man. This suggests servants. 
The couples in the last column of Table 6 look that there are almost no true couples that we are missing reasonably similar in terms of labor force status as our with the 15 years age difference restriction. On the other matched couples sample, although they are somewhat less end there is still a high share of couples where the man is likely to be unemployed. This might be because some long- around 15 to 20 years older than the woman where both are term unemployed who are in the IEB might be identified reported as married. If these are true couples, then we are as out of the labor force in the Microcensus, or because we excluding them from our set of likely married couples. No- are somehow more likely to identify unemployed individ- tice however that while the share is significant, Fig. 1 shows that there are almost no couples in the 15 to 20 years age This can also be seen from Table 5 if we look at the subsample of window (consistent with the information from the micro our matched couples with the family status variable available. Of the male-female pairs where both are listed as married, only 3% have an census), again suggesting that the 15 years age difference age difference of 15 years or more. Appendix Fig. 7 shows the same figure restricted to couples at lo- cations with exactly two individuals at the same coordinate. Consistent with out discussion in section 4.1, the match accuracy appears to be slightly higher at coordinates with just 2 individuals. K 40 D. Goldschmidt et al. 
Table 6 Comparing Individuals and Couples with Microcensus Individual level: Sample All individu- Final Matched Sample Microcensus Microcensus 2008 als 2008 restricted Restriction 2 People at >2 People at Coordinate Coordinate Number of individuals on 6.44 5.65 2 9.53 – – coordinate Age husband <35 0.33 0.12 0.07 0.17 0.07 0.11 ≥35 and <45 0.26 0.31 0.33 0.29 0.20 0.36 ≥45 and <65 0.39 0.54 0.57 0.51 0.43 0.53 ≥65 0.03 0.03 0.03 0.03 0.30 0.01 Age wife <35 0.31 0.17 0.11 0.23 0.11 0.17 ≥35 and <45 0.25 0.34 0.39 0.29 0.21 0.40 ≥45 and <65 0.42 0.47 0.48 0.47 0.41 0.43 ≥65 0.02 0.01 0.01 0.02 0.27 0.00 Labor Force Status Employee 0.84 0.88 0.93 0.83 0.60 0.93 Unemployed 0.13 0.10 0.06 0.13 0.03 0.07 Education Secondary/intermediate 0.78 0.82 0.81 0.83 0.69 0.72 school leaving certificate Upper secondary school 0.21 0.18 0.20 0.17 0.26 0.24 leaving Living in East Germany 0.15 0.17 0.16 0.17 0.20 0.21 Number of individuals 33,050,419 6,563,314 3,384,124 3,179,190 38,374,000 18,454,000 Couple Level: Restriction All matches –– – – – male/female Age difference no age difference 0.08 0.10 0.11 0.10 0.10 0.10 ≥1 and <4 0.41 0.51 0.52 0.49 0.47 0.51 ≥4 and <7 0.20 0.25 0.25 0.25 0.24 0.25 ≥7 and <11 0.09 0.11 0.10 0.12 0.11 0.11 ≥11 and <16 0.03 0.03 0.03 0.04 0.04 0.03 ≥16 0.19 0.00 0.00 0.00 0.05 – Nationality Both German 0.90 0.90 0.96 0.83 0.87 0.88 One German 0.06 0.07 0.03 0.10 0.08 0.06 Both non-German 0.04 0.04 0.01 0.06 0.05 0.06 Number of couples 4,084,516 3,281,657 1,692,062 1,589,595 19,187,000 9,227,000 – – – – – (SUF: n = (SUF: n = 109,073) 226,787) The table compares mean characteristics of the overall population of individuals in the IEB data in 2008 (Column 1), with the uniquely matched couples (Column 2–4) data and couples from the Microcensus 2008. 
Column 5 corresponds to all married couples in the Microcensus and column 6 to married couples that satisfy the same restrictions that we impose in our matching algorithm: husband and wife live together and are in social security covered job or unemployed and the age difference is less than 15 years K Identifying couples in administrative data 41 to a single last name, relating the frequency of that name in the overall population (on the x-axis) with the frequency of that name among matched couples (y-axis). The black line represents the 45 degree line. Amazingly almost all names are very close to the 45 degree line, suggesting that neither very rare nor very common last names are more or less likely to be matched. Again, while we are clearly not obtaining a representative sample, it is interesting that we do not seem to be biased against particularly common or rare last names. 5 Discussion and conclusion We present a new method for identifying a very large num- Fig. 6 Frequency of Surnames in Population and Among Matched Couples. (Note: Each dot in the figure represents one unique last name ber of pairs of individuals who are likely married to each in our data (such as Meier or Mueller). The figure shows the frequency other in the German administrative data. While room for of each last name in the overall population (x-axis) and among the sam- type 1 (false positives) and type 2 (false negatives) errors ple of matched couples (y-axis). The black line represents the 45 degree exists, our analysis suggests that our final sample still con- line, suggesting that most names are equally common in the overall population and among matched couples) tains about 89 to 94% actually married couples. An im- portant caveat is that due to the nature of the IEB, our uals as part of couples in the IEB. 
Interestingly, when we sample of married couples is not representative of all mar- restrict the matched couples data to a sample with exactly ried couples, but at best representative of couples where 2 people at a location (Column 4) the distribution is much both individuals are either working in a job that is cov- closer to the census. ered by social security (that is not civil service job or self In the bottom half of Table 6 we can compare the char- employed) or are unemployed and receiving benefits. Our acteristics of couples in the two different data sets. The comparison with the Microcensus from our baseline year distribution of age difference within couples of our final suggests that our matched couples look reasonably similar sample (column 3) is almost exactly the same as that of the to couples in this more restrictive sample frame, but even Microcensus when using the same restrictions as in our al- then we are more likely to pick up married couples who live gorithm (column 7). The couples in our sample are slightly in smaller buildings, such as single family homes, and thus more likely to be both German and less likely to be both probably couples who are either living in less densely pop- non-German than those of the micro census; as mentioned ulated areas or with higher income levels. Finally, since we earlier, non-Germans are less likely to change their name rely on last names our sample will miss all couples where at marriage than Germans are, and so are more likely to be the spouses do not share a name and this decision is likely omitted by our matching algorithm. Overall, although we correlated with other characteristics of the couple. 
miss many couples in our data set and may mistakenly in- While the representativeness of this matched couple data clude some pairs who are not truly married, the couples that is therefore clearly limited, many research questions do not we identify seem roughly similar to the universe of couples rely on a representative sample. Most natural experiments in Germany that satisfy the restrictions that are imposed in that have been used by applied researchers only affect a very the matching algorithm. selected subsample of the population (e. g. typical regres- Finally, we performed an additional check to see whether sion discontinuity or regression kink designs), but obtaining our algorithm is more likely to pick up very rare or very causally interpretable parameters with a high degree of in- common last names by comparing the distribution of last ternal validity is still very valuable even if it cannot easily names in the overall population with the distribution of last be extrapolated to the general population. names among matched couples. On the one hand, we might Overall, the method appears accurate enough to open the be more likely to find unique matches in the case of rare door for future research projects analyzing research ques- last names, in which case rare last names would be more tions in labor and public economics that rely on house- common in our matched couples data than in the overall hold (couple) identifiers using administrative data. We are population. On the other hand we might be more likely to working on making these identifiers available to external obtain false positives in the case of common last names, in researchers through the existing IAB research data infra- which case those would be overrepresented in our matched structure. We can readily imagine a wide number of pos- data. Fig. 6 shows a scatterplot, where each dot corresponds sible applications. For example, a long literature has stud- K 42 D. Goldschmidt et al. 
ied the added worker effect, which is whether spouses of our new identifier could be used is to study relative in- displaced workers respond to the job loss by increasing comes within married couples as for example in Bertrand their own labor supply (see for example Lundberg 1985, et al. (2015). Other areas where important work has been or Stephen 2002). Most existing work in this literature has done with the IAB data that could be extended using our relied on panel survey datasets such as the PSID or GSOEP. couple identifiers include for example the labor supply and Using our identifier, it will be possible to study the added mobility responses to immigration shocks (Dustmann et al. worker effect for a much larger sample of workers after 2016), or the effects of maternity leave policies on labor a variety of well identified shocks such as plant closings supply (Schönberg and Ludsteck 2014). or mass layoffs. Another promising area of research is to We believe that providing access to a new way to study study spillover effects of public programs. For example, household decisions and responses in administrative data Cullen and Gruber (2000) provide fascinating evidence that will inspire the research community to many new and cre- more generous unemployment insurance benefits reduce la- ative research projects. bor supply of spouses married to the benefit recipient. A lot Open Access This article is distributed under the terms of the of recent work on UI has been done with the German admin- Creative Commons Attribution 4.0 International License (http:// creativecommons.org/licenses/by/4.0/), which permits unrestricted istrative data (e. g. Schmieder et al. 2012, 2016) exploiting use, distribution, and reproduction in any medium, provided you give the large number of observations and clean sources of iden- appropriate credit to the original author(s) and the source, provide a tification such as age discontinuities in potential duration. 
link to the Creative Commons license, and indicate if changes were With the possibility to link married couples it will be pos- made. sible to use similar research designs to look at questions as in Cullen and Gruber (2000) to understand how households as a whole are affected by policies such as UI, active la- Appendix bor market policies or tax policies. Another example where Table 7 Family Status Composition, for matched couples living at coordinates with exactly 2 people total Different sex Same sex Age diff <15 Age diff ≥15 Age diff <15 Age diff ≥15 Combinations Absolute Percent Absolute Percent Absolute Percent Absolute Percent (%) (%) (%) (%) Alone-alone 1228 0.54 1504 2.08 2385 14.38 1278 2.01 Alone-missing 9956 4.42 30,437 41.99 10,765 64.92 27,634 43.44 Alone-cohabit 412 0.18 624 0.86 338 2.04 542 0.85 Alone-single 293 0.13 1922 2.65 95 0.57 1864 2.93 parent Alone-married 1170 0.52 4106 5.67 280 1.69 3840 6.04 Cohabit-cohabit 431 0.19 226 0.31 132 0.80 213 0.33 Cohabit-miss- 1742 0.77 3622 5.00 903 5.45 3537 5.56 ing Cohabit-single 98 0.04 1103 1.52 13 0.08 1136 1.79 parent Cohabit-mar- 915 0.41 1118 1.54 39 0.24 1122 1.76 ried Single parent- 22 0.01 8 0.01 15 0.09 40 0.06 single parent Single parent- 1595 0.71 4420 6.10 339 2.04 4154 6.53 missing Single parent- 357 0.16 212 0.29 11 0.07 223 0.35 married Married-mar- 47,922 21.25 1404 1.94 77 0.46 211 0.33 ried Married-miss- 159,344 70.67 21,774 30.04 1190 7.18 17,816 28.01 ing Both missing 1,466,577 – 326,445 – 67,477 – 302,644 – Total 1,692,062 – 398,925 – 84,059 – 366,254 – Includes all couples who we were able to match by location and name (according to our name-matching algorithm), restricted to couples living at coordinates where no other people are listed. 
Only individuals who are registered as job-seekers have the family status variable filled in

Fig. 7 Share of matched pairs listed as married-married or married-missing; 2 people at a coordinate. (Note: Includes all male-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm), and where at least one member has the family status variable filled in. Restricted to couples living at coordinates where exactly 2 people are located. Age difference is calculated as man's age – wife's age)

References

All-in: Immer mehr behalten ihren Geburtsnamen – Zahl der Ehen mit Doppelnamen bleibt seit Jahren gleich (2006). http://www.all-in.de/nachrichten/lokales/Immer-mehr-behalten-ihren-Geburtsnamen;art26090,215128, Accessed September 1, 2014

Bertrand, M., Kamenica, E., Pan, J.: Gender identity and relative income within households. Q J Econ 130(2), 571–614 (2015)

Boehle, M., Schimpl-Neimanns, B.: GESIS – Leibniz-Institut für Sozialwissenschaften (Ed.): Mikrozensus Scientific Use File 2008: Dokumentation und Datenaufbereitung. Bonn (GESIS-Technical Reports 2010/13) (2010). http://nbn-resolving.de/urn:nbn:de:0168-ssoar-207237

Cullen, J.B., Gruber, J.: Does unemployment insurance crowd out spousal labor supply? J Labor Econ 18(3), 546–572 (2000)

Dustmann, C., Schönberg, U., Stuhler, J.: Labor supply shocks, native wages, and the adjustment of local employment. Quarterly Journal of Economics 132(1), 435–483 (2017)

Frimmel, W., Halla, M., Winter-Ebmer, R.: Can pro-marriage policies work? An analysis of marginal marriages. Demography 51(4), 1357–1379 (2014)

Goldschmidt, D., Schmieder, J.F.: The rise of domestic outsourcing and the evolution of the German wage structure, Quarterly Journal of Economics (forthcoming)

Hardoy, I., Schøne, P.: Displacement and household adaptation: Insured by the spouse or the state? J Popul Econ 27(3), 683–703 (2014)

Hethey-Maier, T., Schmieder, J.F.: Does the use of worker flows improve the analysis of establishment turnover? Evidence from German administrative data. J Appl Soc Sci Stud – Schmollers Jahrb 2013 133(4), 477–510 (2013)

Huttunen, K., Kellokumpu, J.: The effect of job displacement on couples' fertility decisions. J Labor Econ 34(2), 403–442 (2016)

Jacobson, L.S., LaLonde, R.J., Sullivan, D.G.: Earnings losses of displaced workers. Am Econ Rev 83, No. 4, 685–709 (1993)

Janisch, W.: Namenswahl nach der Heirat: Bekenntnis zum Mann (2010). http://www.sueddeutsche.de/leben/namenswahl-nach-der-heirat-bekenntnis-zum-mann-1.79245, Accessed August 1,

Kammergericht Berlin: Eheregistereintragung: Schreibweise von Ehenamen und Begleitnamen (2013). http://www.gerichtsentscheidungen.berlin-brandenburg.de/jportal/?quelle=jlink&docid=KORE209412013&psml=sammlung.psml&max=true&bs=10, Accessed September 1, 2014

Lundberg, S.: The added worker effect. J Labor Econ 3, 11–37 (1985)

Schild, C.-J., Antoni, M.: Linking survey data with administrative social security data – the project "Interactions between capabilities in work and private life". Working paper series, vol. 2014–02. German Record-Linkage Center, Nürnberg, p 11 (2014)

Schmieder, J.F., von Wachter, T., Bender, S.: The effects of extended unemployment insurance over the business cycle: Evidence from regression discontinuity estimates over 20 years. Q J Econ 127(2), 701–752 (2012)

Schmieder, J.F., von Wachter, T., Bender, S.: The effect of unemployment benefits and nonemployment durations on wages. Am Econ Rev 106(3), 739–777 (2016)

Scholz, T., Rauscher, C., Reiher, J., Bachteler, T.: Geocoding of German administrative data. FDZ-Methodenreport, vol. 2012–09. Institute for Employment Research, Nürnberg (2012)

Schönberg, U., Ludsteck, J.: Expansions in maternity leave coverage and mothers' labor market outcomes after childbirth. J Labor Econ 32(3), 469–505 (2014)

Sperling, F.: Familiennamensrecht in Deutschland und Frankreich: eine Untersuchung der Rechtslage sowie namensrechtlicher Konflikte in grenzüberschreitenden Sachverhalten. Mohr Siebeck, Tübingen, p 226 (2012)

Statistisches Bundesamt: Bevölkerung und Erwerbstätigkeit – Haushalte und Familien Ergebnisse des Mikrozensus, 1st edn. 3. Statistisches Bundesamt, Wiesbaden (2012)

Stephens Jr, M.: Worker displacement and the added worker effect. J Labor Econ 20(3), 504–537 (2002)

Deborah Goldschmidt: Associate, Analysis Group

Wolfram Klosterhuber: Research Fellow, Institute for Employment Research (IAB)

Johannes F Schmieder: Assistant Professor, Department of Economics, Boston University; Peter Paul Career Development Professor, Assistant Professor

Journal for Labour Market Research, Springer Journals
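As a worked check of the accuracy arithmetic in section 4.1, the short sketch below recomputes the reported rates from the counts given in the text and in Tables 5 and 7. The function and variable names are illustrative only and do not come from the paper; the numbers are the published ones.

```python
def accuracy(n_matched: int, n_false: int) -> float:
    """Share of matched pairs assumed to be true couples."""
    return (n_matched - n_false) / n_matched


# Approach 1: same-sex matches as a proxy for false positives (section 4.1).
final_sample = 3_281_657        # matched male/female pairs, age diff < 15
same_sex_pairs = 185_313        # male/male + female/female pairs, age diff < 15
male_male_pairs = 131_550       # male/male pairs only (footnote 8)

acc_same_sex = accuracy(final_sample, same_sex_pairs)
acc_conservative = accuracy(final_sample, 2 * male_male_pairs)

# Approach 2: family status variable (Table 5, age diff < 15).
married_opposite = 229_279 + 348_809    # married-married + married-missing
married_same_sex = 637 + 4_536          # same categories, same-sex pairs
with_status = final_sample - 2_632_014  # pairs with at least one status listed
acc_family_status = (married_opposite - married_same_sex) / with_status

# Same calculation restricted to coordinates with exactly 2 people (Table 7).
married_opposite_2 = 47_922 + 159_344
married_same_sex_2 = 77 + 1_190
with_status_2 = 1_692_062 - 1_466_577
error_two_person = 1 - (married_opposite_2 - married_same_sex_2) / with_status_2

print(f"same-sex proxy accuracy:      {acc_same_sex:.1%}")
print(f"conservative proxy accuracy:  {acc_conservative:.1%}")
print(f"family-status accuracy:       {acc_family_status:.1%}")
print(f"two-person family-status err: {error_two_person:.1%}")
```

The two approaches bracket the false positive share quoted in the text, roughly 6% under the same-sex proxy and 11.8% under the family status check.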

Identifying couples in administrative data



Publisher
Springer Journals
Copyright
Copyright © 2017 by The Author(s)
Subject
Economics; Labor Economics; Sociology, general; Human Resource Management; Political Economy/Economic Policy; Regional/Spatial Science; Population Economics
ISSN
1614-3485
eISSN
2510-5027
DOI
10.1007/s12651-017-0218-4

Abstract

J Labour Market Res (2017) 50:29–43
DOI 10.1007/s12651-017-0218-4
ARTICLE

Deborah Goldschmidt (1) · Wolfram Klosterhuber (2) · Johannes F Schmieder (3,4,5,6,7)

Accepted: 10 January 2017 / Published online: 15 May 2017
© The Author(s) 2017. This article is an open access publication.

Abstract We develop a new method for identifying married couples in administrative data. Using address and name data from the universe of employment records in Germany we find around 3.3 Mio. pairs of individuals who are living at the same location, have a matching last name and are less than 15 years apart in age. We show supporting evidence that around 89 to 94% of these pairs are indeed married couples and provide careful consistency checks. Using information from the German Microcensus, we show that our method identifies about 17% of all married couples in Germany and about 35% of couples where both spouses are in social security covered jobs or unemployed. In ongoing work this couple identifier will be made available to the research community and users of the IAB administrative data. Our method thus opens the door for household level analyses benefiting from the precision and very large number of observations available in administrative data.

Keywords Couples · Geocoding · Administrative data · Household analysis

JEL Code J12

Identifizierung von Ehepaaren in Administrativen Daten

Zusammenfassung (German summary, translated) We develop a new method for identifying married couples in administrative data. Using address data and last names from the universe of employment notifications in Germany, we identify about 3.3 million pairs of persons who live at the same address, whose last names match, and who are less than 15 years apart in age. Using various consistency checks, we show that about 89 to 94 percent of these pairs are in fact married couples. Based on information from the Microcensus, we show that our method identifies about 17 percent of all married couples in Germany and about 35 percent of all couples in which both partners are in employment subject to social security contributions or unemployed. The couple indicator will be made available to the research community and to users of the IAB data. Our method thereby opens up new research possibilities for household analyses that benefit from the precision and the large numbers of observations of administrative data.

Johannes F Schmieder, johannes@bu.edu
(1) Analysis Group, New York City, USA
(2) IAB, Nuremberg, Germany
(3) Department of Economics, Boston University College of Arts & Sciences, 270 Bay State Road, Boston, Massachusetts 02215, USA
(4) NBER, Cambridge, USA
(5) IZA, Bonn, Germany
(6) CESIfo, Munich, Germany
(7) CEPR, London, United Kingdom

1 Introduction

Recent years have witnessed a dramatic rise in the use of administrative data in economic research, facilitated by increases in computing power and the availability of new administrative data sources. The main advantages of administrative data have been large sample sizes compared to survey data, often covering the entire universe; the ability to follow the units of observation over time; and the high quality of recorded information. This shift has been particularly forceful in Labor and Public Economics, where the availability of individual level employment and tax records has led to the rise of new research designs such as regression discontinuity, regression kink or bunching designs that rely on very large sample sizes. While administrative data offer many advantages, they also come with limitations, and the scope of available variables is often quite limited compared to household surveys. In particular, administrative employment records are typically on the individual level only and it is often not possible to link individuals to other household members. For this reason, administrative data have played a smaller role in studying traditional questions in labor economics, such as household labor supply, household investment decisions in human capital or within household income differences.

In this project we develop a new method to impute household identifiers in the administrative employment records data in Germany to increase the scope of research questions that can be addressed. Our approach is to identify pairs of individuals who are, with a high probability, married couples using information on addresses, family names and dates of birth. In Germany it is still very common that at the time of marriage one spouse (in the vast majority of cases the wife) adopts the other spouse’s last name, either fully or as part of a double name. If two individuals with matching last names are living together at the same address, they are likely related, though they could also be in a sibling or parent-child relationship. To further narrow it down to married couples we take pairs of a woman and a man with matching last names with an age difference of less than 15 years, which should exclude most parent-child relationships. We present a detailed analysis of the likely extent of errors when applying this method. The new identifiers for married couples will be made available to external researchers and users of the IAB administrative datasets, facilitating a broad range of possible research projects that rely on household/couple identifiers, something to which we return in the conclusion.

Germany has a long tradition of women taking on their husbands’ last name at the time of marriage. The German Civil Code from 1896 unequivocally required that the wife takes on the name of her husband. A reform in 1953 allowed for the wife to keep her birth name as part of a double (or hyphenated) last name, but she was still required to take on her husband’s name as the family name. The family name law was revised again in 1970, allowing a couple to decide to take on the wife’s name as the family name, but it kept the requirement of a common family name for both spouses. Furthermore, if a couple could not come to an agreement with respect to which name would become the family name, the decision was up to the husband. This only changed with a decision by the German constitutional court in 1991 and a subsequent revision of the family name law in 1994, after which both spouses were allowed to keep their own birth names, while the traditional option of taking on one of the birth names or a hyphenated double name for one of the spouses continued to exist. In practice it appears that it is still the case that the vast majority of women take on their husband’s names either fully or at least as part of a double name. While we are not aware of representative surveys or official registry data for Germany that would allow us to calculate the share of couples with matching last names, we found various press reports from city level wedding registries that seem to suggest that even among newly wedded couples around 85 to 90% still have matching last names. Among couples married for longer (and in particular before 1994), the ratio is likely significantly higher.

We implement the method of identifying likely couples using last names, addresses and age using a cross-section of the administrative data from the Institute for Employment Research (IAB) in Germany spanning the universe of employment and unemployment records for 2008. This data, called Integrated Employment Biographies or IEB, covers all individuals who are employed in employment subject to social security contributions, receive benefits from the unemployment insurance (UI) system, or who are registered as job seekers. This data covers around 80% of employees, in particular excluding public servants and the self-employed. By design we are only able to identify married couples where both spouses are covered in the IEB. While this is certainly not a representative sample and excludes a sizable part of the population of couples, we are still able to identify over 3 Mio. couples who are likely married to each other.

The two main concerns with this approach are the potential for false positives and false negatives. False positives may arise because people with matching last names may live at the same address either purely by chance, or because they are related to each other but not married. Using the distribution of same-sex matching name pairs, as well as information on family status for a subset of individuals, we show that likely around 88–94% of our sample of couples are indeed married to each other. Even if both spouses of a married couple are in our data, false negatives may arise, because we may not match them to each other. Either they do not have matching names or there are more than 2 matching individuals at a location, making it impossible to tell who is married to whom. False negatives will also arise whenever one or both members of a marriage are not covered in the IEB data, which for example would include all self-employed, public servants or individuals not in the labor force, but also all individuals older than age 65. Using information from the Microcensus, we show that we can identify roughly 20% of the 19 Mio. married couples in Germany. Furthermore, we identify about one third of married couples where both individuals are covered by the IEB data (i. e. working in a social security covered job or unemployed). We compare observable characteristics of our matched couples with the official microcensus data to show how our sample differs from the general population of married couples. While the representativeness of the matched couples is clearly limited, many research questions do not rely on having a representative sample. The large number of observations and the possibility to observe complete employment histories in the IAB data should make this data a valuable tool for many research projects. We will return to a discussion of how this data can be used in the conclusion.

This paper is related to other research that uses the special features of administrative data to impute information that is not directly available. For example, Jacobson, Lalonde and Sullivan (1993) use the combination of individual and firm identifiers in UI records from Pennsylvania to impute plant closings and mass-layoffs by observing when large numbers of individuals are moving away from firm identifiers and are scattered across many other employers. Hethey-Maier and Schmieder (2013) use a similar approach to identify new plant openings in administrative data, relying on worker flow information to distinguish plant openings from spurious changes in firm identifiers. Goldschmidt and Schmieder (2015) identify outsourcing of labor services in large firms employing an algorithm based on a combination of worker flows, industry and occupation codes.

The next section describes the data used in this project. Sect. 3 describes our method for identifying couples and presents the results based on individuals in 2008. In Sect. 4 we show supportive evidence that our method does in fact largely identify married couples and develop bounds on the fraction of false positives. We then present characteristics of the couples that we identify with our method and compare them to the general population in the German employment data, as well as to other data sources. Sect. 5 concludes.

Footnote 1: While some countries do allow for linking households in their administrative registry data, resulting in exciting and influential work, these countries tend to be relatively small and geographically clustered, such as Austria (Frimmel et al. 2014) or the Scandinavian countries (e. g. Hardoy and Schøne 2014 or Huttunen and Kellokumpu 2016). Expanding the scope of administrative data to other countries will be very valuable to study household behavior in new contexts.
Footnote 2: See Sperling (2012) for a discussion of the legal history of the family name law in Germany.
Footnote 3: All-in (2006) report that in Kempten in 2006 around 14% of newly married couples keep separate names. Janisch (2010) reports that a small survey among marriage registries in several German cities yielded that around 10 to 20% of couples keep separate names. This also seems to refer to newly married couples, which suggests that the ratio of couples with separate names among the pool of existing couples is likely much lower.

2 Data sources

In this chapter, the sources of the data are explained in detail. Sect. 2.1 describes the Integrated Employment Biographies (IEB) data, while the geocoded location data and the individual name data are discussed in 2.2 and 2.3.

2.1 Integrated employment biographies

The IEB of the Institute for Employment Research stems from the notification process of the social security system of the Federal Employment Agency (BA). The IEB is the basis for most of the widely used research datasets provided by the IAB to the research community, such as the SIAB data, the LIAB data, the BHP and many others. The IEB consolidates completed, historicized and edited process data from different data sources, which come from different operative systems. It comprises all persons registered with the Federal Employment Agency due to the following:
● Employment subject to social security or marginal part-time employment.
● Receipt of unemployment insurance benefits in accordance with Social Code Book II or III.
● Job search registered with local employment agencies.
● Planned or actual participation in an employment or training program.

The IEB includes demographic variables such as nationality, birthdate, gender, and education. Information on employment, benefit receipt and job search includes daily wage, daily benefit rate, occupational and employment status or economic activity. Additionally, location data such as place of residence or work on different aggregated levels are provided. There were around 35 Mio. working individuals in Germany in 2008 (own calculations based on Microcensus data), about 80% of whom have at least one record in the IEB. The biggest groups which are not included in the biographies are self-employed workers and public servants called Beamte.

We also have information on family status (married, living alone, single parent, cohabitating), but only for the subset of individuals who are unemployed and registered as job seekers. We use this information in Sect. 4 for various consistency checks.

2.2 Geocoded data

Our method relies on finding individuals living at the same location. In principle individuals can be matched to other individuals at the same location either by directly comparing addresses, or by first geocoding addresses into latitude/longitude coordinates and then comparing coordinates. Matching addresses directly is complicated by the fact that these can often be written in a variety of ways and need to be carefully cleaned. We instead match individuals on geographic coordinates, where the address processing was done using GIS software, which allows for careful error correction methods. The geocoding was done in a project between the Research Data Centre (FDZ) and the University of Duisburg-Essen for a cross-section of all individuals in the IEB data as of June 30th, 2008. This project used data from the Federal Agency for Cartography and Geodesy, which includes 22 Mio. addresses of German buildings and their geographic coordinates, and it was possible to successfully geocode 94.6% of the IEB records. Individuals whose addresses are not geocoded were dropped from the data and are not used in the further analysis.

2.3 Names

One of the criteria that we use for determining couples is whether the last names of two people match. We therefore also obtained data on last names covering the universe of individuals who have a record in the IEB as of June 30th, 2008. In order to improve the probability of success in matching, we first clean the names of errors and typos, and ensure consistency in terms of special characters and titles. With the support of the German Record Linkage Centre (GermanRLC) and their algorithm, the names of the individuals were cleaned, taking into account certain patterns and potential discrepancies. Umlauts were substituted (ä → ae and so forth) as well as ß to ss. All blank spaces in the front, middle or end of the name were removed. Professional and nobility titles (such as Dr., Prof., Freiherr von) were removed as well, and special characters (e. g. ~ or %) and non-ASCII characters (e. g. © or ™) were deleted.

The only special character that was retained is the hyphen (-), which is used to indicate double names. While the family name law in the civil code book, which states that a spouse can add their birth name to the family name, does not specifically mention a hyphen, in practice this appears to be the only option. In fact a court decision from 2013 specifically ruled that a couple was not allowed to combine the birth names of two spouses without a hyphen (Kammergericht Berlin 2013). Furthermore, individuals are not allowed to create last name chains that involve more than one hyphen (for example if at the time of marriage an individual already has a double name from a previous marriage). We thus assume that double names are always separated by a hyphen and we describe below how we use hyphenated names in our name-matching algorithm. At the end of the cleaning process all letters were converted to upper case.

Although individuals have a consistent personal identifier, the Einheitliche Statistische Person (ESP), the last name may vary across different data sources. If, after the name cleaning process was completed, discrepancies persisted in the names across data sources, the individual was dropped. The exception was when an individual had a double last name in one source and an overlapping single last name in another (e. g. MUELLER-MEIER in one source and MEIER in another). In this case, the double last name was kept.

Footnote 4: See Schild and Antonio (2014) p. 3.
Footnote 5: See Scholz et al. (2012). That paper is based on geocoded data from 2009, but 2008 was also geocoded as part of the same project. We decided to use 2008 as a baseline to allow for more analysis years after the couples are identified, which seemed useful for many possible research questions. In the future we hope to expand the procedure to more years.
Footnote 6: See for example Schild and Antonio (2014) p. 4ff.

3 Identifying couples

As mentioned previously, although the IEB data consists of a large amount of information on the majority of the German population, it – like many administrative data sets – does not include any information on the household. To circumvent this issue, we combine the IEB data with the geocoded location data and information on names to infer probable married couples. We use the following criteria to ensure that the matches we identify are most likely married couples and not simply two people with some other type of relationship (or no relationship at all):
1. Same home location.
2. Uniquely matching last name.
3. One male, one female, with an age difference of less than 15 years.

We go into more detail on each of these requirements below.

3.1 Location

The first step in identifying potential married couples is finding people who live at the same location, since most married couples live together. We start by looking at the distribution of the number of individuals at a particular location, using each person’s geocoded coordinates, for the ~33 Mio. people in our data. The second column of Table 1 shows this distribution. Coordinates with a small number of individuals likely represent single-family homes, while coordinates where a larger number of individuals live are likely apartment buildings or other multi-unit residences. About 5 Mio. individuals live alone at a coordinate – we eliminate these people from our set of potential couples, leaving us with about 28 Mio. individuals.
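The location step described above can be illustrated with a small sketch. This is not the authors' implementation: the record layout (a person id plus a latitude/longitude pair), the toy data and the function names are all assumptions made for illustration. It groups individuals by coordinate, tabulates how many individuals live at coordinates of each size (the analogue of the second column of Table 1), and drops individuals who are alone at their coordinate.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (person_id, (latitude, longitude)).
# The real IEB record layout is not described at this level in the text.
records = [
    ("p1", (52.5200, 13.4050)),
    ("p2", (52.5200, 13.4050)),
    ("p3", (48.1372, 11.5756)),
    ("p4", (50.9375, 6.9603)),
    ("p5", (50.9375, 6.9603)),
    ("p6", (50.9375, 6.9603)),
]


def group_by_coordinate(records):
    """Collect the ids of all individuals observed at each coordinate."""
    groups = defaultdict(list)
    for person_id, coordinate in records:
        groups[coordinate].append(person_id)
    return dict(groups)


def occupancy_distribution(groups):
    """Count individuals by the number of people sharing their coordinate."""
    distribution = Counter()
    for members in groups.values():
        distribution[len(members)] += len(members)
    return dict(distribution)


def drop_singletons(groups):
    """Keep only coordinates where at least two individuals live."""
    return {coord: members for coord, members in groups.items() if len(members) >= 2}


groups = group_by_coordinate(records)
distribution = occupancy_distribution(groups)
candidates = drop_singletons(groups)
```

The sketch compares coordinates by exact equality, which assumes that all residents of a building were geocoded to an identical coordinate pair; whether any rounding or tolerance is needed in practice is not stated in the text.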
About 7.4 Mio. individuals live at a location with exactly 1 other person in the dataset; as the number of people living at a coordinate gets larger, the absolute number of people living in this type of residence decreases.

Table 1 Distribution of the Number of Individuals at the Same Coordinate

Number of individuals   Total number      Number of individuals   Percent
at coordinate           of individuals    with matched names      matched (%)
1                        4,956,761        –                       –
2                        7,443,038        5,082,600               68.29
3                        4,911,162        1,024,758               20.87
4                        3,061,944          651,742               21.29
5                        1,998,695          473,896               23.71
6                        1,589,814          396,944               24.97
7                        1,345,134          347,244               25.81
8                        1,154,712          305,390               26.45
9                          971,325          259,734               26.74
10                         807,360          219,600               27.20
11                         673,090          182,466               27.11
12                         548,928          147,280               26.83
13                         451,828          120,658               26.70
14                         366,646           96,724               26.38
15                         304,245           79,844               26.24
16                         254,032           66,272               26.09
17                         209,984           53,700               25.57
18                         177,840           45,022               25.32
19                         151,734           37,638               24.81
20                         131,940           32,064               24.30
>20                      1,540,207          372,596               24.19
Total                   33,050,419        9,996,172               30.25

Second column includes all geocoded data as of June 30th 2008. Third column includes all individuals with geocoded location for whom we were able to match according to our name-matching algorithm, described in the text.

3.2 Names

Next, we look at the cleaned names of the individuals living within any given location. We require that our identified married couples share a last name. In situations where any of the people in the location has a hyphenated name, we consider two names to be a match if at least one part of the hyphenated name is identical to another name at the location. In locations with multiple people, we additionally require that a maximum of two people have matching names. Otherwise, we have no way to determine which two individuals are likely to be a couple and which may be unrelated, or related in other ways. The following examples help to clarify the procedure.

In Example 2.1 (Table 2), there are 5 individuals living at a particular coordinate. Two have the last name COHLE, and there are no others named COHLE at this location, so they are kept as a potential match. Two are named HART, with no others named HART, and so they are also kept as a potential match. Finally, there is a single person named MEIER, who is dropped from our potential group of couples. In Example 2.2 (Table 2), we again have 5 individuals living at the same coordinate: three have the last name COHLE, one has the last name HART, and one has the last name HART-MEIER. Because there are more than 2 individuals at this location with the last name COHLE, we cannot be certain which of these are part of a couple and which are not, so we drop all three. Because HART and HART-MEIER share a partial name, even though one is hyphenated, they are kept as a potential match. In Example 2.3 (Table 2), there are again 5 individuals at the same coordinate. Because COHLE, COHLE and COHLE-MEIER all match in terms of their names, we must eliminate all three, since we have no way of knowing which two could really be a couple. Similarly, MEIER, MEIER-MUELLER and COHLE-MEIER must all be dropped, despite their names matching. Therefore, in this example, there is no match chosen.

Table 2 Examples of the name-matching procedure

Number of individuals   Last name        Potential couple
at coordinate
Example 2.1 (a)
5                       COHLE            Match
5                       HART             Match
5                       COHLE            Match
5                       MEIER            No match
5                       HART             Match
Example 2.2 (b)
5                       COHLE            No match
5                       HART             Match
5                       COHLE            No match
5                       COHLE            No match
5                       HART-MEIER       Match
Example 2.3 (c)
5                       COHLE-MEIER      No match
5                       MEIER            No match
5                       COHLE            No match
5                       COHLE            No match
5                       MEIER-MUELLER    No match

These are provided as examples only, and are not taken from the actual data.
(a) matches HART and COHLE are chosen
(b) match (HART-)MEIER is chosen
(c) no match is chosen

After running this algorithm over the 28 Mio. individuals, we are left with about 5 Mio. pairs (ten million individuals) who share a location and last name. The third and fourth columns of Table 1 show the number and percent of people that were matched through this algorithm, organized by the number of individuals at a location. For coordinates with only 2 individuals, almost 70% had matching names. At coordinates with 3 or more people found at the same location, the match rate is between 20 and 30%.

There are several limitations to this criterion. First, while the majority of married couples in Germany share a last name (or part of a double name), not all women (or men) change their last name upon marriage, and we are certain to miss those couples. Second, in locations with multiple people where more than two share a last name, since we cannot be certain which two members are married (if any), we must drop them all, eliminating more potential matches from our sample. Finally, we may be capturing two people with the same last name living in the same coordinate who are related but not married. In addition, particularly in multi-unit residences, there may be two people who are unrelated but have the same last name, and we may erroneously be including them in our sample. Our next criteria, on gender and age, will eliminate some of these falsely matched people from our sample, but not all.

3.3 Gender and age

Finally, we take our set of potential couples – groups of two people who share a last name and a location – and impose gender and age restrictions. Since we are currently only identifying heterosexual couples, we require that each couple be composed of one male and one female, information that is available in the IAB records. The second column of Table 3 presents the gender composition breakdown for the 5 Mio. identified potential couples. More than 4 Mio. of these pairs consist of one male and one female, while the remainder is made up of either two males or two females. We drop the single-sex households and move on to the age difference requirement.

Table 3 Gender Composition of Matched Potential Couples

                 All matches               Age Difference <15        Age Difference ≥15
Matches          Absolute    Percent (%)   Absolute    Percent (%)   Absolute    Percent (%)
Male/female      4,084,516    81.72        3,281,657    94.65          802,859    52.44
Male/male          482,891     9.66          131,550     3.79          351,341    22.95
Female/female      430,679     8.62           53,763     1.55          376,916    24.62
Total            4,998,086   100.00        3,466,970   100.00        1,531,116   100.00

Includes all individuals with geocoded location for whom we were able to match according to our name-matching algorithm, described in the text.

We first look at the distribution of age differences among matched pairs by gender composition. Fig. 1 graphs the distribution of the age difference between the two members of the couple, defining the difference as the man’s age minus the woman’s age. The majority of the mass lies between –15 and +15. This likely includes the majority of married couples, although it could also include brother-sister pairs (or other closely-aged family members, such as cousins). It may also include some unrelated people who simply live in the same location and have the same last name. There is a smaller mass for pairs where the female is 20–40 years older than the male, which is likely to include mothers living with their sons, and an even smaller mass for pairs where the male is 20–40 years older than the female, which likely includes father-daughter pairs. These parent-child relationships may either be single parents or families where only one of the parents is working in employment covered in the IEB. The fact that there seem to be more mother-son pairs than father-daughter pairs is likely explained by the fact that there are more single mothers than single fathers.

Fig. 1 Distribution of age differences of matches, male/female. (Note: Includes all male-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as man’s age – woman’s age)

Figs. 2 and 3 show the age difference distribution for matched pairs with the same gender, where the age difference is defined as the older age minus the younger age. For both of these, the majority of pairs fall between 15 and 40, which is likely to consist mainly of mother-daughter or father-son pairs. There is also some mass for pairs with an age difference of 0–15 years; these may be siblings or other familial relationships, homosexual couples, or other pairs of people who coincidentally have the same last name in the same location. While homosexual couples have been able to form a civil union in Germany since 2001, which allows them to adopt a common family name, these still seem to be relatively rare, with only 34,000 same sex civil unions in 2011 (Statistisches Bundesamt 2012). Thus while a small part of the same sex matches might be same sex couples, most of them are not. The fact that the number of same sex matched individuals in our sample is quite small suggests that there are relatively few cases where people live together with the same last name for other reasons than being married to each other, and that in turn most matched individuals who are living with each other in this age group are in fact married to each other.

Fig. 2 Distribution of age differences of matches, female/female. (Note: Includes all female-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as older age – younger age)

Fig. 3 Distribution of age differences of matches, male/male. (Note: Includes all male-male pairs of individuals who we were able to match by location and name (according to our name-matching algorithm). Age difference is calculated as older age – younger age)

For determining our sample of couples, we require that the difference in age of the matched man and woman be less than 15 years. This should eliminate any mother-son or father-daughter pairs from the set of couples. The remaining pairs – consisting of one man and one woman, with matching last names, who live in the same location and are less than 15 years apart in age – make up our final sample. Columns 4–5 of Table 3 show the results when we impose our age difference restriction. We retain 80% of our male-female couples, leaving us with a final sample of about 3.3 Mio. couples. This sample should be primarily composed of true couples, although some share will be “false positives”, made up of male-female siblings or family members who are similar in age, or unrelated people with the same name living at the same coordinates.

4 Consistency checks

Errors in our matching algorithm could occur in two ways. First, we have false positives – two people who are matched to each other by our algorithm, but who are not really a married couple. Second, there are couples that we do not pick up with our matching method, for various reasons. We discuss these two issues, and the steps we take to quantify their magnitude, below.

4.1 False positives

One type of error that could occur is when our algorithm matches two people who are not really married to each other, also known as type 1 error. Pairs in our sample may be wrongly matched if: (1) they are brother and sister, or have some other family relationship, are close in age, and live in the same location; or (2) they are unrelated, but living in a multi-unit residence, such as an apartment building, and happen to have the same last name and are close in age.

We can try to measure the size of this type of error in our final sample of couples in a few ways. First, we can use the distribution of same-sex matches to give us a sense of what share of our sample is wrongly matched if we make the following two assumptions. The first assumption is that opposite-sex family members who are close in age (i. e. brother and sister) are as likely to live together as same-sex family members (two sisters, for example). The second is that it is as likely for two people of the opposite sex who live in the same building to share a last name as it is for two people of the same sex. Using these assumptions, we can look at the number of same-sex matched pairs that fall within our age difference restriction (ages within 15 years of each other), using the numbers provided in Table 3 – these couples are likely either pairs of family members living in the same location, or unrelated people with the same last name in the same building. We find that there are 185,313 male/male and female/female pairs that fall within our age restriction. So, it is likely that approximately 185,000 couples in our sample of matched male-female couples with age difference under 15 years are also wrongly matched. In fact, since there are some same-sex civil unions where partners share a family name, this arguably overestimates the number of false positives by a small amount. Using this methodology, our accuracy rate is around 94% (final sample is 3,281,657; estimated wrongly matched is 185,313; correctly matched = 3,281,657 – 185,313 = 3,096,344; accuracy rate = correctly matched/final sample = 3,096,344/3,281,657 = 94%). So, according to this method, only about 6% of our sample is wrongly matched and our sample does indeed identify couples who with a high degree of certainty are married to each other.

We can also use this approach to get a sense of whether the accuracy of matches varies by the number of individuals living at the same coordinate. Intuitively, in large apartment buildings with many units it is more likely to have two individuals with matching last names who are unrelated. Fig. 4 shows the match accuracy by the number of individuals at the same coordinate. The accuracy rate is clearly the highest at coordinates with just 2 individuals, with a match accuracy rate of 95%. At coordinates with 3 individuals the match accuracy rate appears significantly smaller, which may be because these are still likely single family homes with one or more of the children working, which may lead to more male/male or female/female matches.

Fig. 4 Match Accuracy calculated based on Same-sex Matches by Number of Individuals living at same coordinate. (Note: In this figure we calculate the likely probability that a matched couple is indeed married to each other. For this we assume that the number of matched male/male and female/female couples obtained by using the same algorithm as for matched male/female couples is a proxy for the number of false matches. The accuracy can then be calculated as: [N(f/m) – N(f/f) – N(m/m)]/N(f/m). The figure plots this accuracy rate as a function of the number of individuals who live at the same coordinate where the couple is matched)

Table 4 Family Status of Individuals in Matched Couples Sample

Family Status    Absolute Number of Individuals   Percent among non-missing (%)
Living alone       340,722                        21.98
Cohabiting         113,153                         7.30
Single parent      109,783                         7.08
Married            986,480                        63.64
Missing          8,446,034                        –
Total            9,996,172                        –

Includes all individuals who we were able to match by location and name (according to our name-matching algorithm). Only individuals who are registered as job-seekers have the family status variable filled in.

with age difference under 15 years, who are listed as both married or one married-one missing only 9% of the time. Male-female couples with an age difference of 15 years or more are listed as both married or one married-one missing 25% of the time. This could either indicate that there are some married couples with an age difference of larger than 15 years, but could also be because these are indeed parent-child relationships where the spouse is not covered in the data (or does not share a last name).

Using the information in Table 5, we can also estimate the share of matches in our final sample that are likely to be true couples and not wrongly matched people (i. e. our “accuracy rate”) using the subsample of couples with at least one family status listed. If we think that the family status variable is accurate, then the set of “true” couples in our sample should be 578,088: the number of couples who are listed as either being both married or one mar-

Footnote 7: Statistisches Bundesamt (2012) states that there are about 34,000 same sex civil unions in Germany in 2011. We do not know how common it is for same sex couples to adopt a common family name, nor that they would both be employed and covered in our data. It appears that due to the small number of same sex civil unions our method for identifying male-female marriages would not work as well for identifying same sex civil unions.
Footnote 8: Here we assumed that two opposite sex individuals with matching last names who are not married are equally likely to live together as two same sex individuals, averaging over male-male and female-female pairs. A more conservative assumption would be to assume that opposite-sex pairs that are not married are as likely to live together as male-male pairs, i. e. 2*131,550 = 263,100, leading to an accuracy rate of 92%. We thank an anonymous referee for pointing this out.
For coordinates with ried, on missing family status. Even within these there may more individuals living at the same location the accuracy be individuals who were mistakenly matched. For exam- rate falls slightly but remains above 90% at least until 50 in- ple, there may be a job-seeking man with the last name dividuals at the same coordinate. Past that the number of MUELLER, whose wife is out of the workforce (and hence observations becomes quite small and the estimated accu- is not included in the IEB data), living at the same coor- racy rate becomes quite noisy, though it continues to hover dinates as a similarly-aged jobseeker woman with the last between 85 and 95%. Future researchers may want to re- name MUELLER whose husband is not in the IEB data ei- strict their analysis sample to couples with fewer number of ther. Our matching algorithm would connect these two job- individuals at the same coordinate if they want to maximize seekers, who are both listed as being married, even though the accuracy rate. they are not actually married to each other. If we think Next, we use the “Family Status” variable to perform an that it is as likely for two individuals of the same gender additional check on the validity of our sample. This vari- to be wrongly matched in this way as it is for two oppo- able is available as part of the Jobseeker-History ((X)ASU) site-gender individuals, then we can use the information dataset, and thus is only filled in for a small subset of people on family status for same-sex pairs for our accuracy esti- – those who are registered as job seekers as of June 30th, mate. Specifically, there are 5173 (637 + 4536) same-sex 2008. From our sample of approximately 10 Mio. matched matched pairs with age difference less than 15 years where individuals, about 1.5 Mio. have the family status variable family status is listed as both married or married-missing. filled in. 
The variable takes on four possible values: living Since we know that these are wrongly matched pairs, we alone, cohabiting, single parent, or married. Table 4 depicts can assume that the same number of opposite-sex pairs was the distribution of family status values across all individuals wrongly matched as well. So, the estimated “true” number with a matched name within their location. Although 85% of couples in the subsample of couples with family sta- are missing the family status variable, of those in the data tus is 572,915 (578,088 matched M-F with age difference with a family status listed, approximately 64% are listed as <15 and family status married-married or married-missing married, 22% are listed as living alone, while the rest are ei- minus 5173 same-sex pairs with age difference <15 and ther cohabiting or are single parents. We investigate further married-married or married-missing status). Since our full by looking at the combination of family status for matched sample of matched couples (with family status) is made up pairs, shown separately by gender composition and age dif- of 649,643 (3,281,657–2,632,014) couples, our estimated ference (Table 5). When we look at male-female pairs with accuracy rate is 88.2% (572, 915 “true” couples/649,643 an age difference under 15 years, we see that, for couples total couples in our final sample of couples with family sta- with at least one family status listed, they are listed as ei- tus filled in for at least one of the members), or 11.8% error ther both married or one married-one missing family status rate. 89% of the time. This is far higher than for same-sex pairs 9 10 These are typically either people who are unemployed (in particu- We are again being conservative here, assuming that among the lar unemployment insurance recipients are required to register as job same-sex matched couples, none are true couples (same-sex civil seekers) or who expect to be unemployed soon. unions). 
As discussed before this is likely a very small group. K 38 D. Goldschmidt et al. Table 5 Family Status Composition, for matched couples Family Status Opposite sex Same sex Combinations Age diff <15 Age diff ≥15 Age diff <15 Age diff ≥15 Absolute Percent (%) Absolute Percent Absolute Percent Absolute Percent (%) (%) (%) Alone-alone 5762 0.89 9073 3.98 9854 17.65 6987 3.51 Alone-missing 26,692 4.11 69,514 30.50 28,148 50.43 61,258 30.76 Alone-cohabit 3124 0.48 6066 2.66 2538 4.55 5197 2.61 Alone-single 1795 0.28 16,050 7.04 594 1.06 14,573 7.32 parent Alone-married 9207 1.42 15,670 6.88 1391 2.49 15,553 7.81 Cohabit-cohabit 3248 0.50 2401 1.05 1337 2.40 2197 1.10 Cohabit-missing 7001 1.08 13,607 5.97 4331 7.76 12,815 6.44 Cohabit-single 757 0.12 9500 4.17 196 0.35 9348 4.69 parent Cohabit-married 5870 0.90 6764 2.97 303 0.54 7370 3.70 Single parent- 85 0.01 58 0.03 219 0.39 399 0.20 single parent Single parent- 5331 0.82 22,240 9.76 1595 2.86 21,261 10.68 missing Single parent- 2683 0.41 1055 0.46 136 0.24 1147 0.58 married Married-married 229,279 35.29 8078 3.54 637 1.14 1111 0.56 Married-missing 348,809 53.69 47,851 20.99 4536 8.13 39,925 20.05 Both Missing 2,632,014 – 574,932 – 129,498 – 529,116 – Total 3,281,657 – 802,859 – 185,313 – 728,257 – The sample includes all couples who we were able to match by location and name (according to our name-matching algorithm). Only individuals who are registered as job-seekers have the family status variable filled in We may expect fewer errors of this type in our matching is not covered in the IEB. In order to get a sense of what algorithm if we restrict our focus to coordinates with ex- share of couples we can identify in our data, we obtained actly two people – in this case, there are likely to be fewer the Scientific Use File of the Microcensus 2008 (see Boehle mismatched pairs of the type described above. 
When we re- 2010), to calculate the number of married couples in 2008 peat the accuracy rate estimation, restricting our sample to overall and the number of married couples that satisfy the matched couples living at coordinates where exactly 2 peo- sample restrictions that we have to apply in the IEB data. ple live, we find that to be the case: our estimated error rate Overall, there were 19,187,000 married couples in 2008; of is likely a bit lower, around 8.6% (see Appendix Table 7). those, about 9.2 Mio. were such that both spouses would While using the job-seeker data is helpful for estimat- live together, would be less than 15 years apart in age, and ing the likely fraction of false positives, it should be kept would be covered in the IEB data, i. e. either working in in mind that neither is this subsample representative, nor a social security covered job or being unemployed. Since, necessarily is family status measured without errors. It may in our final sample, we have 3.2 Mio. couples, we capture well be the case that we are overestimating or underesti- about one third of the total number of married couples that mating the number of false positives here. Overall, based match our baseline restrictions. on the two approaches discussed, we estimate that the frac- If the couple does not share a last name (or part of a hy- tion of false positives lies somewhere in the range of 6% to phenated name), then we would not capture them with our 11.8%. algorithm. Until 1991 it was required by German law that married couples share a last name, and even afterwards 4.2 Missing couples most change or hyphenate their last name upon marriage. Although we were not able to find official statistics on this Given the data we are using and the matching algorithm topic, according to several newspaper articles the share of we have developed, we are likely to have missed many true new couples who share a last name is around 85 to 90%. 
married couples, either among individuals who are in our Couples where one or both members are non-German are dataset (a form of type 2 error) or where at least one spouse the least likely to share a last name. K Identifying couples in administrative data 39 restriction does not exclude many true couples. There are more matched pairs in Fig. 1 where the man is around 25 years older than the woman, but Fig. 5 shows that that is ex- actly where the share of married/married is falling to zero, thus suggesting that here we have mainly pairs who are not matched to each other . Couples not living together on June 30th, 2008 are im- possible for us to identify with our data; however, we be- lieve that this situation is likely to be rare. If the couple lives at a location with more than 2 people with the same last name at the same coordinate, we have no way of knowing which two people are part of a couple, and so all are dropped (about 5.2 Mio.). We drop people who have inconsistent names across data sources, thus potentially omitting more couples from our Fig. 5 Share of matched pairs listed as married-married or married- sample (about 1.8 M). missing. (Note: Includes all male-female pairs of individuals who we We can get a sense of how representative our final sample were able to match by location and name (according to our name- matching algorithm), and where at least one member has the family of couples is by comparing their characteristics to those of status variable filled in. Age difference is calculated as man’s age – a truly representative sample of couples, those in the Mi- wife’s age) crocensus. Table 6 compares individual characteristics of people in our final sample of couples (column 3) to couples Couples where the age difference between the husband in the Microcensus in 2008. 
Column (6) shows all mar- and wife is more than 15 years are omitted from our sample ried couples in 2008, while column (7) shows all couples in an effort to ensure that we do not mistakenly include par- satisfying the restrictions of our algorithm in the IEB. In ent-child pairs in our sample. Although there are certainly terms of the age distribution, our men and women tend to married couples with a 15-year or larger age difference, the be a younger than those of all census couples; this can be number of these types of couples is quite small. For exam- explained by the fact that our sample only includes people ple, in the micro census, a representative survey of German in the workforce, so older workers who are more likely to households, the share of couples with a 16-year or more be retired are excluded. In addition, anyone married to a re- age difference was only 2% in 2008. tired person will be omitted from our final sample, since We also investigated the likely impact of our age restric- their spouse will not be in our original dataset. Comparing tion using the marital status variable available in the job the last column where we apply the same restrictions as in seeker data. For the subsample of couples where we have our matching algorithm, we find that the age distribution is the marital status for at least one of the two individuals, much closer to our matched couples. in Fig. 5 we plotted the share of couples where either both Looking next at the labor force status, we do not have were reported as married or one person was married and the the full range of labor force status options that are available other person’s marital status was missing. Matched couples in the micro census, since the IAB data only includes peo- where both are married seem to be very rare when the ple in the labor force but omits self-employed and public woman is older than 15 years than the man. This suggests servants. 
The couples in the last column of Table 6 look that there are almost no true couples that we are missing reasonably similar in terms of labor force status as our with the 15 years age difference restriction. On the other matched couples sample, although they are somewhat less end there is still a high share of couples where the man is likely to be unemployed. This might be because some long- around 15 to 20 years older than the woman where both are term unemployed who are in the IEB might be identified reported as married. If these are true couples, then we are as out of the labor force in the Microcensus, or because we excluding them from our set of likely married couples. No- are somehow more likely to identify unemployed individ- tice however that while the share is significant, Fig. 1 shows that there are almost no couples in the 15 to 20 years age This can also be seen from Table 5 if we look at the subsample of window (consistent with the information from the micro our matched couples with the family status variable available. Of the male-female pairs where both are listed as married, only 3% have an census), again suggesting that the 15 years age difference age difference of 15 years or more. Appendix Fig. 7 shows the same figure restricted to couples at lo- cations with exactly two individuals at the same coordinate. Consistent with out discussion in section 4.1, the match accuracy appears to be slightly higher at coordinates with just 2 individuals. K 40 D. Goldschmidt et al. 
Table 6 Comparing Individuals and Couples with Microcensus Individual level: Sample All individu- Final Matched Sample Microcensus Microcensus 2008 als 2008 restricted Restriction 2 People at >2 People at Coordinate Coordinate Number of individuals on 6.44 5.65 2 9.53 – – coordinate Age husband <35 0.33 0.12 0.07 0.17 0.07 0.11 ≥35 and <45 0.26 0.31 0.33 0.29 0.20 0.36 ≥45 and <65 0.39 0.54 0.57 0.51 0.43 0.53 ≥65 0.03 0.03 0.03 0.03 0.30 0.01 Age wife <35 0.31 0.17 0.11 0.23 0.11 0.17 ≥35 and <45 0.25 0.34 0.39 0.29 0.21 0.40 ≥45 and <65 0.42 0.47 0.48 0.47 0.41 0.43 ≥65 0.02 0.01 0.01 0.02 0.27 0.00 Labor Force Status Employee 0.84 0.88 0.93 0.83 0.60 0.93 Unemployed 0.13 0.10 0.06 0.13 0.03 0.07 Education Secondary/intermediate 0.78 0.82 0.81 0.83 0.69 0.72 school leaving certificate Upper secondary school 0.21 0.18 0.20 0.17 0.26 0.24 leaving Living in East Germany 0.15 0.17 0.16 0.17 0.20 0.21 Number of individuals 33,050,419 6,563,314 3,384,124 3,179,190 38,374,000 18,454,000 Couple Level: Restriction All matches –– – – – male/female Age difference no age difference 0.08 0.10 0.11 0.10 0.10 0.10 ≥1 and <4 0.41 0.51 0.52 0.49 0.47 0.51 ≥4 and <7 0.20 0.25 0.25 0.25 0.24 0.25 ≥7 and <11 0.09 0.11 0.10 0.12 0.11 0.11 ≥11 and <16 0.03 0.03 0.03 0.04 0.04 0.03 ≥16 0.19 0.00 0.00 0.00 0.05 – Nationality Both German 0.90 0.90 0.96 0.83 0.87 0.88 One German 0.06 0.07 0.03 0.10 0.08 0.06 Both non-German 0.04 0.04 0.01 0.06 0.05 0.06 Number of couples 4,084,516 3,281,657 1,692,062 1,589,595 19,187,000 9,227,000 – – – – – (SUF: n = (SUF: n = 109,073) 226,787) The table compares mean characteristics of the overall population of individuals in the IEB data in 2008 (Column 1), with the uniquely matched couples (Column 2–4) data and couples from the Microcensus 2008. 
Column 5 corresponds to all married couples in the Microcensus and column 6 to married couples that satisfy the same restrictions that we impose in our matching algorithm: husband and wife live together and are in social security covered job or unemployed and the age difference is less than 15 years K Identifying couples in administrative data 41 to a single last name, relating the frequency of that name in the overall population (on the x-axis) with the frequency of that name among matched couples (y-axis). The black line represents the 45 degree line. Amazingly almost all names are very close to the 45 degree line, suggesting that neither very rare nor very common last names are more or less likely to be matched. Again, while we are clearly not obtaining a representative sample, it is interesting that we do not seem to be biased against particularly common or rare last names. 5 Discussion and conclusion We present a new method for identifying a very large num- Fig. 6 Frequency of Surnames in Population and Among Matched Couples. (Note: Each dot in the figure represents one unique last name ber of pairs of individuals who are likely married to each in our data (such as Meier or Mueller). The figure shows the frequency other in the German administrative data. While room for of each last name in the overall population (x-axis) and among the sam- type 1 (false positives) and type 2 (false negatives) errors ple of matched couples (y-axis). The black line represents the 45 degree exists, our analysis suggests that our final sample still con- line, suggesting that most names are equally common in the overall population and among matched couples) tains about 89 to 94% actually married couples. An im- portant caveat is that due to the nature of the IEB, our uals as part of couples in the IEB. 
Interestingly, when we sample of married couples is not representative of all mar- restrict the matched couples data to a sample with exactly ried couples, but at best representative of couples where 2 people at a location (Column 4) the distribution is much both individuals are either working in a job that is cov- closer to the census. ered by social security (that is not civil service job or self In the bottom half of Table 6 we can compare the char- employed) or are unemployed and receiving benefits. Our acteristics of couples in the two different data sets. The comparison with the Microcensus from our baseline year distribution of age difference within couples of our final suggests that our matched couples look reasonably similar sample (column 3) is almost exactly the same as that of the to couples in this more restrictive sample frame, but even Microcensus when using the same restrictions as in our al- then we are more likely to pick up married couples who live gorithm (column 7). The couples in our sample are slightly in smaller buildings, such as single family homes, and thus more likely to be both German and less likely to be both probably couples who are either living in less densely pop- non-German than those of the micro census; as mentioned ulated areas or with higher income levels. Finally, since we earlier, non-Germans are less likely to change their name rely on last names our sample will miss all couples where at marriage than Germans are, and so are more likely to be the spouses do not share a name and this decision is likely omitted by our matching algorithm. Overall, although we correlated with other characteristics of the couple. 
miss many couples in our data set and may mistakenly in- While the representativeness of this matched couple data clude some pairs who are not truly married, the couples that is therefore clearly limited, many research questions do not we identify seem roughly similar to the universe of couples rely on a representative sample. Most natural experiments in Germany that satisfy the restrictions that are imposed in that have been used by applied researchers only affect a very the matching algorithm. selected subsample of the population (e. g. typical regres- Finally, we performed an additional check to see whether sion discontinuity or regression kink designs), but obtaining our algorithm is more likely to pick up very rare or very causally interpretable parameters with a high degree of in- common last names by comparing the distribution of last ternal validity is still very valuable even if it cannot easily names in the overall population with the distribution of last be extrapolated to the general population. names among matched couples. On the one hand, we might Overall, the method appears accurate enough to open the be more likely to find unique matches in the case of rare door for future research projects analyzing research ques- last names, in which case rare last names would be more tions in labor and public economics that rely on house- common in our matched couples data than in the overall hold (couple) identifiers using administrative data. We are population. On the other hand we might be more likely to working on making these identifiers available to external obtain false positives in the case of common last names, in researchers through the existing IAB research data infra- which case those would be overrepresented in our matched structure. We can readily imagine a wide number of pos- data. Fig. 6 shows a scatterplot, where each dot corresponds sible applications. For example, a long literature has stud- K 42 D. Goldschmidt et al. 
ied the added worker effect, which is whether spouses of our new identifier could be used is to study relative in- displaced workers respond to the job loss by increasing comes within married couples as for example in Bertrand their own labor supply (see for example Lundberg 1985, et al. (2015). Other areas where important work has been or Stephen 2002). Most existing work in this literature has done with the IAB data that could be extended using our relied on panel survey datasets such as the PSID or GSOEP. couple identifiers include for example the labor supply and Using our identifier, it will be possible to study the added mobility responses to immigration shocks (Dustmann et al. worker effect for a much larger sample of workers after 2016), or the effects of maternity leave policies on labor a variety of well identified shocks such as plant closings supply (Schönberg and Ludsteck 2014). or mass layoffs. Another promising area of research is to We believe that providing access to a new way to study study spillover effects of public programs. For example, household decisions and responses in administrative data Cullen and Gruber (2000) provide fascinating evidence that will inspire the research community to many new and cre- more generous unemployment insurance benefits reduce la- ative research projects. bor supply of spouses married to the benefit recipient. A lot Open Access This article is distributed under the terms of the of recent work on UI has been done with the German admin- Creative Commons Attribution 4.0 International License (http:// creativecommons.org/licenses/by/4.0/), which permits unrestricted istrative data (e. g. Schmieder et al. 2012, 2016) exploiting use, distribution, and reproduction in any medium, provided you give the large number of observations and clean sources of iden- appropriate credit to the original author(s) and the source, provide a tification such as age discontinuities in potential duration. 
link to the Creative Commons license, and indicate if changes were With the possibility to link married couples it will be pos- made. sible to use similar research designs to look at questions as in Cullen and Gruber (2000) to understand how households as a whole are affected by policies such as UI, active la- Appendix bor market policies or tax policies. Another example where Table 7 Family Status Composition, for matched couples living at coordinates with exactly 2 people total Different sex Same sex Age diff <15 Age diff ≥15 Age diff <15 Age diff ≥15 Combinations Absolute Percent Absolute Percent Absolute Percent Absolute Percent (%) (%) (%) (%) Alone-alone 1228 0.54 1504 2.08 2385 14.38 1278 2.01 Alone-missing 9956 4.42 30,437 41.99 10,765 64.92 27,634 43.44 Alone-cohabit 412 0.18 624 0.86 338 2.04 542 0.85 Alone-single 293 0.13 1922 2.65 95 0.57 1864 2.93 parent Alone-married 1170 0.52 4106 5.67 280 1.69 3840 6.04 Cohabit-cohabit 431 0.19 226 0.31 132 0.80 213 0.33 Cohabit-miss- 1742 0.77 3622 5.00 903 5.45 3537 5.56 ing Cohabit-single 98 0.04 1103 1.52 13 0.08 1136 1.79 parent Cohabit-mar- 915 0.41 1118 1.54 39 0.24 1122 1.76 ried Single parent- 22 0.01 8 0.01 15 0.09 40 0.06 single parent Single parent- 1595 0.71 4420 6.10 339 2.04 4154 6.53 missing Single parent- 357 0.16 212 0.29 11 0.07 223 0.35 married Married-mar- 47,922 21.25 1404 1.94 77 0.46 211 0.33 ried Married-miss- 159,344 70.67 21,774 30.04 1190 7.18 17,816 28.01 ing Both missing 1,466,577 – 326,445 – 67,477 – 302,644 – Total 1,692,062 – 398,925 – 84,059 – 366,254 – Includes all couples who we were able to match by location and name (according to our name-matching algorithm), restricted to couples living at coordinates where no other people are listed. 
Only individuals who are registered as job-seekers have the family status variable filled in.

Fig. 7 Share of matched pairs listed as married-married or married-missing; 2 people at a coordinate. (Note: Includes all male-female pairs of individuals who we were able to match by location and name (according to our name-matching algorithm), and where at least one member has the family status variable filled in. Restricted to couples living at coordinates where exactly 2 people are located. Age difference is calculated as man's age – wife's age)

References

All-in: Immer mehr behalten ihren Geburtsnamen – Zahl der Ehen mit Doppelnamen bleibt seit Jahren gleich (2006). http://www.all-in.de/nachrichten/lokales/Immer-mehr-behalten-ihren-Geburtsnamen;art26090,215128, Accessed September 1, 2014

Bertrand, M., Kamenica, E., Pan, J.: Gender identity and relative income within households. Q J Econ 130(2), 571–614 (2015)

Boehle, M., Schimpl-Neimanns, B.: GESIS – Leibniz-Institut für Sozialwissenschaften (Ed.): Mikrozensus Scientific Use File 2008: Dokumentation und Datenaufbereitung. Bonn (GESIS-Technical Reports 2010/13) (2010). http://nbn-resolving.de/urn:nbn:de:0168-ssoar-207237

Cullen, J.B., Gruber, J.: Does unemployment insurance crowd out spousal labor supply? J Labor Econ 18(3), 546–572 (2000)

Dustmann, C., Schönberg, U., Stuhler, J.: Labor supply shocks, native wages, and the adjustment of local employment. Quarterly Journal of Economics 132 (1), 435–483 (2017)

Frimmel, W., Halla, M., Winter-Ebmer, R.: Can pro-marriage policies work? An analysis of marginal marriages. Demography 51(4), 1357–1379 (2014)

Goldschmidt, D., Schmieder, J.F.: The rise of domestic outsourcing and the evolution of the German wage structure, Quarterly Journal of Economics (forthcoming)

Hardoy, I., Schøne, P.: Displacement and household adaptation: Insured by the spouse or the state? J Popul Econ 27(3), 683–703 (2014)

Hethey-Maier, T., Schmieder, J.F.: Does the use of worker flows improve the analysis of establishment turnover? Evidence from German administrative data. J Appl Soc Sci Stud – Schmollers Jahrb 2013 133(4), 477–510 (2013)

Huttunen, K., Kellokumpu, J.: The effect of job displacement on couples' fertility decisions. J Labor Econ 34(2), 403–442 (2016)

Jacobson, L.S., LaLonde, R.J., Sullivan, D.G.: Earnings losses of displaced workers. Am Econ Rev 83, No. 4, 685–709 (1993)

Janisch, W.: Namenswahl nach der Heirat: Bekenntnis zum Mann (2010). http://www.sueddeutsche.de/leben/namenswahl-nach-der-heirat-bekenntnis-zum-mann-1.79245, Accessed August 1,

Kammergericht Berlin: Eheregistereintragung: Schreibweise von Ehenamen und Begleitnamen (2013). http://www.gerichtsentscheidungen.berlin-brandenburg.de/jportal/?quelle=jlink&docid=KORE209412013&psml=sammlung.psml&max=true&bs=10, Accessed September 1, 2014

Lundberg, S.: The added worker effect. J Labor Econ 3, 11–37 (1985)

Schild, C.-J., Antoni, M.: Linking survey data with administrative social security data – the project "Interactions between capabilities in work and private life". Working paper series, vol. 2014–02. German Record-Linkage Center, Nürnberg, p 11 (2014)

Schmieder, J.F., von Wachter, T., Bender, S.: The effects of extended unemployment insurance over the business cycle: Evidence from regression discontinuity estimates over 20 years. Q J Econ 127(2), 701–752 (2012)

Schmieder, J.F., von Wachter, T., Bender, S.: The effect of unemployment benefits and nonemployment durations on wages. Am Econ Rev 106(3), 739–777 (2016)

Scholz, T., Rauscher, C., Reiher, J., Bachteler, T.: Geocoding of German administrative data. FDZ-Methodenreport, vol. 2012–09. Institute for Employment Research, Nürnberg (2012)

Schönberg, U., Ludsteck, J.: Expansions in maternity leave coverage and mothers' labor market outcomes after childbirth. J Labor Econ 32(3), 469–505 (2014)

Sperling, F.: Familiennamensrecht in Deutschland und Frankreich: eine Untersuchung der Rechtslage sowie namensrechtlicher Konflikte in grenzüberschreitenden Sachverhalten. Mohr Siebeck, Tübingen, p 226 (2012)

Statistisches Bundesamt: Bevölkerung und Erwerbstätigkeit – Haushalte und Familien Ergebnisse des Mikrozensus, 1st edn. 3. Statistisches Bundesamt, Wiesbaden (2012)

Stephens Jr, M.: Worker displacement and the added worker effect. J Labor Econ 20(3), 504–537 (2002)

Deborah Goldschmidt Associate, Analysis Group

Wolfram Klosterhuber Research Fellow, Institute for Employment Research (IAB)

Johannes F Schmieder Assistant Professor, Department of Economics, Boston University; Peter Paul Career Development Professor, Assistant Professor
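As a worked check on the accuracy-rate arithmetic of Sect. 4.1, the following minimal Python sketch recomputes the four rates from the pair counts printed in Table 5, Table 7 and the footnote on the conservative variant. The function and variable names are hypothetical and chosen only for illustration; this is not the authors' code, and it uses the published counts only, not the underlying IEB data.

```python
def proxy_accuracy(n_opposite_sex: int, n_assumed_false: int) -> float:
    """Same-sex proxy: [N(f/m) - N(f/f) - N(m/m)] / N(f/m)."""
    return (n_opposite_sex - n_assumed_false) / n_opposite_sex


def family_status_accuracy(married_opp: int, married_same: int,
                           total_opp: int, both_missing_opp: int) -> float:
    """Accuracy within the subsample with at least one family status listed.

    married_opp / married_same: pairs with age difference < 15 listed as
    married-married or married-missing (opposite-sex / same-sex).
    """
    with_status = total_opp - both_missing_opp
    return (married_opp - married_same) / with_status


# Counts copied from the article (age difference < 15 years).
N_FINAL = 3_281_657      # opposite-sex matched pairs (Table 5, Total)
N_SAME_SEX = 185_313     # same-sex matched pairs (Table 5, Total)
N_MALE_MALE = 131_550    # male/male pairs (conservative-variant footnote)

baseline = proxy_accuracy(N_FINAL, N_SAME_SEX)
conservative = proxy_accuracy(N_FINAL, 2 * N_MALE_MALE)

# Family-status check on all coordinates (Table 5).
all_coords = family_status_accuracy(
    married_opp=229_279 + 348_809,
    married_same=637 + 4_536,
    total_opp=3_281_657,
    both_missing_opp=2_632_014,
)

# Same check restricted to coordinates with exactly 2 people (Table 7).
two_person = family_status_accuracy(
    married_opp=47_922 + 159_344,
    married_same=77 + 1_190,
    total_opp=1_692_062,
    both_missing_opp=1_466_577,
)

if __name__ == "__main__":
    print(f"same-sex proxy accuracy:        {baseline:.1%}")
    print(f"conservative (2 x male/male):   {conservative:.1%}")
    print(f"family-status accuracy:         {all_coords:.1%}")
    print(f"family-status error, 2 persons: {1 - two_person:.1%}")
```

The two approaches bracket the false-positive share reported in the text: roughly 6% under the same-sex proxy and 11.8% under the family-status check, with the two-person restriction falling in between.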

Journal

Journal for Labour Market Research, Springer Journals

Published: May 15, 2017
