Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data [Paperback]

4.00/5 (17 ratings by Goodreads)
  • Format: Paperback / softback, 175 pages, height x width: 232x178 mm
  • Publication date: 02-Jun-2020
  • Publisher: O'Reilly Media
  • ISBN-10: 1492072745
  • ISBN-13: 9781492072744
  • Paperback
  • Price: 60,87 €*
  • * This is the final price, i.e., no additional discounts are applied
  • Standard price: 71,61 €
  • Save 15%
  • Delivery time is 3-4 weeks if the book is in stock at the publisher's warehouse. If the publisher needs to print a new run, delivery may be delayed.
  • Delivery time: 4-6 weeks

Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data (fake data generated from real data) so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.

Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:

  • Steps for generating synthetic data using multivariate normal distributions
  • Methods for distribution fitting covering different goodness-of-fit metrics
  • How to replicate the simple structure of original data
  • An approach for modeling data structure to consider complex relationships
  • Multiple approaches and metrics you can use to assess data utility
  • How analysis performed on real data can be replicated with synthetic data
  • Privacy implications of synthetic data and methods to assess identity disclosure

Table of contents:

Preface

1 Introducing Synthetic Data Generation
  • Defining Synthetic Data
  • Synthesis from Real Data
  • Synthesis Without Real Data
  • Synthesis and Utility
  • The Benefits of Synthetic Data
  • Efficient Access to Data
  • Enabling Better Analytics
  • Synthetic Data as a Proxy
  • Learning to Trust Synthetic Data
  • Synthetic Data Case Studies
  • Manufacturing and Distribution
  • Healthcare
  • Financial Services
  • Transportation
  • Summary

2 Implementing Data Synthesis
  • When to Synthesize
  • Identifiability Spectrum
  • Trade-Offs in Selecting PETs to Enable Data Access
  • Decision Criteria
  • PETs Considered
  • Decision Framework
  • Examples of Applying the Decision Framework
  • Data Synthesis Projects
  • Data Synthesis Steps
  • Data Preparation
  • The Data Synthesis Pipeline
  • Synthesis Program Management
  • Summary

3 Getting Started: Distribution Fitting
  • Framing Data
  • How Data Is Distributed
  • Fitting Distributions to Real Data
  • Generating Synthetic Data from a Distribution
  • Measuring How Well Synthetic Data Fits a Distribution
  • The Overfitting Dilemma
  • A Little Light Weeding
  • Summary

4 Evaluating Synthetic Data Utility
  • Synthetic Data Utility Framework: Replication of Analysis
  • Synthetic Data Utility Framework: Utility Metrics
  • Comparing Univariate Distributions
  • Comparing Bivariate Statistics
  • Comparing Multivariate Prediction Models
  • Distinguishability
  • Summary

5 Methods for Synthesizing Data
  • Generating Synthetic Data from Theory
  • Sampling from a Multivariate Normal Distribution
  • Inducing Correlations with Specified Marginal Distributions
  • Copulas with Known Marginal Distributions
  • Generating Realistic Synthetic Data
  • Fitting Real Data to Known Distributions
  • Using Machine Learning to Fit the Distributions
  • Hybrid Synthetic Data
  • Machine Learning Methods
  • Deep Learning Methods
  • Synthesizing Sequences
  • Summary

6 Identity Disclosure in Synthetic Data
  • Types of Disclosure
  • Identity Disclosure
  • Learning Something New
  • Attribute Disclosure
  • Inferential Disclosure
  • Meaningful Identity Disclosure
  • Defining Information Gain
  • Bringing It All Together
  • Unique Matches
  • How Privacy Law Impacts the Creation and Use of Synthetic Data
  • Issues Under the GDPR
  • Issues Under the CCPA
  • Issues Under HIPAA
  • Article 29 Working Party Opinion
  • Summary

7 Practical Data Synthesis
  • Managing Data Complexity
  • For Every Pre-Processing Step There Is a Post-Processing Step
  • Field Types
  • The Need for Rules
  • Not All Fields Have to Be Synthesized
  • Synthesizing Dates
  • Synthesizing Geography
  • Lookup Fields and Tables
  • Missing Data and Other Data Characteristics
  • Partial Synthesis
  • Organizing Data Synthesis
  • Computing Capacity
  • A Toolbox of Techniques
  • Synthesizing Cohorts Versus Full Datasets
  • Continuous Data Feeds
  • Privacy Assurance as Certification
  • Performing Validation Studies to Get Buy-In
  • Motivated Intruder Tests
  • Who Owns Synthetic Data?
  • Conclusions

Index

Dr. Khaled El Emam is a senior scientist at the Children's Hospital of Eastern Ontario (CHEO) Research Institute and Director of the multi-disciplinary Electronic Health Information Laboratory.

Lucy Mosquera has a bachelor's degree in Biology and Mathematics from Queen's University and is a current graduate student in the department of statistics at the University of British Columbia. During her time at Queen's, Lucy provided data management support on a dozen clinical trials and observational studies run through Kingston General Hospital's Clinical Evaluation Research Unit. Lucy has also worked on clinical trial data sharing methods based on homomorphic encryption and secret sharing protocols. At Replica Analytics, Lucy is responsible for developing statistical and machine learning models for data generation, for integrating subject area expertise in clinical trial data into synthetic data generation methods, and for the statistical assessment of the company's synthetic data generation.

Dr. Richard Hoptroff is a long-term technology inventor, investor and entrepreneur.