[Figure omitted. (c) Larry Leemis (2008) 7]

Data Scientist: Person who is worse at statistics than any statistician & worse at software engineering than any software engineer.

~ Will Cukierski

Inspired by this long list of things that separate a fresh programmer from an actual engineer. 2 The following is also very long, but aims to enlighten, not to chastise.

### What is data science?

It's a trendy way of saying 'statistical programming'.

OK, that's a bit of a stretch: data science is statistical programming that focusses on newly feasible methods for getting answers out of annoyingly large piles of data. (So SAS work supposedly doesn't count.) 8

### Is it a bullshit fad?

Not exactly. All, and I mean all, of its insights come straight out of academic statistics, computer science, and dropout hacker lore. But just because something's made of other things doesn't mean it isn't real, that it isn't a valuable permutation.

(I grant you that only half of the excitement is due to these methods, the other half being a truckful of marketing hype.)

"Data science" is a silly name for a few reasons: because all statisticians apply scientific methods to data, and have done for a hundred years; because all programming is programming on data; because the actual job has much more data munging and rote scripting in it than it has experimental science.

### Distinguishing analysts and data scientists

Data analysis is an art, in Knuth's sense:

Science is everything we understand well enough to explain to a computer. Art is everything else we do.

and thus so is data science. 1 But it isn't cool to call yourself a data analyst, for some reason. 4

Here's an attempt: you want a data scientist, and not the similar data analyst, when:
• the data is too big for Excel;
• and then, too big for SAS;
• the data is in the wrong format for the tool;
• the data's quality is uncertain;
• the data has no natural outcome variable.

### Why now?

1. The underlying cause of most of it is the data deluge - we collect millions of times more data, basically for free. This deluge leads to all the distinctive features:
• Distributed computing, because the dataset doesn't fit in one computer's RAM.

• Sudden great performance, because it turns out that scaling your sample size gives a qualitative change for some domains (“Unreasonable Effectiveness of Data”). Same algorithms as the 60s sometimes!

### Don't know higher maths, don't know code

Hell mode.

If you're like me, you'll need qualitative meaning first, to get you fired up: Gleick's The Information and Nilsson's Quest for Artificial Intelligence, Hofstadter's Gödel, Escher, Bach, Floridi's Blackwell Guide to the Philosophy of Computation, Howson's Scientific Reasoning, O'Neil's Weapons of Math Destruction, Domingos' Master Algorithm. Then the real work begins:

• Start with Python: Zed Shaw's Learn Python the Hard Way will get you moving, if you do as he tells you. Learn Python 3; you probably won't have much Python 2 legacy code to handle.
• You must read other people's code to improve. Try a gamified site like CodeWars, which shows you elegant solutions only after you sweat out a dreadful hack.
• This is the nicest series of maths videos I've ever seen. Constant clicking of concepts into place.
• Once you are comfortable with basic abstraction, work through Downey's Think Stats.
• I enjoyed Blais' Statistical Inference for Everyone too.
Then read everything in the other sections below. (:

### Don't know higher maths, can code

Danger zone.

### Know higher maths, can't code

A reach ten times your grasp.

### Know higher maths, can code

Well, what are you waiting for?

Full list here. All books: I don't care for MOOCs or Youtube series. You can ignore all Medium posts on the matter (or any matter), with the exception of @akelleh, which you must read.

### Glossary

• algorithm: a recipe for computing something. Programs are instances of an algorithm, like you're an instance of the human genome.

When people talk about "ML algorithms" they tend to mean learners. 404

• machine learning: "LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION." 405

• learner: an inductive program that takes in examples and outputs a model.
• model: a program that predicts outputs from examples. (see also classifier, regressor).
• data scientist: an inductive program that takes a vaguely posed problem, plus a dataset, and outputs a learner.
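The learner/model distinction is worth making concrete. Here's a minimal sketch in plain Python (the names are mine, not any real library's API): the learner is a program that ingests examples and returns a model, and the model is itself just a program from inputs to predictions.

```python
# A learner: takes examples (input, output pairs), returns a model.
# This toy learner "fits" a one-parameter model: the mean of the outputs.
def mean_learner(examples):
    ys = [y for _, y in examples]
    mean = sum(ys) / len(ys)

    # The model: a program that predicts an output for any input.
    def model(x):
        return mean

    return model

training_set = [(1, 2.0), (2, 4.0), (3, 6.0)]
model = mean_learner(training_set)
print(model(10))  # always predicts the training mean, 4.0
```

Libraries like scikit-learn follow the same shape, just spelled `estimator.fit(X, y)` for the learner step and `estimator.predict(X)` for the model step.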

• examples: rows. (AKA record, case, instance, observation, data point, 'sample' 403.) In most use cases, examples are tuples of scalars.
• features: columns. (AKA attributes, explanatory variables, independent variables, predictors, manipulated variables, inputs, exposure, risk factors, regressors, variates, covariates.)
• output: (AKA response variable, dependent variable, predicted, explained, experimental, outcome, regressand, label.)
• index: row names
• schema or header: column names
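In code terms, the rows/columns vocabulary above looks like this (a plain-Python sketch; the column names and values are invented):

```python
# A dataset: examples are rows, features are columns.
header = ["age", "income"]   # schema: the column (feature) names
index = ["alice", "bob"]     # index: the row names
rows = [                     # each example is a tuple of scalars
    (34, 55_000),
    (29, 48_000),
]
outputs = [1, 0]             # the output / label column, kept separate
```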

• training set: the examples given to the learner
• test set: the examples given to the model
• contamination: using test data at any stage of building the model
• dimensionality: the number of features you give to the learner.

• classifier: a model that takes in numbers and gives you a type in response.
• regressor: a model that takes in numbers and gives you a number in response.

• overfitting: hallucinating a classifier; putting too much trust in flukes or mistakes in the training set. (encoding the noise)

• bias: underfitting. a learner’s tendency to consistently learn the wrong thing.
• variance: overfitting. a learner’s tendency to learn random things and not the real signal.
• bias–variance tradeoff: you can't have both a function that generalises and a function that captures all the signal of the training set. Simple functions mean bias, complex functions mean variance.
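A toy illustration of the two failure modes (plain Python; both learners are mine, chosen to be extreme): a memorising learner has maximal variance (perfect on training data, useless on new points), while a constant learner has maximal bias.

```python
def memoriser(examples):           # high variance: encodes the noise
    table = dict(examples)
    return lambda x: table.get(x, 0.0)

def constant_learner(examples):    # high bias: ignores the signal
    mean = sum(y for _, y in examples) / len(examples)
    return lambda x: mean

train = [(1, 1.1), (2, 1.9), (3, 3.2)]   # noisy y ≈ x
test = [(4, 4.0), (5, 5.1)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, learner in [("memoriser", memoriser), ("constant", constant_learner)]:
    m = learner(train)
    print(name, mse(m, train), mse(m, test))
# The memoriser scores a perfect 0.0 on the training set but fails badly
# on the test set; the constant learner is mediocre on both.
```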

• hypothesis space: the set of classifiers that a learner can learn. Determined by representation choice (and by the optimisation function, if the evaluation function has multiple optimal choices).

• theory: a machine that answers a class of questions. 3

if praise is given to those who have determined the number of Platonic solids... how much better will it be to bring under mathematical laws human reasoning, the most excellent and useful thing we have?
- Gottfried Wilhelm Leibniz

The learning process may be regarded as a search for a form of behaviour which will satisfy the teacher.
- Alan Turing

1. Though maybe it won't remain an art for long.
2. That Programmer Competency Matrix is aggressively written: it is exasperated with poor job applicants. But when I was a clueless beginner, I found it helpful for someone to just tell me what exactly I was missing, even if that was a long list and 10 years' work to go.
3. This quotation isn't very relevant, I just think it's bad ass. It is from Judea Pearl's acceptance speech for the Lakatos Award.

4. It's easy to say the reason is just pay, but that's circular: data scientists are paid more in prospect, because of the buzz around them, before they get anything done in a given organisation.

I suppose the real productivity gains of e.g. Hammerbacher and LeCun could have caused the hype...
5. "70%" :

Estimate quality: OK, a Spiegelhalter (2*).
Sample size: 250. Not randomised.
Source: O'Reilly.
Importance to argument: High.

6. The other very distinct job is of course "data engineer". But those are already distinguished properly in the job market, for the obvious reason that your cluster will crash and burn if you hire the wrong one of them.
7. Man this is a great site.
8. New methods? Newer than what?

Newer than pre-computer, tiny-sample statistical inference theory. You can get a sense of how new ML is from Bishop (2006). That book mentions only very important papers, and summarising its bibliography gives a median year of 1995 and a mode year of 1999. So even ignoring the disproportionately important last ten years, it is a very young field.
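Summaries like that are easy to recompute; a sketch with Python's statistics module (the year list here is invented for illustration, not Bishop's actual bibliography):

```python
import statistics

# Hypothetical publication years, standing in for a bibliography.
years = [1986, 1991, 1995, 1995, 1999, 1999, 1999]

print(statistics.median(years))  # 1995: the middle year
print(statistics.mode(years))    # 1999: the most common year
```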
9. The hyperparameters of a data scientist are mathematical rigour, coding ability, misanthropy and pay.
10. Kaggle is just feature play and modelling. (So, if this Process defines the job 'data scientist' well, then Kaggle is just a ML platform, not a data science platform, as labelled.)

11. Also called statistical error or random error, but those names are vague to me. I suppose they get called statistical errors because you can use pure stats to detect and control for them, unlike systematic error.
12. Model error and parameter error seem like they could be unified, if a model error is just a sufficiently large (number of) parameter error(s). Making bad assumptions about e.g. auto-correlation between your features seems like just parameter error, while using the wrong distribution is model error.

This hinges on whether models are just big collections of parameters, I guess.
13. which billions do not even count the main value Linux creates, externalities: induced demand for other software, and coerced quality from MS and Apple.
14. Writing analysis scripts is much less heavy than real development, where it'd be absurdly expensive to 'roll your own' very often.
15. Following recent convention, she calls the systems 'algorithms'. But it isn't the abstract program, however inaccurate or static, that does the harm; it's the credulous lack of validation of the program. They only do harm when they are allowed to make or guide decisions.