DataFrames in Julia#

Let’s continue our exploration of Julia by looking at one of the most popular packages in the ecosystem, DataFrames.

DataFrames are a powerful and convenient way to work with tabular data. They are very popular in the R and Python ecosystems and Julia can speak DataFrames too.

DataFrames Package#

By now installing a new package in Julia should be something you know how to do, but for reference:

   JuliaHEP-2023 git:(main) julia --project=.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.3 (2023-08-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

## At the julia> prompt type `]`

(JuliaHEP-2023) pkg> add DataFrames
...

Now we can see how to construct a simple data frame

using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
[2731] signal (2): Interrupt
in expression starting at none:0
.plt at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm23ReplaceableMetadataImpl11getOrCreateERNS_8MetadataE at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16MetadataTracking5trackEPvRNS_8MetadataENS_12PointerUnionIJPNS_15MetadataAsValueEPS2_EEE at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm6MDNodeC1ERNS_11LLVMContextEjNS_8Metadata11StorageTypeENS_8ArrayRefIPS3_EES7_ at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm12DISubprogram7getImplERNS_11LLVMContextEPNS_8MetadataEPNS_8MDStringES6_S4_jS4_jS4_jiNS_6DINode7DIFlagsENS0_9DISPFlagsES4_S4_S4_S4_S4_S4_NS3_11StorageTypeEb at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm9DIBuilder14createFunctionEPNS_7DIScopeENS_9StringRefES3_PNS_6DIFileEjPNS_16DISubroutineTypeEjNS_6DINode7DIFlagsENS_12DISubprogram9DISPFlagsENS_24MDTupleTypedArrayWrapperINS_19DITemplateParameterEEEPSA_NSC_INS_6DITypeEEENSC_IS8_EE at /opt/hostedtoolcache/julia/1.9.3/x64/bin/../lib/julia/libLLVM-14jl.so (unknown line)
emit_function at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/codegen.cpp:7744
jl_emit_code at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/codegen.cpp:8478
jl_create_native_impl at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/aotcompile.cpp:344
jl_precompile_ at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/precompile_utils.c:258
jl_precompile_worklist at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/precompile_utils.c:308 [inlined]
ijl_create_system_image at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/staticdata.c:2642
ijl_write_compiler_output at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/precompile.c:117
ijl_atexit_hook at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/init.c:258
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/cli/loader_exe.c:59
unknown function (ip: 0x7fd5aea29d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))
Allocations: 50958921 (Pool: 50921991; Big: 36930); GC: 73
Failed to precompile DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0] to "/home/runner/.julia/compiled/v1.9/DataFrames/jl_EhhtTm".

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
   @ Base ./loading.jl:2300
 [3] compilecache
   @ ./loading.jl:2167 [inlined]
 [4] _require(pkg::Base.PkgId, env::String)
   @ Base ./loading.jl:1805
 [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
   @ Base ./loading.jl:1660
 [6] macro expansion
   @ ./loading.jl:1648 [inlined]
 [7] macro expansion
   @ ./lock.jl:267 [inlined]
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1611

Creating DataFrames#

function leptons()
    name = ["electron", "muon", "tau"]
    symbol = ["e", "μ", "τ"]
    mass = [0.5109989, 105.657, 1776.86]
    charge = -1.0
    DataFrame(; name, symbol, mass, charge)
end
leptons (generic function with 1 method)
df_leptons = leptons()
3×4 DataFrame
Rownamesymbolmasscharge
StringStringFloat64Float64
1electrone0.510999-1.0
2muonμ105.657-1.0
3tauτ1776.86-1.0

As you can see the DataFrame is constructed with a bunch of vectors, of different types (but the same length!), and also using a scalar, when every row has the same value.

There are different ways to construct DataFrames from dictionaries, named tuples, matrices and so on, as well as using alternative column names - read the docs for all the ways!

Loading Data from CSV#

Loading data from a CSV file is extremely common. You will need to use the CSV package.

using CSV

Just before we do that, as this table is pretty big, let’s set some more appropriate display parameters for the tutorial

ENV["DATAFRAMES_ROWS"] = 20;

And we load the table.

N.B. If you have this tutorial checked out from GitHub, or running in Binder, then the relative path below is correct, viz., in the ./assets directory; you can also download the data directly.

higgs_ml = CSV.read(joinpath("assets", "atlas-higgs-challenge-2014-v2-reduced.csv"), DataFrame)
50000×19 DataFrame
49980 rows omitted
RowEventIdPRI_tau_ptPRI_tau_etaPRI_tau_phiPRI_lep_ptPRI_lep_etaPRI_lep_phiPRI_metPRI_met_phiPRI_met_sumetPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_ptLabel
Int64Float64Float64Float64Float64Float64Float64Float64Float64Float64Int64Float64Float64Float64Float64Float64Float64Float64String1
110000032.6381.0170.38151.6262.273-2.41416.824-0.277258.733267.4352.150.44446.0621.24-2.475113.497s
210000142.0142.039-3.01136.9180.5010.10344.704-1.916164.546146.2260.7251.158-999.0-999.0-999.046.226b
310000232.154-0.705-2.093121.409-0.9531.05254.283-2.186260.414144.2512.053-2.028-999.0-999.0-999.044.251b
410000322.647-1.6550.0153.321-0.522-3.131.0820.0686.0620-999.0-999.0-999.0-999.0-999.0-999.0-0.0b
510000428.209-2.197-2.23129.7740.7981.5692.723-0.87153.1310-999.0-999.0-999.0-999.0-999.0-999.00.0b
610000553.6510.3711.32931.565-0.8841.85740.7352.237282.849390.547-2.412-0.65356.1650.2243.106193.66b
710000628.851.1132.40997.240.675-0.96638.421-1.443294.0742123.010.8641.4556.8670.131-2.767179.877s
810000778.80.6541.54728.740.506-1.34722.275-1.761187.299130.638-0.715-1.724-999.0-999.0-999.030.638s
910000839.0082.433-2.53226.3250.211.88437.7910.024129.8040-999.0-999.0-999.0-999.0-999.0-999.00.0b
1010000954.646-1.5330.41632.742-0.317-0.636132.6780.845294.7411167.735-2.767-2.514-999.0-999.0-999.0167.735s
4999114999023.0890.32.1246.3560.158-2.38471.4782.98338.4912166.372-2.844-0.14856.3-1.5342.692222.672s
4999214999164.296-1.95-2.07863.568-2.053-0.62657.335-1.639355.6551140.7460.5931.712-999.0-999.0-999.0140.746s
4999314999220.4320.52.28244.285-1.5790.25151.998-2.795184.1460-999.0-999.0-999.0-999.0-999.0-999.00.0b
4999414999324.1081.616-2.42472.1081.407-0.29234.003-2.212216.834163.7052.8842.632-999.0-999.0-999.063.705s
4999514999426.691-1.067-2.87245.83-0.939-1.25629.4111.087177.0770-999.0-999.0-999.0-999.0-999.0-999.00.0b
4999614999573.1740.8191.3230.5810.952-0.63651.2070.12553.8573193.2970.14-1.79892.691-0.5261.653398.099b
4999714999630.498-0.341.37568.1360.8691.245134.87-0.186409.8192149.11-0.234-2.23558.1840.5942.188207.294b
4999814999738.094-0.936-3.10632.158-0.9480.15261.561-0.269348.6252122.3-1.4143.04368.4832.5180.046190.783s
4999914999829.2250.7460.52150.012.074-2.60924.5890.476116.5390-999.0-999.0-999.0-999.0-999.0-999.00.0s
5000014999930.6450.8592.11126.094-0.4840.45143.2011.002161.614149.353-2.641-2.038-999.0-999.0-999.049.353s

The second argument to CSV.read tells CSV to read the file into a DataFrame and is a nice illustration of how packages in Julia can remain independent, but still work together (DataFrames has really great integration with the Julia ecosystem).

N.B. this is a reduced version of the Higgs ML dataset, restricted to 50k events, where we have also stripped out columns of derived data and event weights. This is just to make this example more visually manageable in this notebook - everything which is done here works just fine on the full 818k events with all columns.

To get the names of all the columns in a data frame, use names():

println(join(names(higgs_ml), "\n"))
EventId
PRI_tau_pt
PRI_tau_eta
PRI_tau_phi
PRI_lep_pt
PRI_lep_eta
PRI_lep_phi
PRI_met
PRI_met_phi
PRI_met_sumet
PRI_jet_num
PRI_jet_leading_pt
PRI_jet_leading_eta
PRI_jet_leading_phi
PRI_jet_subleading_pt
PRI_jet_subleading_eta
PRI_jet_subleading_phi
PRI_jet_all_pt
Label

Also rather useful is the describe function:

describe(higgs_ml)
19×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1EventId1.25e51000001.25e51499990Int64
2PRI_tau_pt38.758420.031.848396.8750Float64
3PRI_tau_eta-0.0141847-2.494-0.022.4920Float64
4PRI_tau_phi-0.00554088-3.142-0.0243.1410Float64
5PRI_lep_pt46.623126.040.501423.4380Float64
6PRI_lep_eta-0.0234368-2.49-0.0532.490Float64
7PRI_lep_phi0.0426212-3.1420.0843.1420Float64
8PRI_met41.74480.234.90752842.620Float64
9PRI_met_phi-0.00511272-3.142-0.0363.1420Float64
10PRI_met_sumet209.70513.678179.3241512.240Float64
11PRI_jet_num0.9786801.030Int64
12PRI_jet_leading_pt-348.632-999.038.7805755.2350Float64
13PRI_jet_leading_eta-399.447-999.0-1.88054.4820Float64
14PRI_jet_leading_phi-399.442-999.0-2.0653.1410Float64
15PRI_jet_subleading_pt-692.799-999.0-999.0557.180Float64
16PRI_jet_subleading_eta-709.495-999.0-999.04.4960Float64
17PRI_jet_subleading_phi-709.492-999.0-999.03.1410Float64
18PRI_jet_all_pt72.9546-0.040.44651417.330Float64
19Labelbs0String1

Basic Operations#

Accessing Data#

The template for accessing data from a DataFrame is:

my_data[selected_rows, selected_columns]

There are a few different patterns for this, but the template is always the same.

Extracting data (without copying) works like this:

higgs_ml[!, :PRI_jet_num]
50000-element Vector{Int64}:
 2
 1
 1
 0
 0
 3
 2
 1
 0
 1
 ⋮
 1
 0
 1
 0
 3
 2
 2
 0
 1

This is the recommended way to do this, although higgs_ml.PRI_jet_num and higgs_ml[!, "PRI_jet_num"] will also work

higgs_ml[6, :PRI_jet_num]
3

If you modify the data accessed this way, then you are modifying the primary DataFrame.

This is why in the [] notation a ! is used - caveat emptor!

higgs_ml[6, :PRI_jet_num] = 666 # Completely bonkers!
666
higgs_ml[6, [:PRI_jet_num]]
DataFrameRow (1 columns)
RowPRI_jet_num
Int64
6666

When the column specifier is a scalar, one retrieves the actual value. When it’s a list of columns (even length 1) then a DataFrame object is returned.

higgs_ml.PRI_jet_num[6] = 3 # Restore sanity!
3

Copying Data#

If a : notation is used for the row selection, then a copy of the data is made:

mini_higgs = higgs_ml[1:5, [:EventId, :PRI_tau_pt]] # Select the given columns from rows 1 to 5
5×2 DataFrame
RowEventIdPRI_tau_pt
Int64Float64
110000032.638
210000142.014
310000232.154
410000322.647
510000428.209

To show this is a copy, let’s reset the values of \(\tau_{p_T}\):

mini_higgs[!, :PRI_tau_pt] = [1.2, 2.3, 3.4, 4.5, 5.6]
5-element Vector{Float64}:
 1.2
 2.3
 3.4
 4.5
 5.6
higgs_ml[!, :PRI_tau_pt][1:5]
5-element Vector{Float64}:
 32.638
 42.014
 32.154
 22.647
 28.209

The primary data stayed unmodified.

One can use an appropriate row vector to set any row in the data frame:

mini_higgs[3, 1:2] = [666, 999.0]
mini_higgs
5×2 DataFrame
RowEventIdPRI_tau_pt
Int64Float64
11000001.2
21000012.3
3666999.0
41000034.5
51000045.6

The selection of columns is very flexible:

  • Not() - exclude columns from the selection

  • Cols() - union of arguments in the selection

  • regexp - a regular expression match against column names

  • N::Integer - pick the Nth column (and M:N works as you would expect)

  • : - all columns not yet selected

The selected output column ordering is respected, so allowing for easy reordering of columns. This is particularly useful when operating on a data frame with select, which we will meet below, and with the fact that : will not select columns already included.

higgs_jets = higgs_ml[:, Cols(:EventId, r"PRI_jet.*")]
50000×9 DataFrame
49980 rows omitted
RowEventIdPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64Int64Float64Float64Float64Float64Float64Float64Float64
1100000267.4352.150.44446.0621.24-2.475113.497
2100001146.2260.7251.158-999.0-999.0-999.046.226
3100002144.2512.053-2.028-999.0-999.0-999.044.251
41000030-999.0-999.0-999.0-999.0-999.0-999.0-0.0
51000040-999.0-999.0-999.0-999.0-999.0-999.00.0
6100005390.547-2.412-0.65356.1650.2243.106193.66
71000062123.010.8641.4556.8670.131-2.767179.877
8100007130.638-0.715-1.724-999.0-999.0-999.030.638
91000080-999.0-999.0-999.0-999.0-999.0-999.00.0
101000091167.735-2.767-2.514-999.0-999.0-999.0167.735
499911499902166.372-2.844-0.14856.3-1.5342.692222.672
499921499911140.7460.5931.712-999.0-999.0-999.0140.746
499931499920-999.0-999.0-999.0-999.0-999.0-999.00.0
49994149993163.7052.8842.632-999.0-999.0-999.063.705
499951499940-999.0-999.0-999.0-999.0-999.0-999.00.0
499961499953193.2970.14-1.79892.691-0.5261.653398.099
499971499962149.11-0.234-2.23558.1840.5942.188207.294
499981499972122.3-1.4143.04368.4832.5180.046190.783
499991499980-999.0-999.0-999.0-999.0-999.0-999.00.0
50000149999149.353-2.641-2.038-999.0-999.0-999.049.353

Selection from bool#

A powerful way to select data is to select rows on a boolean vector constructed from the data frame itself, e.g., to select all rows that are signal events do the following.

(Below we explain why you need to use .== to broadcast the comparison.)

higgs_ml[higgs_ml.Label .== "s", [:EventId, :Label]]
17065×2 DataFrame
17045 rows omitted
RowEventIdLabel
Int64String1
1100000s
2100006s
3100007s
4100009s
5100015s
6100017s
7100023s
8100026s
9100027s
10100028s
17056149970s
17057149978s
17058149981s
17059149986s
17060149990s
17061149991s
17062149993s
17063149997s
17064149998s
17065149999s

Views#

With view() or @view we create a view into a dataframe, which is fast and efficient

leading_jets = @view higgs_ml[:, r"PRI_jet_leading.*"]
50000×3 SubDataFrame
49980 rows omitted
RowPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phi
Float64Float64Float64
167.4352.150.444
246.2260.7251.158
344.2512.053-2.028
4-999.0-999.0-999.0
5-999.0-999.0-999.0
690.547-2.412-0.653
7123.010.8641.45
830.638-0.715-1.724
9-999.0-999.0-999.0
10167.735-2.767-2.514
49991166.372-2.844-0.148
49992140.7460.5931.712
49993-999.0-999.0-999.0
4999463.7052.8842.632
49995-999.0-999.0-999.0
49996193.2970.14-1.798
49997149.11-0.234-2.235
49998122.3-1.4143.043
49999-999.0-999.0-999.0
5000049.353-2.641-2.038

Just to emphasise the point, higgs_jets is an independent data frame, and leading_jets is a data frame view into the primary data.

Broadcast Assignment#

To broadcast operations across a data frame, we use Julia’s .= operation

mini_higgs[!, :PRI_tau_pt] .= 999.0
mini_higgs
5×2 DataFrame
RowEventIdPRI_tau_pt
Int64Float64
1100000999.0
2100001999.0
3666999.0
4100003999.0
5100004999.0
mini_higgs[!, :EventId] .+= 10
mini_higgs
5×2 DataFrame
RowEventIdPRI_tau_pt
Int64Float64
1100010999.0
2100011999.0
3676999.0
4100013999.0
5100014999.0

Adding New Data#

Adding new data to a data frame is just a matter of assigning to a new column (using the Julia symbol for the name is useful)

mini_higgs[:, :name] = ["alice", "bob", "ciarn", "dinah", "elmer"]
mini_higgs
5×3 DataFrame
RowEventIdPRI_tau_ptname
Int64Float64String
1100010999.0alice
2100011999.0bob
3676999.0ciarn
4100013999.0dinah
5100014999.0elmer

Data Manipulation#

So much for selecting and replacing data - how do we do more interesting thing?

The first thing we might want to do is ensure that we can select events that match some particular criteria - for that we can use the filter function, like this:

lots_o_jets(n_jets) = n_jets >= 3
filter(:PRI_jet_num => lots_o_jets, higgs_jets)
4436×9 DataFrame
4416 rows omitted
RowEventIdPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64Int64Float64Float64Float64Float64Float64Float64Float64
1100005390.547-2.412-0.65356.1650.2243.106193.66
2100011376.773-0.790.30356.8761.773-2.079165.64
31000313182.4491.3830.00138.006-1.257-0.609253.461
41000383114.6020.6190.16577.0532.433-2.637341.947
5100039388.399-2.168-1.42377.27-2.3851.876198.632
6100059380.042-0.8561.30452.5010.638-1.114182.413
7100060378.174-1.668-0.97858.097-0.989-1.727212.314
8100070359.4011.342-0.36953.711-2.5772.14162.577
9100077356.9510.749-0.29642.88-2.229-2.825140.06
10100082385.392-1.0620.16681.5590.513-2.255343.858
4427149845385.346-3.078-1.29266.050.022-2.06184.672
4428149855359.409-0.608-0.33937.6833.515-2.542164.274
4429149856361.1851.019-0.52758.4233.314-2.992155.665
44301498903273.5390.8072.5778.5410.353-0.696441.915
44311499003128.952-1.287-0.64337.8674.046-0.265199.454
44321499423118.8621.1740.39736.505-3.3571.89191.244
44331499753283.116-1.009-1.419124.487-1.4890.036489.347
4434149983364.0151.9381.02947.6731.8242.178148.696
44351499853320.4520.758-2.373143.8981.407-1.119505.049
44361499953193.2970.14-1.79892.691-0.5261.653398.099

Usually one would not want to bother with a named function for these kind of trivial selections - use an anonymous function:

filter(:PRI_jet_num => nj -> nj >= 3, higgs_jets)
4436×9 DataFrame
4416 rows omitted
RowEventIdPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64Int64Float64Float64Float64Float64Float64Float64Float64
1100005390.547-2.412-0.65356.1650.2243.106193.66
2100011376.773-0.790.30356.8761.773-2.079165.64
31000313182.4491.3830.00138.006-1.257-0.609253.461
41000383114.6020.6190.16577.0532.433-2.637341.947
5100039388.399-2.168-1.42377.27-2.3851.876198.632
6100059380.042-0.8561.30452.5010.638-1.114182.413
7100060378.174-1.668-0.97858.097-0.989-1.727212.314
8100070359.4011.342-0.36953.711-2.5772.14162.577
9100077356.9510.749-0.29642.88-2.229-2.825140.06
10100082385.392-1.0620.16681.5590.513-2.255343.858
4427149845385.346-3.078-1.29266.050.022-2.06184.672
4428149855359.409-0.608-0.33937.6833.515-2.542164.274
4429149856361.1851.019-0.52758.4233.314-2.992155.665
44301498903273.5390.8072.5778.5410.353-0.696441.915
44311499003128.952-1.287-0.64337.8674.046-0.265199.454
44321499423118.8621.1740.39736.505-3.3571.89191.244
44331499753283.116-1.009-1.419124.487-1.4890.036489.347
4434149983364.0151.9381.02947.6731.8242.178148.696
44351499853320.4520.758-2.373143.8981.407-1.119505.049
44361499953193.2970.14-1.79892.691-0.5261.653398.099

Need to filter based on multiple columns? No problem:

filter([:PRI_jet_num, :PRI_jet_leading_pt] => (nj, ptj) -> (nj >= 3) && (ptj > 100), higgs_jets)
2302×9 DataFrame
2282 rows omitted
RowEventIdPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64Int64Float64Float64Float64Float64Float64Float64Float64
11000313182.4491.3830.00138.006-1.257-0.609253.461
21000383114.6020.6190.16577.0532.433-2.637341.947
31000843176.49-0.5582.66473.5660.49-1.616333.586
41001183148.1741.109-1.21140.8180.7961.344380.547
51001923138.456-1.1171.4354.6513.1440.688226.618
61002063127.039-0.1112.166115.8970.352-0.622322.533
71002323117.933-1.3222.03174.298-0.8811.982277.127
81002433230.6171.2421.60961.046-1.299-1.247337.151
91003113144.005-0.9252.87947.2040.41.11233.132
101003193406.435-0.3010.405132.52-1.7081.67640.728
22931496943193.1320.8291.098111.8330.7412.557446.408
22941497373118.6220.514-1.931107.43-0.1481.3305.799
22951497523140.332-1.1780.1352.2011.635-0.285242.703
22961497563114.993-2.137-1.43570.141.0061.465291.784
22971498903273.5390.8072.5778.5410.353-0.696441.915
22981499003128.952-1.287-0.64337.8674.046-0.265199.454
22991499423118.8621.1740.39736.505-3.3571.89191.244
23001499753283.116-1.009-1.419124.487-1.4890.036489.347
23011499853320.4520.758-2.373143.8981.407-1.119505.049
23021499953193.2970.14-1.79892.691-0.5261.653398.099

Notice that the arguments to the filter method follow the Julia convention, with the filter parameters given first, followed by the data object.

Of note is the selector pattern columns => filter_funtion - we shall see this repeated!

Derived Data#

using Statistics

To get summary data for a data frame, use the combine() function. There is a mini-language for applying functions to the data is the same as for filter (but note the arguments are reversed). The second => determines the destination column (which will be created, if needed). If you do not give this then a plausible name will be generated for the output.

combine(higgs_jets, :PRI_jet_all_pt => mean => :jet_pt_mean)
1×1 DataFrame
Rowjet_pt_mean
Float64
172.9546

The special property of combine is that it collapses the output down to unique values.

Scalar and vector outputs can also be combined:

combine(higgs_jets, :PRI_jet_all_pt => mean => :jet_pt_mean, :PRI_jet_num => unique => :n_jets)
4×2 DataFrame
Rowjet_pt_meann_jets
Float64Int64
172.95462
272.95461
372.95460
472.95463

But this probably isn’t quite what we wanted to do as the mean of \(p_T\) is always calculated for all jets.

To do this in a more useful way, we use the groupby() function to split the data frame up by a certain criterion:

combine(groupby(higgs_jets, :PRI_jet_num), :PRI_jet_all_pt => mean, nrow)
4×3 DataFrame
RowPRI_jet_numPRI_jet_all_pt_meannrow
Int64Float64Int64
100.019992
2164.652315518
32149.43710054
43257.4424436

Derived Data and missing values#

For some analysis, it’s pretty useful to add derived values, which we know how to do:

select(higgs_ml, :PRI_jet_leading_pt => (x -> x.^2) => :pt2)
50000×1 DataFrame
49980 rows omitted
Rowpt2
Float64
14547.48
22136.84
31958.15
4998001.0
5998001.0
68198.76
715131.5
8938.687
9998001.0
1028135.0
4999127679.6
4999219809.4
49993998001.0
499944058.33
49995998001.0
4999637363.7
4999722233.8
4999814957.3
49999998001.0
500002435.72

So far, so good, but notice that there are a lot of columns where pt==-999.0, which was the input dataset convention for a missing value, so this isn’t quite what we wanted.

We could filter out all the unphysical values, but with data frames there is an option to set such values to missing.

First, for a data frame that does not yet have missing values, first we call the allowmissing() function - this changes our columns of type T into Union{T, Missing}. Then we convert all the negative values we find into missing.

higgs_set_missing_jets = allowmissing(higgs_ml)[:, Cols(:EventId, r"PRI_jet.*")] # Just work with jets for now
missing_value(v) = if (v===missing || v<0) missing else v end
transform!(higgs_set_missing_jets, :PRI_jet_leading_pt => ByRow(missing_value) => :PRI_jet_leading_pt)
50000×9 DataFrame
49980 rows omitted
RowEventIdPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64?Int64?Float64?Float64?Float64?Float64?Float64?Float64?Float64?
1100000267.4352.150.44446.0621.24-2.475113.497
2100001146.2260.7251.158-999.0-999.0-999.046.226
3100002144.2512.053-2.028-999.0-999.0-999.044.251
41000030missing-999.0-999.0-999.0-999.0-999.0-0.0
51000040missing-999.0-999.0-999.0-999.0-999.00.0
6100005390.547-2.412-0.65356.1650.2243.106193.66
71000062123.010.8641.4556.8670.131-2.767179.877
8100007130.638-0.715-1.724-999.0-999.0-999.030.638
91000080missing-999.0-999.0-999.0-999.0-999.00.0
101000091167.735-2.767-2.514-999.0-999.0-999.0167.735
499911499902166.372-2.844-0.14856.3-1.5342.692222.672
499921499911140.7460.5931.712-999.0-999.0-999.0140.746
499931499920missing-999.0-999.0-999.0-999.0-999.00.0
49994149993163.7052.8842.632-999.0-999.0-999.063.705
499951499940missing-999.0-999.0-999.0-999.0-999.00.0
499961499953193.2970.14-1.79892.691-0.5261.653398.099
499971499962149.11-0.234-2.23558.1840.5942.188207.294
499981499972122.3-1.4143.04368.4832.5180.046190.783
499991499980missing-999.0-999.0-999.0-999.0-999.00.0
50000149999149.353-2.641-2.038-999.0-999.0-999.049.353

Of note is the convenience function ByRow() that takes care of broadcasting the function to each row in the column(s).

Also we used here the transform function - this is very like filter, but more specialised for DataFrames (the argument order is different, with the data frame coming first). In fact we used transform!, so we modified directly the higgs_set_missing_jets data frame.

select!(higgs_set_missing_jets, :EventId, :PRI_jet_leading_pt => (x -> x.^2) => :pt2, :)
50000×10 DataFrame
49980 rows omitted
RowEventIdpt2PRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_ptPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64?Float64?Int64?Float64?Float64?Float64?Float64?Float64?Float64?Float64?
11000004547.48267.4352.150.44446.0621.24-2.475113.497
21000012136.84146.2260.7251.158-999.0-999.0-999.046.226
31000021958.15144.2512.053-2.028-999.0-999.0-999.044.251
4100003missing0missing-999.0-999.0-999.0-999.0-999.0-0.0
5100004missing0missing-999.0-999.0-999.0-999.0-999.00.0
61000058198.76390.547-2.412-0.65356.1650.2243.106193.66
710000615131.52123.010.8641.4556.8670.131-2.767179.877
8100007938.687130.638-0.715-1.724-999.0-999.0-999.030.638
9100008missing0missing-999.0-999.0-999.0-999.0-999.00.0
1010000928135.01167.735-2.767-2.514-999.0-999.0-999.0167.735
4999114999027679.62166.372-2.844-0.14856.3-1.5342.692222.672
4999214999119809.41140.7460.5931.712-999.0-999.0-999.0140.746
49993149992missing0missing-999.0-999.0-999.0-999.0-999.00.0
499941499934058.33163.7052.8842.632-999.0-999.0-999.063.705
49995149994missing0missing-999.0-999.0-999.0-999.0-999.00.0
4999614999537363.73193.2970.14-1.79892.691-0.5261.653398.099
4999714999622233.82149.11-0.234-2.23558.1840.5942.188207.294
4999814999714957.32122.3-1.4143.04368.4832.5180.046190.783
49999149998missing0missing-999.0-999.0-999.0-999.0-999.00.0
500001499992435.72149.353-2.641-2.038-999.0-999.0-999.049.353

Notice that missing values were handled nicely!

We also used the selectfunction here. This is very similar to transform, just that only the columns which we ask for are included in the result, rather than all columns. As select returns the columns in the order we ask for it was simple to add the new pt2 column where we wanted it to be.

It should be noted that missing values are not some special magical implementation. They are a well defined data type in Julia (the type is Missing), for which common arithmetic operations are well defined:

@show missing * missing
@show 1.0 + missing
@show missing * 3;
missing * missing = missing
1.0 + missing = missing
missing * 3 = missing

This is a great example of how Julia’s type system works so powerfully with multiple dispatch!

Transform, Select, Combine, GroupBy, Filter#

Just as a short summary of the data frame manipulation functions we met:

Function

Description

transform

Apply a transformation operation to one or more columns, return all columns plus any new ones

select

Apply a transformation operation to one or more columns, only return columns that are selected, in the order requested

combine

Apply a transformation operation, then collapse the result for identical output rows

groupby

Split a data frame into pieces according to a certain criterion

filter

Apply a selection operation to a data frame - argument order follows the method convention

The use of groupby and combine allows us to powerfully manipulate data in Julia using the well known Split, Combine, Apply strategy, originally introduced for S.

Visualising#

There is extremely good integration between the Julia plotting ecosystem and data frames. Here we give a quick tour of some of the plots that we can easily make with this data:

using Plots
using StatsPlots

For convenience, we’ll create a subset of our data, selecting higher \(p_T\) jets. We also want to benefit from missing columns.

interesting_jets = allowmissing(filter([:PRI_jet_num, :PRI_jet_leading_pt] => (nj, ptj) -> (nj >= 1) && (ptj > 100), higgs_jets)[1:2000, :])
select!(interesting_jets, :EventId, :PRI_jet_subleading_pt => ByRow(missing_value) => :PRI_jet_subleading_pt, :)
2000×9 DataFrame
1980 rows omitted
RowEventIdPRI_jet_subleading_ptPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_all_pt
Int64?Float64?Int64?Float64?Float64?Float64?Float64?Float64?Float64?
110000656.8672123.010.8641.450.131-2.767179.877
2100009missing1167.735-2.767-2.514-999.0-999.0167.735
310002382.4772195.5331.1561.416-0.798-2.785278.009
410002743.4582170.712-1.9612.222.974-0.103214.17
510003138.0063182.4491.3830.001-1.257-0.609253.461
610003877.0533114.6020.6190.1652.433-2.637341.947
710005756.312214.449-0.0581.5251.151-1.743270.759
810007870.7862101.9343.1390.444-2.683-0.567172.721
9100079missing1116.316-1.1710.641-999.0-999.0116.316
1010008473.5663176.49-0.5582.6640.49-1.616333.586
1991112539missing1104.845-1.334-2.465-999.0-999.0104.845
199211254562.0193164.8750.848-1.4082.135-2.548274.187
199311254749.6743114.0130.941-2.161.034-2.971235.454
1994112550missing1112.6010.1241.47-999.0-999.0112.601
199511255360.9412199.28-0.427-1.3971.675-2.371260.221
1996112554123.6963154.292-1.860.095-2.2472.389376.563
199711255671.0093106.3720.209-1.612-0.4023.118223.655
1998112568199.5743208.952-1.6861.2-1.1371.357440.235
199911258170.1832117.8770.31-2.573-2.504-0.281188.06
200011258248.5153106.2572.108-1.887-1.021-2.734202.71

The first example is a simple scatter plot of the \((\eta, \phi)\) coordinates of the leading jet:

@df interesting_jets scatter(:PRI_jet_leading_eta, :PRI_jet_leading_phi, label="Jet Location", xlabel="η", ylabel="ϕ")

For this data the marginal histogram is always an interesting way to look at the data distribution

@df interesting_jets marginalhist(:PRI_jet_leading_eta, :PRI_jet_leading_phi, label="Jet Location", bins=20, xlabel="η", ylabel="ϕ")

Now we can look at a histogram of the leading jets’ \(p_T\):

@df interesting_jets histogram(:PRI_jet_leading_pt, label="Leading Jet pT")

And it’s very easy to plot both the leading and subleading distribution together:

@df interesting_jets histogram(:PRI_jet_leading_pt, alpha=0.4, label="Leading Jet pT")
@df interesting_jets histogram!(:PRI_jet_subleading_pt, alpha=0.4, label="Subleading Jet pT")

The marginal histogram shows the relationship between the two jet components:

@df interesting_jets marginalhist(:PRI_jet_leading_pt, :PRI_jet_subleading_pt, label="Jet Location", bins=20, xlims=(0, 550), ylims=(0,550))

Finally, let’s look at the distance in \((\eta, \phi)\) space between the leading and subleading jet.

# Clean up the "missing data" columns for the subleading jet
select!(interesting_jets, :EventId, 
    :PRI_jet_subleading_eta => ByRow(missing_value) => :PRI_jet_subleading_eta, 
    :PRI_jet_subleading_phi => ByRow(missing_value) => :PRI_jet_subleading_phi, :);

We define a function dist for the cartesian distance:

δϕ(ϕ1, ϕ2) = begin
    δ = ϕ1 - ϕ2
    while δ > pi
        δ -= 2π
    end
    while δ < -pi
        δ += 2π
    end
    δ
end
dist(η1::Number, ϕ1::Number, η2::Number, ϕ2::Number) = sqrt((η1-η2)^2 + δϕ(ϕ1, ϕ2)^2)
dist(η1, ϕ1, η2, ϕ2) = missing
dist (generic function with 2 methods)

This is a little bit tricksy - we need to ensure that

  • when the ϕ separation is calculated we normalise it to [-π, π]

  • if any of the values are missing then the distance is missing

    • we use Julia’s multiple dispatch to achieve that by defining two implementations for the dist method

Now run this function to calculate the angle between the two jets:

select!(interesting_jets, :EventId,
    [:PRI_jet_leading_eta, :PRI_jet_leading_phi, :PRI_jet_subleading_eta, :PRI_jet_subleading_phi] => ByRow(dist) => :Jet_distance, :)
2000×10 DataFrame
1980 rows omitted
RowEventIdJet_distancePRI_jet_subleading_etaPRI_jet_subleading_phiPRI_jet_subleading_ptPRI_jet_numPRI_jet_leading_ptPRI_jet_leading_etaPRI_jet_leading_phiPRI_jet_all_pt
Int64?Float64?Float64?Float64?Float64?Int64?Float64?Float64?Float64?Float64?
1100006missing0.131missing56.8672123.010.8641.45179.877
2100009missingmissingmissingmissing1167.735-2.767-2.514167.735
3100023missingmissingmissing82.4772195.5331.1561.416278.009
4100027missing2.974missing43.4582170.712-1.9612.22214.17
5100031missingmissingmissing38.0063182.4491.3830.001253.461
6100038missing2.433missing77.0533114.6020.6190.165341.947
7100057missing1.151missing56.312214.449-0.0581.525270.759
8100078missingmissingmissing70.7862101.9343.1390.444172.721
9100079missingmissingmissingmissing1116.316-1.1710.641116.316
10100084missing0.49missing73.5663176.49-0.5582.664333.586
1991112539missingmissingmissingmissing1104.845-1.334-2.465104.845
1992112545missing2.135missing62.0193164.8750.848-1.408274.187
1993112547missing1.034missing49.6743114.0130.941-2.16235.454
1994112550missingmissingmissingmissing1112.6010.1241.47112.601
1995112553missing1.675missing60.9412199.28-0.427-1.397260.221
1996112554missingmissing2.389123.6963154.292-1.860.095376.563
1997112556missingmissing3.11871.0093106.3720.209-1.612223.655
1998112568missingmissing1.357199.5743208.952-1.6861.2440.235
1999112581missingmissingmissing70.1832117.8770.31-2.573188.06
2000112582missingmissingmissing48.5153106.2572.108-1.887202.71

Now we can look at the distribution of the separation:

@df interesting_jets histogram(:Jet_distance, label="Jet Separation")
@df interesting_jets marginalhist(:PRI_jet_leading_pt, :Jet_distance, label="Jet Separation by Primary pT", bins=20)