EigenstratFormat.jl

Introduction

EigenstratFormat.jl is a library for the Julia programming language to work with the Eigenstrat data format.

The Eigenstrat format is often used in genetics. Its definition can be found at David Reich's laboratory and in more detail in the Eigensoft package.

Getting started

You can install EigenstratFormat from the Julia interpreter.

julia> ]
pkg> add EigenstratFormat
pkg> Press BACKSPACE
julia> using EigenstratFormat

After that you may want to take a look at the code examples.

  • 01_files.jl demonstrates basic file operations.
  • 02_hgdp.jl shows how to extract samples from the AADR database.
  • 03_pca.jl performs a PCA and creates a simple PCA plot.
  • 04_pca_nice.jl create a nice looking PCA plot with colored markers and a legend.

Human DNA samples

Some sources for human DNA samples in Eigenstrat format are:

Most of these samples are ancient, some are from modern pupulations.

Functions

EigenstratFormat._allelesMethod
_alleles(bytes::Vector{UInt8}, idx::Int64)

Return the number of variant alleles for a specified individual.

bytes: Row of bytes that encode an SNP for all individuals.

idx: Index of the individual.

source
EigenstratFormat._bitpairMethod
_bitpair(byte::UInt8, pos::Int64)

Extract 2 bits from a byte. The bit positions start at 0. Allowed are values 0, 1, 2, 3.

source
EigenstratFormat._encodeMethod
_encode(
    genotype::Tuple{Char, Char},
    byte::UInt8,
    position::Int64,
    reference_snp::Tuple{Char, Char}
)

Encode the given genotype in a byte at the given bitpair position (0..3).

Each SNP is characterized by a Tuple of two alleles. The reference_snp contains the reference allele and the derived allele.

source
EigenstratFormat._hashitMethod
_hashit(sequence::String)

Calculate the hashsum of a String.

This is basically the same method as in Nick Patterson's original C code.

However, the original C version uses 32 bit integer values and integer overflows which may result in undefined behavior on some machines. This method uses 64 bit integers to circumvent this problem.

This method will probably fail if the sequence contains non-ASCII Unicode characters. I do not know if this is defined in any way.

source
EigenstratFormat._hashsumMethod
_hashsum(sequences::Vector{<:AbstractString})

Calculate the hashsum of a Vector of Strings using the _hashit() method.

This is the same method as the hasharr function in Nick Patterson's C code.

source
EigenstratFormat.add_individualMethod
add_individual(
    inprefix::AbstractString,
    outprefix::AbstractString,
    ind_snp_file::AbstractString,
    id::AbstractString;
    gender = "U",
    status = "Control"
)

Add an individual to a database in Eigenstrat format.

The SNPs in the database remain untouched. If the individual displayes SNPs that are not listed in the database or multiallelic ones those SNPs are removed.

inprefix: Prefix of the input database.

outprefix: Prefix of the output database.

ind_snp_file: File containing SNP results for the individual. This should work with files from Family Tree DNA Family Finder, MyHeritage, LivingDNA and 23andMe.

id: ID of the individual. For living persons I recommed the name.

gender: U, F or M (Unknown, Female or Male)

status: Control, Case or a population label.

source
EigenstratFormat.hash_idsMethod
hash_ids(filename::AbstractString)

Create the hashsum of .snp and .ind files. SNP and Individual files contain an ID in the first column. This method uses those IDs to calculate a hashsum. The sums are needed to store genotype data in packed format.

source
EigenstratFormat.impute_missingMethod
impute_missing(genomatrix::Matrix{<:Real})

Impute missing values by replacing them with mean values.

Return a Matrix{Float64}.

It is assumed that all SNPs are biallelic. This function can really eat up lots of memory because the computation is not done in place and the result is a Float64 matrix. Maybe we should stay with UInt8 and use rounded values for missing alleles?

geno contains the number of reference (or derived) alleles for each sample and SNP. Each row represents one SNP. Each sample is represented by a column.

source
EigenstratFormat.pca!Method
pca!(geno::Matrix{<:Real}; contains_reference = true, ncomponents = 25, scaling = "genetic drift")

Perform PCA analysis on Eigenstrat data using SVD (Single value decomposition).

This method does some computations in place to save memory. This changes the genomatrix.

You can also chose if missing values in the genomatrix should be imputed. if impute = true the genomatrix is not changed.

Return a MulitvariateStats.PCA that can be used to compute the PCA components of a genomatrix by calling pca_components.

geno is a genomatrix that was read by read_eigenstrat_geno. Usually the matrix contains the number of reference alleles for each SNP and individual. geno must be a valid matrix with no missing values or invariant markers. If you are not sure call remove_invariant!(geno) and impute_missing(geno) before this function.

ncomponents is the maximum number of Eigenvalues for the PCA. Default is 25.

scaling scales the number of alleles acccording to different methods. Standard is genetic drift like recommended by Nick Patterson in Population Structure and Eigenanalysis.

contains_reference describes if the genomatrix contains the number of reference alleles or derived alleles. Standard is true like documented in David Reichs's Input File Formats. This is also how Nick Patterson's smartpca program interprets the file format.

However the R package smartsnp as described in smartsnp, an r package for fast multivariate analyses of big genomic data assumes that the genomatrix contains the number of derived alleles. To stay compatible with the R package choose contains_reference = false.

source
EigenstratFormat.pca_coordinatesMethod
pca_coordinates(M::MultivariateStats.PCA, geno::Matrix{<:Real}, sample_ids::Vector{<:AbstractString})

Return PCA coordinates for all samples of a given genomatrix and PCA M.

The result is a Dataframe which contains a sample ID followed by the PCA coordinates in each row.

ID PC1 PC2 ...

source
EigenstratFormat.read_eigenstrat_genoMethod
read_eigenstrat_geno(
    genofile::AbstractString,
    nsnp::Int64,
    nind::Int64;
    ind_idx::Vector{Int64} = [i for i = 1:nind]
)

Read a genofile in PackedAncestryMap format. The file must be unzipped.

genofile: filename

nsnp: number of SNPs listed in .snp file.

nind: number of individuals listed in .ind file.

ind_idx: Indices of individuals that should be read from the file.

XXX Check for comment lines in .snp and .ind files.

File description: File header starts with GENO or TGENO (transposed GENO). So far files in the AADR archive seem to be GENO. So this method does not support the transposed TGENO format.

The text format contains one line per genotype:

SNPID SampleID Numberofvariant_alleles

The packed format:

Each SNP entry has 2 bits: 0, 1, 2, 3=missing, that denote the number of variant alleles as described at David Reich's laboratory.

source
EigenstratFormat.read_eigenstrat_indMethod
read_eigenstrat_ind(indfile::AbstractString)

Read individuals from Eigenstrat .ind file. The IND flle contains information about each individual in the database.

Return a DataFrame consisting of the columns:

ID, Gender, Status

where

Gender: M (male), F (Female) or U (unknown).

Status: Case, Control or population label.

source
EigenstratFormat.read_eigenstrat_snpMethod
read_eigenstrat_snp(snpfile::AbstractString)

Read Eigenstrat .snp file. The SNP file contains information about each SNP.

Return a DataFrame containing the columns:

chromosome, rsid, cM, position, allele1, allele2

source
EigenstratFormat.read_snp_fileMethod
read_snp_file(filename::AbstractString)

Read file with autosomal results from FTDNA Family Finder, My Heritage or LivingDNA. Should also work with 24andMe files but not tested.

Return a DataFrame containing the columns:

rsid chromosome position genotype

source
EigenstratFormat.remove_invariant!Method
remove_invariant!(geno::Matrix{<:Real})

Remove invariant markers from a genomatrix. Return a view to the cahanged geno matrix that contains only valid markers.

Because geno matrices can get very big the original matrix is changed in place and is no longer valid.

geno is the matrix. Each row represents one single marker for multiple samples.

source
EigenstratFormat.write_23andMeMethod
write_23andMe(filename::AbstractString, snptable)

Write a table of SNPs in 23andMe file format.

This method is included to satisfy users who use Plink. Plink supports 23andMe files.

The snp table must satisfy the Tables.jl interface.

The table must contain the columns:

rsid, chromosome, position, genotype

However exact spelling is not mandatory.

source
EigenstratFormat.write_eigenstrat_genoMethod
write_eigenstrat_geno(
    genofile::AbstractString,
    genomatrix::Matrix{UInt8};
    ind_hash::Int64 = 0,
    snp_hash::Int64 = 0
)

Write a genotype matrix to file in PackedAncestryMap format.

ind_hash: Hashsum of .ind file.

snp_hash: Hashsum of .snp file.

source
EigenstratFormat.write_eigenstrat_indMethod
write_eigenstrat_ind(filename::AbstractString, inds::AbstractDataFrame)

Write information about each individual to an .ind file.

The parameter inds contains information about each individual.

The DataFrame must have the columns

ID, Gender, Status

source
EigenstratFormat.write_eigenstrat_snpMethod
write_eigenstrat_snp(filename::AbstractString, snps::DataFrame)

Write .snp file in Eigenstrat format. The SNP file contains information about each SNP. The SNPs are provided as a DataFrame in parameter snps.

The DataFrame must consist of the following columns:

chromosome, rsid, cM, position, allele1, allele2

source

Index