Copyright 2014 Sebastian Raschka

smilite

smilite is a Python module to download and analyze SMILES strings (Simplified Molecular-Input Line-Entry System) of chemical compounds from ZINC (a free database of commercially available compounds for virtual screening, http://zinc.docking.org).
Now supports both Python 3.x and Python 2.x.

Sections

Installation
Documentation
Command Line Scripts Examples
      - gen_zincid_smile_csv.py (downloading SMILES)
      - comp_smile_strings.py (checking for duplicates within 1 file)
      - comp_2_smile_files.py (checking for duplicates across 2 files)
Contact
Changelog



Installation

You can use the following command to install smilite:
pip install smilite
or
easy_install smilite

Alternatively, you can download the package manually from the Python Package Index https://pypi.python.org/pypi/smilite, unzip it, navigate into the package, and use the command:

python3 setup.py install



Documentation

After you have installed the smilite module, you can import it in Python via import smilite. The current functions include:

def get_zinc_smile(zinc_id):
    """
    Gets the corresponding SMILE string for a ZINC ID query from
    the ZINC online database. Requires an internet connection.

    Keyword arguments:
        zinc_id (str): A valid ZINC ID, e.g. 'ZINC00029323'

    Returns the SMILE string for the corresponding ZINC ID.
        E.g., 'COc1cccc(c1)NC(=O)c2cccnc2'

    """
def simplify_smile(smile_str):
    """ 
    Simplifies a SMILE string by removing hydrogen atoms (H), 
    chiral specifications ('@'), charges (+ / -), '#'-characters,
    and square brackets ('[', ']').

    Keyword arguments:
        smile_str (str): A smile string, e.g., C[C@H](CCC(=O)NCCS(=O)(=O)[O-])
    
    Returns a simplified SMILE string, e.g., CC(CCC(=O)NCCS(=O)(=O)O)

    """
def generate_zincid_smile_csv(zincid_list, out_file, print_progress_bar=False):
    """
    Generates a CSV file of ZINC_ID,SMILE_string entries by querying the ZINC online
    database.

    Keyword arguments:
        zincid_list (str): Path to a UTF-8 or ASCII formatted file 
             that contains 1 ZINC_ID per row. E.g., 
             ZINC0000123456
             ZINC0000234567
             [...]
        out_file (str): Path to a new output CSV file that will be written.
        print_progress_bar (bool): Prints a progress bar to the screen if True.

    """
def check_duplicate_smiles(zincid_list, out_file, compare_simplified_smiles=False):
    """
    Scans a ZINC_ID,SMILE_string CSV file for duplicate SMILE strings.

    Keyword arguments:
        zincid_list (str): Path to a UTF-8 or ASCII formatted file that 
               contains 1 ZINC_ID + 1 SMILE String per row.
               E.g., 
               ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC
               ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O
               [...]
        out_file (str): Path to a new output CSV file that will be written.
        compare_simplified_smiles (bool): If true, SMILE strings will be simplified
               for the comparison.
       
    """
def comp_two_files(zincid_list1, zincid_list2, out_file, compare_simplified_smiles=False):
    """
    Compares SMILE strings across two ZINC_ID files for duplicates 
    (does not check for duplicates within each file).

    Keyword arguments:
        zincid_list1 (str): Path to a UTF-8 or ASCII formatted file that 
               contains 1 ZINC_ID + 1 SMILE String per row.
               E.g., 
               ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC
               ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O
               [...]
        zincid_list2 (str): Path to the second ZINC_ID list file, formatted like zincid_list1.
        out_file (str): Path to a new output CSV file that will be written.
        compare_simplified_smiles (bool): If true, SMILE strings will be simplified
               for the comparison.
       
    """

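To make the simplification rule from the simplify_smile docstring concrete, here is a minimal sketch. It is a hypothetical stand-in written only from the docstring above (the name simplify_smile_sketch is an assumption), so the library's actual implementation may differ in detail.

```python
# Hypothetical stand-in for smilite.simplify_smile, derived only from its
# docstring; the real implementation in the package may differ.
_REMOVED_CHARS = "H@+-#[]"  # hydrogens, chirality, charges, '#', brackets


def simplify_smile_sketch(smile_str):
    """Return smile_str without the characters listed in _REMOVED_CHARS."""
    return "".join(ch for ch in smile_str if ch not in _REMOVED_CHARS)


# Docstring example from above:
print(simplify_smile_sketch("C[C@H](CCC(=O)NCCS(=O)(=O)[O-])"))
# prints: CC(CCC(=O)NCCS(=O)(=O)O)
```

With smilite installed, the corresponding call would be smilite.simplify_smile(smile_str), as documented above.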


Command Line Scripts Examples

If you downloaded the smilite package from https://pypi.python.org/pypi/smilite or https://github.com/rasbt/smilite, you can use the command line scripts I provide in the scripts/ directory.



gen_zincid_smile_csv.py (downloading SMILES)

Generates a ZINC_ID,SMILE_STR CSV file from an input file of ZINC IDs. The input file should consist of 1 column with 1 ZINC ID per row.
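If the ZINC IDs are held in a Python list, an input file in the one-ID-per-row layout can be written with a few lines. This is only a sketch: the helper names write_zinc_id_file and read_zinc_id_file are hypothetical and not part of smilite, the second ID is a placeholder, and the code targets Python 3.

```python
import os
import tempfile


def write_zinc_id_file(zinc_ids, path):
    """Write one ZINC ID per row to a UTF-8 text file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for zinc_id in zinc_ids:
            handle.write(zinc_id.strip() + "\n")


def read_zinc_id_file(path):
    """Read the IDs back, skipping empty rows (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Write to a temporary directory so the example leaves no files behind
# in the working directory.
example_path = os.path.join(tempfile.mkdtemp(), "zinc_ids.csv")
write_zinc_id_file(["ZINC00029323", "ZINC12345678"], example_path)
print(read_zinc_id_file(example_path))
```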

Usage:
[shell]>> python3 gen_zincid_smile_csv.py in.csv out.csv

Example:
[shell]>> python3 gen_zincid_smile_csv.py ../examples/zinc_ids.csv ../examples/zid_smiles.csv

Screen Output:

Downloading SMILES
0%                          100%
[##########                    ] | ETA[sec]: 106.525 


Input example file format:

zinc_ids.csv


Output example file format:

zid_smiles.csv



comp_smile_strings.py (checking for duplicates within 1 file)

Compares SMILE strings within a 2-column CSV file (ZINC_ID,SMILE_string) to identify duplicates. Generates a new CSV file in which the 3rd column holds the number of duplicates and the 4th to nth columns list the ZINC IDs of the identified duplicates.

Usage:
[shell]>> python3 comp_smile_strings.py in.csv out.csv [simplify]

Example 1:
[shell]>> python3 comp_smile_strings.py ../examples/zid_smiles.csv ../examples/comp_smiles.csv


Input example file format:

zid_smiles.csv


Output example file format 1:

comp_smiles.csv


Where
- 1st column: ZINC ID
- 2nd column: SMILE string
- 3rd column: number of duplicates
- 4th-nth column: ZINC IDs of duplicates
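The column layout above can be illustrated with a small in-memory sketch. It is not the script's actual code: the function name find_duplicates is hypothetical, the IDs are placeholders, SMILE strings are compared by exact string equality, and the sketch assumes that every input row is written out, including rows without duplicates.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Build output rows [zinc_id, smile, n_duplicates, dup_id_1, ...].

    rows: list of (zinc_id, smile) tuples. Hypothetical helper that only
    mirrors the column layout described above.
    """
    ids_by_smile = defaultdict(list)
    for zinc_id, smile in rows:
        ids_by_smile[smile].append(zinc_id)

    out = []
    for zinc_id, smile in rows:
        # Every other ID that shares this exact SMILE string is a duplicate.
        dups = [other for other in ids_by_smile[smile] if other != zinc_id]
        out.append([zinc_id, smile, len(dups)] + dups)
    return out


rows = [("ZINC_A", "CCO"), ("ZINC_B", "CCN"), ("ZINC_C", "CCO")]  # placeholder IDs
for row in find_duplicates(rows):
    print(",".join(str(field) for field in row))
# prints:
# ZINC_A,CCO,1,ZINC_C
# ZINC_B,CCN,0
# ZINC_C,CCO,1,ZINC_A
```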


Example 2:
[shell]>> python3 comp_smile_strings.py ../examples/zid_smiles.csv ../examples/comp_simple_smiles.csv simplify


Output example file format 2:
comp_simple_smiles.csv



comp_2_smile_files.py (checking for duplicates across 2 files)

Compares SMILE strings between 2 input CSV files, where each file consists of rows with the 2 columns ZINC_ID,SMILE_string, to identify duplicate SMILE strings across both files.
Generates a new CSV file with the ZINC IDs of identified duplicates listed in the 4th to nth columns.

Usage:
[shell]>> python3 comp_2_smile_files.py in1.csv in2.csv out.csv [simplify]

Example:
[shell]>> python3 comp_2_smile_files.py ../examples/zid_smiles2.csv ../examples/zid_smiles3.csv ../examples/comp_2_files.csv


Input example file 1:

zid_smiles2.csv


Input example file 2:

zid_smiles3.csv


Output example file format:

comp_2_files.csv


Where:
- 1st column: name of the origin file
- 2nd column: ZINC ID
- 3rd column: SMILE string
- 4th-nth column: ZINC IDs of duplicates
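A cross-file comparison that produces this layout could look roughly like the following sketch. The function name compare_two_lists, the file names, and the IDs are hypothetical. The sketch assumes exact string comparison and that rows without a match in the other file are still written, which may differ from what the script does.

```python
from collections import defaultdict


def _index_by_smile(rows):
    """Map each SMILE string to the list of ZINC IDs that carry it."""
    by_smile = defaultdict(list)
    for zinc_id, smile in rows:
        by_smile[smile].append(zinc_id)
    return by_smile


def compare_two_lists(name1, rows1, name2, rows2):
    """Build rows [origin_file, zinc_id, smile, dup_id_1, ...].

    Duplicates are looked up only in the other file, never within the
    same file. Hypothetical helper mirroring the layout described above.
    """
    index1, index2 = _index_by_smile(rows1), _index_by_smile(rows2)
    out = []
    for name, rows, other in ((name1, rows1, index2), (name2, rows2, index1)):
        for zinc_id, smile in rows:
            out.append([name, zinc_id, smile] + other.get(smile, []))
    return out


rows1 = [("ZINC_A", "CCO"), ("ZINC_B", "CCN")]  # placeholder IDs
rows2 = [("ZINC_C", "CCO")]
for row in compare_two_lists("file1.csv", rows1, "file2.csv", rows2):
    print(",".join(row))
# prints:
# file1.csv,ZINC_A,CCO,ZINC_C
# file1.csv,ZINC_B,CCN
# file2.csv,ZINC_C,CCO,ZINC_A
```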



Contact

If you have any questions or comments about smilite, please feel free to contact me via
email: se.raschka@gmail.com
or Twitter: @rasbt



Changelog

VERSION 1.3.0

VERSION 1.2.0

VERSION 1.1.1

VERSION 1.1.0