Various data mining algorithms have been applied by astronomers across most areas of astronomy. Long-term research programs and several dedicated mining projects have been carried out in this field, because astronomy has produced numerous large, rich datasets that are amenable to the approach, as have other areas such as medicine and high-energy physics. Examples of such projects include SKICAT (the Sky Image Cataloging and Analysis System), used for catalog construction and analysis of catalogs from digitized sky surveys, in particular the scans of the second Palomar Observatory Sky Survey; JARTool (the Jet Propulsion Laboratory Adaptive Recognition Tool), used to recognize volcanoes in the over 30,000 images of Venus returned by the Magellan mission; the subsequent and more general Diamond; and the Lawrence Livermore National Laboratory Sapphire project.

1. OBJECT CLASSIFICATION

Classification is a crucial preliminary step in the scientific method, as it provides a way of arranging information that can be used to make hypotheses and to compare with models. The two most useful concepts in object classification are completeness and efficiency, also known as recall and precision. They are generally defined in terms of true and false positives (TP and FP) and true and false negatives (TN and FN). The completeness is the fraction of objects truly of a given type that are classified as that type,

    completeness = TP / (TP + FN),

and the efficiency is the fraction of objects classified as a given type that are genuinely of that type,

    efficiency = TP / (TP + FP).

These two quantities are astrophysically interesting because, while one would like both high completeness and high efficiency, there is usually a tradeoff between them.
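The definitions above can be sketched in a few lines of Python; the confusion-matrix counts are hypothetical, chosen only to illustrate the calculation.

```python
# Completeness (recall) and efficiency (precision) from confusion-matrix counts.
# Hypothetical example: TP = galaxies correctly labelled as galaxies,
# FP = stars mislabelled as galaxies, FN = galaxies mislabelled as stars.
TP, FP, FN = 900, 50, 100

completeness = TP / (TP + FN)   # fraction of true galaxies that are recovered
efficiency = TP / (TP + FP)     # fraction of galaxy labels that are correct

print(f"completeness = {completeness:.3f}")  # prints 0.900
print(f"efficiency   = {efficiency:.3f}")    # prints 0.947
```

Note that raising the classifier's threshold to exclude the 50 false positives would typically also reject some true galaxies, illustrating the completeness-efficiency tradeoff.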
Which of the two is paramount depends on the application: a search for rare objects generally requires high completeness while tolerating some contamination (lower efficiency), whereas statistical clustering of cosmological objects requires high efficiency, even at the expense of completeness.

Ø Star-Galaxy Separation

Due to their physical size compared to their distance from us, most stars are unresolved in photometric datasets and therefore appear as point sources. Galaxies, despite being further away, generally subtend a larger angle and appear as extended sources. However, other astrophysical objects such as quasars and supernovae also appear as point sources. Thus, the separation of a photometric catalog into stars and galaxies, or more generally into stars, galaxies, and other objects, is an important problem.
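A minimal sketch of the point-source/extended-source distinction, assuming a single hypothetical morphological parameter: the ratio of an object's half-light radius to the seeing (PSF width). The threshold and catalog values are invented for illustration, not taken from any real survey pipeline.

```python
# Toy star-galaxy separation: point sources (stars) have sizes close to the
# PSF, while extended sources (galaxies) are measurably larger.
def classify(half_light_radius, seeing, threshold=1.2):
    """Label an object 'star' or 'galaxy' by its size relative to the PSF."""
    return "galaxy" if half_light_radius / seeing > threshold else "star"

catalog = [
    {"id": 1, "r_hl": 0.9, "seeing": 0.8},  # barely resolved -> star
    {"id": 2, "r_hl": 2.5, "seeing": 0.8},  # clearly extended -> galaxy
]
for obj in catalog:
    print(obj["id"], classify(obj["r_hl"], obj["seeing"]))
```

Real classifiers replace the single hand-tuned threshold with a decision boundary learned from many such parameters, which is precisely where the algorithms discussed next come in.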
The number of stars and galaxies in typical surveys (of order 10^8 or more) requires that such separation be automated. This problem is well studied, and automated approaches were employed even before current data mining algorithms became popular, for instance during the digitization by scanning of photographic plates by machines such as the APM and DPOSS. Several data mining algorithms have been applied, including ANN, DT, mixture modeling, and SOM, with most algorithms achieving efficiencies of around 95%. Typically, the classification is performed using a set of measured morphological parameters derived from the survey photometry, perhaps supplemented by colors or other information such as the seeing. The advantage of the data mining approach is that all such information about each object is easily incorporated.

Ø Galaxy Morphology

Galaxies come in a range of sizes and shapes or, collectively, morphology. The most well-known system for the morphological classification of galaxies is the Hubble sequence of elliptical, spiral, barred spiral, and irregular, along with various subclasses.
This system correlates with many physical properties known to be important in the formation and evolution of galaxies. Because galaxy morphology is a complex phenomenon that correlates with the underlying physics but is not unique to any one given process, the Hubble sequence has endured, despite being rather subjective and based on visible-light morphology originally derived from blue-biased photographic plates. The Hubble sequence has been extended in various ways, and for data mining purposes the T system has been extensively used. This system maps the categorical Hubble types E, S0, Sa, Sb, Sc, Sd, and Irr onto the numerical values -5 to 10. One can train a supervised algorithm to assign T types to images for which measured parameters are available. Such parameters can be purely morphological, or include other information such as color.
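As a hedged sketch of this supervised setup, the snippet below predicts a numerical T type with a 1-nearest-neighbour rule. The features (a concentration index and a g-r color) and all training values are invented for illustration; the studies cited in this section use ANNs on many measured parameters.

```python
# Toy T-type prediction: ellipticals are concentrated and red (T ~ -5),
# late-type spirals are diffuse and blue (T ~ 7). Training pairs are
# (features, T type) with entirely hypothetical numbers.
import math

train = [
    ((0.9, 0.9), -5),  # elliptical
    ((0.6, 0.7),  0),  # S0
    ((0.4, 0.5),  3),  # Sb
    ((0.2, 0.3),  7),  # Sd
]

def predict_t_type(features):
    """Return the T type of the nearest training galaxy in feature space."""
    return min(train, key=lambda pair: math.dist(pair[0], features))[1]

print(predict_t_type((0.85, 0.88)))  # prints -5 (elliptical-like input)
```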
A series of papers by Lahav and collaborators do exactly this, applying ANNs to predict the T types of galaxies at low redshift and finding accuracy comparable to that of human experts. ANNs have also been applied to higher-redshift data to distinguish between normal and peculiar galaxies, and the fundamentally topological and unsupervised SOM ANN has been used to classify galaxies from Hubble Space Telescope images, where the initial distribution of classes is unknown. Likewise, ANNs have been used to obtain morphological types from galaxy spectra.

2. PHOTOMETRIC REDSHIFTS

An area of astrophysics that has greatly increased in popularity in the last few years is the estimation of redshifts from photometric data (photo-zs). This is because, although the distances are less accurate than those obtained with spectra, the sheer number of objects with photometric measurements can often make up for the reduction in individual accuracy by suppressing the statistical noise of an ensemble calculation. The two most common approaches to photo-zs are the template method and the empirical training-set method.
The template approach has several difficult issues, including calibration, zero-points, priors, multi-wavelength performance (e.g., poor in the mid-infrared), and difficulty handling missing or incomplete training data. We focus in this review on the empirical approach, as it is an implementation of supervised learning.

Ø Galaxies

At low redshifts, the calculation of photometric redshifts for normal galaxies is quite straightforward due to the break in the typical galaxy spectrum at 4000 Å. Thus, as a galaxy is redshifted with increasing distance, the color (measured as a difference in magnitudes) changes relatively smoothly.
As a result, both template and empirical photo-z approaches obtain similar results, a root-mean-square deviation of ~0.02 in redshift, which is close to the best possible given the intrinsic spread in the properties. This has been shown with ANNs, SVM, DT, kNN, empirical polynomial relations, numerous template-based studies, and several other methods.
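The empirical approach can be illustrated with kNN, one of the methods just mentioned: average the spectroscopic redshifts of the training galaxies nearest in color space. The colors and redshifts below are invented stand-ins; real applications train on large spectroscopic samples.

```python
# Toy empirical photo-z: k-nearest-neighbour averaging in (u-g, g-r) color
# space. Training pairs are (colors, spectroscopic z), all hypothetical.
import math

train = [
    ((1.2, 0.5), 0.05),
    ((1.5, 0.7), 0.10),
    ((1.8, 0.9), 0.15),
    ((2.1, 1.1), 0.20),
]

def photo_z(colors, k=2):
    """Average the spectroscopic redshifts of the k nearest neighbours."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], colors))[:k]
    return sum(z for _, z in nearest) / k

print(round(photo_z((1.6, 0.8)), 3))  # prints 0.125
```

The smooth color-redshift relation at low redshift is what makes such a simple interpolation competitive with template fitting.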
At higher redshifts, achieving accurate results becomes more difficult, because the 4000 Å break is shifted redward of the optical, galaxies are fainter and thus spectral data are sparser, and galaxies intrinsically evolve over time. While supervised learning has been successfully used, beyond the spectral regime the obvious limitation arises that, in order to reach the limiting magnitude of the photometric portions of surveys, extrapolation would be required. In this regime, or where only small training sets are available, template-based results can be used, but without spectral information the templates themselves are being extrapolated; however, the extrapolation of templates can be done in a more physically motivated manner. It is likely that a more general hybrid method, using empirical data to iteratively improve the templates, or a semi-supervised procedure, will ultimately provide a more elegant solution. Another issue at higher redshift is that the available number of objects can become quite small (in the hundreds or fewer), reintroducing the curse of dimensionality through a simple lack of objects compared to the number of measured wavebands.
Methods of dimension reduction can help to mitigate this effect.
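As a sketch of such dimension reduction, the snippet below projects a small sample measured in many wavebands onto its first few principal components (PCA via SVD). The magnitudes are random stand-ins, used only to show the mechanics.

```python
# Toy dimension reduction with PCA: 100 objects measured in 8 wavebands are
# projected onto their 3 leading principal components, shrinking the feature
# space a subsequent photo-z method must populate with training objects.
import numpy as np

rng = np.random.default_rng(0)
mags = rng.normal(size=(100, 8))       # stand-in magnitudes: 100 objects, 8 bands

centered = mags - mags.mean(axis=0)    # PCA requires mean-centered data
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:3].T          # keep the 3 leading components

print(reduced.shape)  # prints (100, 3)
```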